Sandra Gollin Kies, Department of Languages and Literature, Benedictine University
Daniel Kies, Department of English, College of DuPage
Cameron Nazer Mozafari, Department of English, University of Maryland, College Park
The following tips are summarized from Developing Linguistic Corpora: a Guide to Good Practice by John Sinclair.
Developing linguistic corpora: a guide to good practice
http://ota.ox.ac.uk/documents/creating/dlc/index.htm
In-depth advice from experts in the field of corpus linguistics
Practical guide to building a corpus (appendix to chapter 1 above)
A corpus is a collection of pieces of language text in electronic form selected according to external criteria to represent, as far as possible, a language or language variety as a source of data for linguistic research (Sinclair).
The appropriate size of a corpus depends on its purpose. A very large corpus such as COCA (the Corpus of Contemporary American English), which runs to over 400,000,000 words, can be used to make generalizations about the language and can serve as a reference corpus for smaller corpora.
A small corpus of a few thousand words, such as a corpus of student essays or teacher-student spoken interactions, while not generalizable, can provide very useful information to a researcher or teacher, especially when compared against a larger reference corpus.
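Comparing a small corpus against a reference corpus usually means normalizing raw counts, since the two differ greatly in size. The following is a minimal sketch of that idea, not a method taken from the source: the tokenizer is deliberately crude, and `small_text` and `reference_text` are hypothetical toy strings standing in for real corpus files.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def per_million(counts, word):
    """Frequency of `word` normalized to occurrences per million tokens."""
    total = sum(counts.values())
    return counts[word] / total * 1_000_000 if total else 0.0


# Hypothetical toy data; in practice these would be read from text files.
small_text = "I think the essay shows that I think clearly"
reference_text = "the cat sat on the mat and the dog sat on the rug I think"

small = Counter(tokenize(small_text))
reference = Counter(tokenize(reference_text))

# Print each word with its normalized frequency in both corpora.
for word in ("think", "the"):
    print(word, round(per_million(small, word)), round(per_million(reference, word)))
```

A word whose normalized frequency is much higher in the small corpus than in the reference corpus is a candidate for closer study. With samples this tiny the figures mean nothing in themselves; they only show the calculation.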
There are many large and smaller corpora available online (see examples above), so it may not be necessary to collect your own texts. Where you want access to rare or highly specialized texts, you may need to build your own.
For example, Biber et al. (2004) collected a large corpus of university language, both spoken and written. Speakers (in student study groups, lectures, etc.) wore microphones for a set period, and the speech events were transcribed as text files. Documents such as syllabi and textbooks were collected and converted to text files.
Daniel Kies collected an 8-million-word corpus of first-year composition essays (with student permission) over a 30-year period. The earlier texts had to be retyped from paper copies. Later texts were submitted electronically and were easier to process.
One limitation of corpus data is the lack of context: the analyst must rely on co-text and inter-text (repeated occurrences across many different independent texts) (Stubbs 2001, in Swales 2006, p. 24).
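Co-text is commonly inspected through a keyword-in-context (KWIC) concordance, which lines up every occurrence of a word with the words around it. The sketch below is one simple way to build such a display, offered as an illustration only: the function name `kwic`, the `width` parameter, and the sample sentence are all assumptions made for this example, and the window ignores sentence boundaries.

```python
import re


def kwic(text, keyword, width=3):
    """Return (left, node, right) tuples for each occurrence of `keyword`.

    `width` is the number of tokens of co-text kept on each side.
    """
    tokens = re.findall(r"\w+", text.lower())
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines


# Hypothetical sample text; a real corpus would supply many independent texts.
sample = "The bank raised rates. We sat on the river bank all day."

for left, node, right in kwic(sample, "bank"):
    print(f"{left:>20} | {node} | {right}")
```

Reading down the aligned column shows how the surrounding words distinguish different uses of the same form, which is the kind of evidence co-text provides when the original situation is unavailable.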