Sandra Gollin Kies
Department of Languages and Literature
Benedictine University



Daniel Kies
Department of English
College of DuPage



Cameron Nazer Mozafari
Department of English
U of Maryland, College Park



Corpus-Based Approaches to Writing: CCCC 2016 Workshop



Methods

Tips on developing a corpus

The following tips are summarized from John Sinclair's chapter 1 of Developing Linguistic Corpora: a Guide to Good Practice.

Developing linguistic corpora: a guide to good practice

http://ota.ox.ac.uk/documents/creating/dlc/index.htm

In-depth advice from experts in the field of corpus linguistics

Practical guide to building a corpus (appendix to chapter 1 above)

http://ota.ox.ac.uk/documents/creating/dlc/appendix.htm

What is a corpus?

“A corpus is a collection of pieces of language text in electronic form selected according to external criteria to represent, as far as possible, a language or language variety as a source of data for linguistic research” (Sinclair).

What a corpus is not.

What is the optimum size of a corpus?

It depends on the purpose of the corpus. A very large corpus such as the Corpus of Contemporary American English (COCA), which runs to over 400,000,000 words, can be used to make generalizations about the language and can serve as a reference corpus for smaller corpora.

A small corpus of a few thousand words, such as a corpus of student essays or teacher-student spoken interactions, while not generalizable, can provide very useful information to a researcher or teacher, especially when compared against a larger reference corpus.
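Because a small corpus and a reference corpus differ so much in size, raw counts cannot be compared directly; a common approach is to normalize each count to a rate per million words. The following is a minimal sketch of that idea in Python. The function names, the simple tokenizer, and the two sample texts are invented for illustration and do not come from any real corpus.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (a deliberately simple tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())


def per_million(count, total):
    """Normalize a raw count to a rate per million words."""
    return count / total * 1_000_000


def compare(word, small_tokens, reference_tokens):
    """Return the per-million rate of `word` in the small corpus and in the reference corpus."""
    small_counts = Counter(small_tokens)
    reference_counts = Counter(reference_tokens)
    return (
        per_million(small_counts[word], len(small_tokens)),
        per_million(reference_counts[word], len(reference_tokens)),
    )


# Invented sample texts standing in for a student corpus and a reference corpus.
student_text = "I think the essay is very good. I think it is very clear."
reference_text = "The committee will review the report. Some members think the report is clear."

student_rate, reference_rate = compare(
    "think", tokenize(student_text), tokenize(reference_text)
)
print(f"student: {student_rate:.0f} per million; reference: {reference_rate:.0f} per million")
```

With real data, the texts would be far longer, and a researcher would usually follow the normalized rates with a significance measure before drawing conclusions.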

Finding texts for analysis

There are many large and smaller corpora available online (see examples above), so it may not be necessary to collect your own texts. Where you want access to rare or highly specialized texts, you may need to build your own.

For example, Biber et al. (2004) collected a large corpus of university language, spoken and written. Speakers (in student study groups, lectures, etc.) wore microphones for a set period, and the speech events were transcribed as text files. Documents such as syllabi and textbooks were collected and converted to text files.

Daniel Kies collected an 8-million-word corpus of first-year composition essays (with student permission) over a 30-year period. The earlier texts had to be retyped from paper copies. Later texts were submitted electronically and were easier to process.
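Once texts have been saved as plain-text files, a short script can gather them into one collection and report the size of the corpus. The sketch below assumes a folder of UTF-8 encoded .txt files; the folder name "essays" and the function names are hypothetical, and the word count uses a very simple definition of a word.

```python
import re
from pathlib import Path


def load_corpus(folder):
    """Read every .txt file in `folder` into a dict mapping file name to text."""
    return {
        path.name: path.read_text(encoding="utf-8")
        for path in sorted(Path(folder).glob("*.txt"))
    }


def corpus_size(corpus):
    """Count word tokens across all texts (a word is any run of letters or digits)."""
    return sum(len(re.findall(r"\w+", text)) for text in corpus.values())


if __name__ == "__main__":
    # "essays" is a hypothetical folder name; replace it with your own.
    corpus = load_corpus("essays")
    print(f"{len(corpus)} texts, {corpus_size(corpus)} words")
```

Files in other formats (word-processor documents, PDFs, web pages) would first need to be converted to plain text, which this sketch does not attempt.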

Preparing your texts for analysis

Some difficulties and limitations of corpus linguistics

Lack of context: the analyst needs to rely instead on co-text and inter-text (repeated occurrences across many different independent texts) (Stubbs 2001, cited in Swales 2006, p. 24).
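One common way to inspect co-text is a keyword-in-context (KWIC) concordance, which lists every occurrence of a word together with the words on either side of it. The following is a minimal sketch in Python; the function name, the window width, and the sample sentence are invented for illustration, and the context window here runs across sentence boundaries.

```python
import re


def kwic(text, keyword, width=3):
    """Return (left, keyword, right) tuples with `width` words of co-text on each side."""
    tokens = re.findall(r"\w+", text.lower())
    target = keyword.lower()
    lines = []
    for i, token in enumerate(tokens):
        if token == target:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, token, right))
    return lines


# Invented sample text with two senses of "bank".
sample = "The bank raised interest rates. We sat on the river bank at noon."

for left, word, right in kwic(sample, "bank", width=2):
    # Right-align the left co-text so the keyword lines up in one column.
    print(f"{left:>12}  {word}  {right}")
```

Reading down the aligned column of co-text is what lets the analyst see, for instance, that "bank" appears with "interest" in one line and "river" in another.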

Next, let's continue by reviewing some examples.