Corpus-Based Approaches to Writing: CCCC 2016 Workshop

Methods

Tips on developing a corpus

The following tips are summarized from John Sinclair's opening chapter in Developing linguistic corpora: a guide to good practice.

Developing linguistic corpora: a guide to good practice

http://ota.ox.ac.uk/documents/creating/dlc/index.htm

In-depth advice from experts in the field of corpus linguistics

Practical guide to building a corpus (appendix to chapter 1 above)

http://ota.ox.ac.uk/documents/creating/dlc/appendix.htm

What is a corpus?

“A corpus is a collection of pieces of language text in electronic form selected according to external criteria to represent, as far as possible, a language or language variety as a source of data for linguistic research” (Sinclair).

What a corpus is not.

The World Wide Web (ever-changing, dimensions unknown, scope of sample is not clear)
An archive (not purposefully gathered – selected after the fact)
A collection of citations (not complete texts)
A collection of quotations (not complete texts)

What is the optimum size of a corpus?

It depends on the purpose of the corpus. A very large corpus such as the Corpus of Contemporary American English (COCA), which runs to over 400,000,000 words, can be used to make generalizations about the language and can serve as a reference corpus for smaller corpora.

A small corpus of a few thousand words, such as a corpus of student essays or teacher-student spoken interactions, while not generalizable, can provide very useful information to a researcher or teacher, especially when compared against a larger reference corpus.
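Because a small corpus and a reference corpus differ so much in size, raw counts cannot be compared directly; a common approach is to normalize each count to a rate per million words. The following is a minimal sketch of that idea, not a method prescribed by the workshop. The function names, the simple tokenizer, and the two toy strings are all assumptions made up for illustration.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())

def per_million(count, total_tokens):
    """Normalize a raw count to occurrences per million words."""
    if total_tokens == 0:
        return 0.0
    return count * 1_000_000 / total_tokens

def compare_frequency(word, study_text, reference_text):
    """Return the per-million rate of `word` in the study and reference texts."""
    study = tokenize(study_text)
    reference = tokenize(reference_text)
    return (
        per_million(Counter(study)[word], len(study)),
        per_million(Counter(reference)[word], len(reference)),
    )

# Toy strings stand in for real corpora (invented for illustration only).
study_text = "I think the essay shows that I think clearly."
reference_text = "The report shows that the results are clear and the method is sound."

study_rate, reference_rate = compare_frequency("think", study_text, reference_text)
print(f"study: {study_rate:.0f} per million; reference: {reference_rate:.0f} per million")
```

With real data, the two strings would be replaced by the full text of the small corpus and of the reference corpus. Rates from a corpus of only a few thousand words are unstable, so they are best read as pointers to passages worth examining rather than as firm findings.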

Finding texts for analysis

There are many large and smaller corpora available online (see examples above), so it may not be necessary to collect your own texts. If you want access to rare or highly specialized texts, however, you may need to build your own corpus.

For example, Biber et al. (2004) collected a large corpus of university language, spoken and written. Speakers (in student study groups, lectures, etc.) wore microphones for a set period, and the speech events were transcribed as text files. Documents such as syllabi and textbooks were collected and converted to text files.

Daniel Kies collected an 8-million-word corpus of first-year composition essays (with student permission) over a 30-year period. The earlier texts had to be retyped from paper copies. Later texts were submitted electronically and were easier to process.

Preparing your texts for analysis

Make a secure copy of the original source texts.
Convert all texts to plain text format (.txt file extension).
Clean up each file, removing personal identifying information and extraneous matter (for example, Works Cited lists might not be wanted at the end of student essays).
Keep each text in a separate file, and give it a unique code.
Make a list of the codes, linked to the source texts, and provide any contextual notes you may need with this list (e.g. age, gender, first language of speakers/writers, genre, sub-topic, etc.).
Make a backup copy of all your cleaned files, and store it separately (e.g. remotely in the cloud, or on a thumb drive).
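The preparation steps above can be sketched as a short script. This is an illustration under stated assumptions, not a definitive workflow: it assumes the texts are already saved as .txt files, treats e-mail addresses as the only personal information to mask, and cuts each text at a line reading "Works Cited". The function names, the folder arguments, and the ESSAY0001-style codes are hypothetical.

```python
import csv
import re
import shutil
from pathlib import Path

# Assumed patterns: real projects will need their own rules for personal data.
EMAIL = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b")
WORKS_CITED = re.compile(r"^[ \t]*Works Cited[ \t]*$", re.MULTILINE)

def clean_text(text):
    """Mask e-mail addresses and drop everything from a 'Works Cited' line on."""
    text = EMAIL.sub("[EMAIL]", text)
    match = WORKS_CITED.search(text)
    if match:
        text = text[:match.start()]
    return text.strip() + "\n"

def prepare_corpus(source_dir, backup_dir, clean_dir, index_path, prefix="ESSAY"):
    """Back up the originals, write cleaned coded copies, and list the codes."""
    source_dir, backup_dir = Path(source_dir), Path(backup_dir)
    clean_dir, index_path = Path(clean_dir), Path(index_path)

    # Secure copy of the untouched source texts.
    shutil.copytree(source_dir, backup_dir, dirs_exist_ok=True)
    clean_dir.mkdir(parents=True, exist_ok=True)

    rows = []
    for number, path in enumerate(sorted(source_dir.glob("*.txt")), start=1):
        code = f"{prefix}{number:04d}"  # one unique code per text
        cleaned = clean_text(path.read_text(encoding="utf-8"))
        (clean_dir / f"{code}.txt").write_text(cleaned, encoding="utf-8")
        # 'notes' is left blank for contextual details such as genre or first language.
        rows.append({"code": code, "source_file": path.name, "notes": ""})

    # The list of codes, linked to the source texts.
    with open(index_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=["code", "source_file", "notes"])
        writer.writeheader()
        writer.writerows(rows)
    return rows
```

A call such as prepare_corpus("originals", "originals_backup", "cleaned", "codes.csv") would leave the source folder untouched. Since the list of codes links back to the original file names, it may be wise to store it apart from the cleaned files, and the cleaned folder still needs its own separate backup.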

Some difficulties and limitations of corpus linguistics

Lack of context: a corpus strips texts of their original setting, so the analyst needs to rely on co-text (the words surrounding an item) and inter-text (repeated occurrences across many different independent texts) (Stubbs 2001, in Swales 2006, p. 24).
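One common way to inspect co-text and inter-text together is a keyword-in-context (KWIC) concordance, which lines up every occurrence of a word with the text on either side of it. The sketch below is a minimal illustration rather than a replacement for a dedicated concordancer. The function name kwic, the width parameter, and the sample texts are assumptions invented for this example.

```python
import re

def kwic(texts, keyword, width=30):
    """Return keyword-in-context lines for `keyword` across several texts.

    `texts` maps a text code to its content; `width` is the number of
    characters of co-text kept on each side of the keyword.
    """
    pattern = re.compile(rf"\b{re.escape(keyword)}\b", re.IGNORECASE)
    lines = []
    for code, text in texts.items():
        flat = " ".join(text.split())  # collapse line breaks and extra spaces
        for match in pattern.finditer(flat):
            left = flat[max(0, match.start() - width):match.start()]
            right = flat[match.end():match.end() + width]
            # Right-align the left co-text so the keywords line up in a column.
            lines.append((code, left.rjust(width), match.group(0), right))
    return lines

# Invented sample texts, keyed by hypothetical text codes.
sample = {
    "T1": "The corpus is small. A corpus can grow.",
    "T2": "No match here.",
}
for code, left, hit, right in kwic(sample, "corpus", width=10):
    print(f"{code}  {left}[{hit}]{right}")
```

Hits that recur across many independent text codes give some evidence of inter-text, while the left and right columns show the immediate co-text of each occurrence.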

Sandra Gollin Kies, Department of Languages and Literature, Benedictine University
Daniel Kies, Department of English, College of DuPage
Cameron Nazer Mozafari, Department of English, U of Maryland, College Park