Corpus-Based Approaches to Writing: CCCC 2016 Workshop

Corpora

Free corpora

These corpora can be searched freely online. Each comes with its own search interface and its own set of features. Some of the websites offer searches in more than one corpus.

Probably the most comprehensive list of large corpora comes to us through the Brigham Young University site: http://corpus.byu.edu/

English	# words	language/dialect	time period	compare
Hansard Corpus (British Parliament)	1.6 billion	British	1803-2005
Wikipedia Corpus (with virtual corpora)	1.9 billion	English	-2014
Global Web-Based English (GloWbE)	1.9 billion	20 countries	2012-13
Corpus of Contemporary American English (COCA)	520 million	American	1990-2015	* * * * *
Corpus of Historical American English (COHA)	400 million	American	1810-2009	* *
TIME Magazine Corpus	100 million	American	1923-2006
Corpus of American Soap Operas	100 million	American	2001-2012	*
British National Corpus (BYU-BNC)*	100 million	British	1980s-1993	* *
Strathy Corpus (Canada)	50 million	Canadian	1970s-2000s

Oxford English Dictionary (OED): http://www.oed.com/ (not free, but freely accessible to students and faculty through many academic libraries)

Oxford makes the OED available online by subscription or on CD-ROM. It remains the authoritative source of lexical, semantic, phonetic, and etymological information on the English language.

PIE (Phrases in English): http://phrasesinenglish.org/

Web interface based on BNC phrases. Search for frequently co-occurring word sequences (word clusters) of 2 to 8 words. Search all clusters of a particular length, or only clusters containing a particular word, phrase, or part of speech. Results include cluster lists with frequency statistics and key word in context (KWIC) concordances of the clusters.
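
To make the idea of a word cluster concrete, here is a minimal Python sketch that counts contiguous word sequences of a given length in a text and optionally keeps only those containing a particular word. It is an illustration only, not how PIE itself works: the function name word_clusters, the simple tokenizer, and the sample sentence are all assumptions made for this example.

```python
import re
from collections import Counter

def word_clusters(text, n=2, contains=None):
    """Count contiguous n-word clusters (n-grams) in a text.

    Hypothetical helper for illustration; real corpus tools use
    far more careful tokenization and much larger data.
    """
    # Naive tokenizer: lowercase words, keeping internal apostrophes.
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    # Slide a window of n tokens across the text.
    grams = (tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    if contains is not None:
        # Keep only clusters that include the requested word.
        grams = (g for g in grams if contains.lower() in g)
    return Counter(grams)

# Made-up sample text, not drawn from the BNC.
sample = "The cat sat on the mat. The cat slept on the mat."

# Print the three most frequent two-word clusters with their counts.
for cluster, freq in word_clusters(sample, n=2).most_common(3):
    print(" ".join(cluster), freq)
```

Calling word_clusters(sample, n=3, contains="slept") would restrict the list to three-word clusters that include "slept", which mirrors the "clusters containing a particular word" search described above.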

WebCorp: http://wse1.webcorp.org.uk/

Uses the entire Web as the corpus (basis: Google). Search by word, phrase, or wildcard. Offers KWIC concordances, word lists, and some good advanced features. Disadvantage: not language-specific.

Literary text, media, and other archives

The archives listed below offer a variety of texts and smaller corpora for download. To search them with corpus analysis methods, you will normally need an offline text/corpus analysis tool, i.e. a concordancer. Alternatively, you may be able to carry out some simple analyses with online text analysis tools.

American Rhetoric Project: http://www.americanrhetoric.com/
More than 5000 full-text, audio, and (streaming) video versions of public speeches, sermons, legal proceedings, lectures, debates, interviews, and other recorded media events.
Media Archive: https://archive.org/
A digital library of Internet sites and other cultural artifacts in digital form (text, audio, video).
Online Books Page: http://onlinebooks.library.upenn.edu/
(University of Pennsylvania) – literary texts. Free access to texts in different formats (meta search in a number of archives).
Project Gutenberg: https://www.gutenberg.org/
Over 50,000 free books including literary and other works out of copyright.
Literary web concordances: http://www.concordancesoftware.co.uk/webconcordances/
Literary texts. Free online search (concordances and a range of interesting features).

Corpora for language learning

Corpora for learning: http://www.corpora4learning.net/

Links and references for the use of corpora, corpus linguistics and corpus analysis in the context of language learning and teaching. It also links to ongoing research and development projects.

Compleat Lexical Tutor: http://www.lextutor.ca/

('text-based concordances' section) - analyze your own text: builds a KWIC concordance for each word in the text. See also the 'phrase extractor' section to build a concordance of word clusters.

Sandra Gollin Kies Department of Languages and Literature Benedictine University	Daniel Kies Department of English College of DuPage	Cameron Nazer Mozafari Department of English U of Maryland, College Park

Next, let's discuss some methods for using the tools and the corpora.