Corpus-Based Approaches to Writing: CCCC 2016 Workshop

Tools

Linguists have prepared a wide range of free corpus analysis tools. Chief among them are web sites that combine corpora with the analytical tools needed to explore those corpora.

Web sites that contain both corpora and tools for the analysis of corpora

CORPORA — corpus.byu.edu

CORPORA, produced by Mark Davies at Brigham Young University, offers a host of corpora, ranging in size from 45 million to 1.9 billion words, and a host of tools with which we can query the data.

Figure 2: corpus.byu.edu

Landing page, displaying the corpora available.

After we choose a corpus for investigation, an appropriate set of search tools appears:

Figure 3: corpus.byu.edu/coca

Landing page, after choosing the COCA corpus.

Below we see the results screen after choosing the COMPARE tool and searching for each token of toward and towards. Note the small column of ? icons, which provide contextualized help for each tool choice or option.

Figure 4: corpus.byu.edu/coca

Results, comparing toward and towards in COCA
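
For readers who want to try the same kind of comparison on their own texts, here is a minimal Python sketch. It is not how COCA's COMPARE tool works internally; the function names and the sample sentence are invented for illustration. It counts whole-word tokens of each variant and normalizes the counts to a rate per million words, so that texts of different sizes can be compared.

```python
import re
from collections import Counter

def tokenize(text):
    """Split text into lowercase word tokens (a deliberately crude tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())

def variant_counts(text, variants):
    """Count whole-word, case-insensitive occurrences of each variant."""
    counts = Counter(tokenize(text))
    return {v: counts[v] for v in variants}

def per_million(count, total_tokens):
    """Normalize a raw count to a rate per million tokens."""
    return count / total_tokens * 1_000_000

# Hypothetical sample text; replace with the contents of your own file.
sample = ("She walked toward the door. He leaned towards the window, "
          "then turned toward her.")

counts = variant_counts(sample, ["toward", "towards"])
total = len(tokenize(sample))
for word, n in counts.items():
    print(word, n, round(per_million(n, total)))
```

With a sample this small the normalized rates mean little; they become useful once the text runs to many thousands of words.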

Cameron's demonstration: Working with COCA

Did you know?

We can use a search engine to do simple text queries. Using Boolean operators, we can ask the web whether toward or towards appears more commonly. Compare the results for toward (2180 hits) and towards (1570 hits). See SotA for a discussion of the difference between these results and COCA's.

Did you also know?

A more complex example lets us compare the history of the periphrastic and inflectional genitives using COCA and COHA.

MICASE – Michigan Corpus of Academic Spoken English

One of the few resources to provide access to spoken English data: 152 transcripts totaling 1,848,364 words. Here we can query words or phrases used by faculty and students, undergrads and grads, native and non-native speakers of English, in a variety of contexts. Some wildcard characters are allowed. Its tools lack the robustness of CORPORA's, but its focus on spoken English among faculty and students makes it valuable. A key word in context (KWIC) utility (with statistics) seems to be the only tool available at this time.

Figure 5: MICASE

The search facility in MICASE
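
To make the idea of a wildcard search concrete, here is a minimal Python sketch. MICASE's actual wildcard syntax is not documented on this page, so the sketch simply assumes that * stands for any run of word characters; the function name and sample sentence are invented for illustration.

```python
import re

def wildcard_to_regex(pattern):
    """Turn a pattern such as 'stud*' into a whole-word regular expression.

    Assumption: '*' matches zero or more word characters. Every other
    character in the pattern is matched literally.
    """
    parts = [re.escape(part) for part in pattern.split("*")]
    return re.compile(r"\b" + r"\w*".join(parts) + r"\b", re.IGNORECASE)

# Hypothetical sample text.
sample = "Students study; a student studied the studies of other students."

hits = wildcard_to_regex("stud*").findall(sample)
print(hits)
```

A query such as toward* would, under the same assumption, retrieve both toward and towards in a single search.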

MICUSP – Michigan Corpus of Upper-Level Student Papers

A KWIC tool (with statistics) for exploring 829 upper-level student papers, written by senior undergraduates and by graduate students in their first three years.

Figure 6: MICUSP

The search facility in MICUSP

The tools above provide powerful software and interesting data sets for investigation. CORPORA even lets users explore about 3000 words of their own data with its tools and compares those results against the findings available in COCA. However, what if we wish to use tools like those to explore our own data sets? Larger data sets? The following free software packages (for Windows and other systems) allow us this freedom.

Web sites that provide free tools for the analysis of corpora

AntConc

A powerful, full-featured concordancing program for text analysis. Below we see a typical results screen for a search for socialism in a collection of texts.

Figure 7: AntConc

AntConc, a free, powerful concordancer

UAM CorpusTool

UAM CorpusTool provides a bevy of tools for the analysis of small or large data sets.

Figure 8: UAM Corpus Tool

UAM CT, a free, powerful concordancer, parser, and statistics generator

Below we see a UAM CT screen after we have incorporated the data files and created four layers for analysis — SFL Transitivity, SFL Mood, Stanford Part of Speech (POS) twice.

Figure 9: UAM Corpus Tool

UAM CT, selected files with 4 layers added

A screen showing the statistics for all words (tokens) in the corpus. Note that function words are NOT included in the search below, which is why the definite and indefinite articles (the and a) do not appear as the most frequent words.

Figure 10: UAM Corpus Tool

UAM CT, search results for search of all words showing frequency
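
The same kind of frequency list can be approximated in a few lines of Python. This is only a sketch under stated assumptions: the stoplist below is a tiny invented sample, not the list UAM CorpusTool uses, and the tokenizer is deliberately crude.

```python
import re
from collections import Counter

# A tiny illustrative stoplist of function words; real tools use longer lists.
FUNCTION_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "was", "that"}

def content_word_frequencies(text, stoplist=FUNCTION_WORDS):
    """Count every word token that is not on the stoplist."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(token for token in tokens if token not in stoplist)

# Hypothetical sample text.
sample = "The hero saved the town, and the town thanked the hero of the day."

freqs = content_word_frequencies(sample)
for word, n in freqs.most_common(3):
    print(word, n)
```

Passing an empty set as the stoplist restores the function words, and the articles return to the top of the list.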

Below, we see the "about-ness" (the keywords) of the HEROS corpus. This corpus is a collection of response essays, written by incoming students at the College of DuPage, giving the students' reactions to a Newsweek article about contemporary heroes. The essays were collected as part of an exercise in local norming of the college's placement exams.

Figure 11: UAM Corpus Tool

UAM CT, search for keyness of all nouns in corpus
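
Keyness compares how often a word occurs in our corpus with how often it occurs in a reference corpus. This page does not say which statistic UAM CorpusTool computes, so the Python sketch below substitutes log-likelihood, one commonly used keyness measure, as an assumption. The counts in the example are invented.

```python
import math

def log_likelihood(freq_study, size_study, freq_ref, size_ref):
    """Log-likelihood keyness of one word.

    freq_study, freq_ref: occurrences of the word in each corpus
    size_study, size_ref: total number of tokens in each corpus
    """
    total_freq = freq_study + freq_ref
    total_size = size_study + size_ref
    # Expected counts if the word were equally common in both corpora.
    expected_study = size_study * total_freq / total_size
    expected_ref = size_ref * total_freq / total_size
    ll = 0.0
    if freq_study > 0:
        ll += freq_study * math.log(freq_study / expected_study)
    if freq_ref > 0:
        ll += freq_ref * math.log(freq_ref / expected_ref)
    return 2 * ll

# Hypothetical counts: a word occurs 50 times in a 1,000-word study corpus
# and 10 times in a 1,000-word reference corpus.
score = log_likelihood(50, 1000, 10, 1000)
print(round(score, 2))
```

A higher score means a larger difference between the two corpora. The score alone does not say which corpus uses the word more, so we still need to compare the relative frequencies.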

Now, we see the nouns in the HEROS corpus in context. This is a typical KWIC output.

Figure 12: UAM Corpus Tool

UAM CT, concordance of all nouns in corpus
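
For those curious about what a KWIC display involves, here is a minimal Python sketch. It is not the code behind UAM CorpusTool or AntConc; the function name, the width parameter, and the sample text are invented for illustration. Each hit is returned with a fixed number of characters of context on either side.

```python
import re

def kwic(text, keyword, width=30):
    """Return (left context, hit, right context) for each whole-word match."""
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    lines = []
    for match in pattern.finditer(text):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        lines.append((left, match.group(0), right))
    return lines

# Hypothetical sample text.
sample = "A hero is brave. My hero is my mother, and every hero needs help."

concordance = kwic(sample, "hero", width=15)
for left, hit, right in concordance:
    # Right-align the left context so the keywords line up in a column.
    print(f"{left:>15} [{hit}] {right}")
```

Aligning the keyword in a central column is what lets us scan the contexts quickly for recurring patterns to its left and right.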

Data for Daniel's demonstration: UAM CorpusTool or AntConc

Conrad's Heart of Darkness: for another demonstration

One of its most valuable features is that UAM CorpusTool provides several different parsers:

Systemic functional linguistics (SFL) mood, transitivity, and theme parsing
Rhetorical structure theory (RST) parsing
Stanford part of speech (POS) parser/tagger

Figure 13: UAM Corpus Tool

UAM CT, a POS parse of the corpus, with tree and table views

Lastly, let's look at the concordancer on papyr.com. The advantage of this tool is that we can upload our own files to the server. The disadvantages are that it needs Java to run, is limited to KWIC queries, and searches only one file at a time.

Sandra Gollin Kies, Department of Languages and Literature, Benedictine University
Daniel Kies, Department of English, College of DuPage
Cameron Nazer Mozafari, Department of English, U of Maryland, College Park

Figure 14: papyr.com/applets/concordancer

A free KWIC concordance with a file upload utility

Next, let's continue with the (free) corpora available to us for exploring language use.
