Sandra Gollin Kies
Department of Languages and Literature
Benedictine University



Daniel Kies
Department of English
College of DuPage



Cameron Nazer Mozafari
Department of English
U of Maryland, College Park



Corpus-Based Approaches to Writing: CCCC 2016 Workshop



Tools

Linguists have prepared a wide range of free corpus analysis tools. Chief among them are sites that combine corpora with the analytical tools needed to query those corpora.



CORPORA — corpus.byu.edu

CORPORA, produced by Mark Davies at Brigham Young University, offers a host of corpora, ranging from 45 million to 1.9 billion words each, and a host of tools with which we can query the data.

Figure 2: corpus.byu.edu

Landing page, displaying the corpora available.

List of available corpora

After choosing a corpus for investigation, an appropriate set of search tools will appear:

Figure 3: corpus.byu.edu/coca

Landing page, after choosing the COCA corpus.

Tool panes, screen layout

Below we see the results screen after choosing the COMPARE tool and searching for each token of toward and towards. Note the small column of ? icons, which provide contextualized help for each tool choice or option.

Figure 4: corpus.byu.edu/coca

Results, comparing toward and towards in COCA

Results screen






Cameron's demonstration: Working with COCA







Did you know?

We can use a search engine to do simple text queries. Using Boolean operators, we can ask the web whether toward or towards appears more commonly. Compare the results for toward (2180 hits) and towards (1570 hits). See SotA for a discussion of the difference between these results and COCA's.

Did you also know?

A more complex example lets us compare the history of the periphrastic and inflectional genitives using COCA and COHA.

MICASE – Michigan Corpus of Academic Spoken English

MICASE is one of the few resources to provide access to spoken English data: 152 transcripts totaling 1,848,364 words. Here we can query words or phrases used by faculty and students, undergraduates and graduate students, and native and non-native speakers of English, in a variety of contexts. Some wildcard characters are allowed. Its tools lack the robustness of CORPORA's, but its focus on spoken English involving students makes it valuable. A key word in context (KWIC) utility (with statistics) seems to be the only tool available at this time.







Figure 5: MICASE

The search facility in MICASE

MICASE




MICUSP – Michigan Corpus of Upper-Level Student Papers

MICUSP offers a KWIC tool (with statistics) for exploring 829 papers by upper-level students (senior undergraduates and three years of graduate students).

Figure 6: MICUSP

The search facility in MICUSP

MICUSP





The tools above provide powerful software and interesting data sets for investigation. CORPORA even lets users explore about 3000 words of their own data with its tools and compares those results against the findings available in COCA. However, what if we wish to use tools like those to explore our own data sets? Larger data sets? The following free software packages (for Windows and other systems) allow us this freedom.





AntConc

AntConc is a powerful, full-featured concordancing program for text analysis. Below we see a typical results screen for a search of socialism in a collection of texts.

Figure 7: AntConc

AntConc, a free, powerful concordancer

AntConc, a free concordancer




UAM CorpusTool

UAM CorpusTool provides a bevy of tools for the analysis of small or large data sets.

Figure 8: UAM Corpus Tool

UAM CT, a free, powerful concordancer, parser, and statistics generator

UAMCT opening screen

Below we see a UAM CT screen after we have incorporated the data files and created four layers for analysis: SFL Transitivity, SFL Mood, and Stanford Part of Speech (POS), the last added twice.

Figure 9: UAM Corpus Tool

UAM CT, selected files with 4 layers added

UAMCT selected files screen

Next is a screen showing the statistics for all words (tokens) in the corpus. Note that function words are NOT searched below, which explains why the definite and indefinite articles (the and a) are not the most frequent words.

Figure 10: UAM Corpus Tool

UAM CT, search results for search of all words showing frequency

UAMCT frequency of all words
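For readers who want to see what such a frequency list involves, here is a minimal Python sketch that counts tokens while skipping function words. It is only an illustration of the idea, not how UAM CorpusTool works internally. The names `FUNCTION_WORDS`, `tokenize`, and `content_word_frequencies` are hypothetical, and the stop list is deliberately tiny; a real study would use a fuller one.

```python
import re
from collections import Counter

# Hypothetical, deliberately small stop list (an assumption for this sketch).
FUNCTION_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "that", "it"}


def tokenize(text):
    """Lowercase the text and split it into simple alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())


def content_word_frequencies(text, stop_words=FUNCTION_WORDS):
    """Return (word, count) pairs, most frequent first, ignoring stop words."""
    tokens = [t for t in tokenize(text) if t not in stop_words]
    return Counter(tokens).most_common()


if __name__ == "__main__":
    sample = "The hero saved the day. A hero is brave."
    for word, count in content_word_frequencies(sample):
        print(f"{word}\t{count}")
```

Because the articles are filtered out before counting, a content word such as hero rises to the top of the list, which mirrors what we see in the figure above.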

Below, we see the "about-ness" (the keywords) of the HEROS corpus. (This corpus is a collection of response essays, written by incoming students at the College of DuPage, giving the students' reactions to a Newsweek article about contemporary heroes. The essays were part of an exercise in local norming of the college's placement exams.)

Figure 11: UAM Corpus Tool

UAM CT, search for keyness of all nouns in corpus
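Keyness compares how often a word occurs in the corpus under study with how often it occurs in a reference corpus. One widely used measure is the log-likelihood statistic; the sketch below assumes that measure for illustration and makes no claim about the calculation UAM CorpusTool itself performs. The function names `log_likelihood` and `keywords` are hypothetical.

```python
import math
from collections import Counter


def log_likelihood(a, b, c, d):
    """Log-likelihood keyness score.

    a = count of the word in the study corpus (total size c)
    b = count of the word in the reference corpus (total size d)
    """
    e1 = c * (a + b) / (c + d)  # expected count in the study corpus
    e2 = d * (a + b) / (c + d)  # expected count in the reference corpus
    ll = 0.0
    if a > 0:
        ll += a * math.log(a / e1)
    if b > 0:
        ll += b * math.log(b / e2)
    return 2 * ll


def keywords(study_tokens, reference_tokens, top_n=10):
    """Rank words that are relatively more frequent in the study corpus."""
    study = Counter(study_tokens)
    ref = Counter(reference_tokens)
    c, d = sum(study.values()), sum(ref.values())
    scored = []
    for word, a in study.items():
        b = ref.get(word, 0)
        if a / c > b / d:  # keep positive keywords only
            scored.append((word, log_likelihood(a, b, c, d)))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]
```

A word that is equally frequent in both corpora scores zero; the more a word is over-represented in the study corpus, the higher its score, which is why keywords give a rough picture of what a corpus is about.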


Now, we see the nouns in the HEROS corpus in context. This is a typical KWIC output.

Figure 12: UAM Corpus Tool

UAM CT, concordance of all nouns in corpus
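To make the idea of KWIC output concrete, here is a minimal Python sketch of a key-word-in-context search over plain text. It is a simplified illustration under stated assumptions (whole-word, case-insensitive matching and a fixed character window), not a description of how UAM CorpusTool or AntConc is implemented. The names `kwic` and `format_kwic` are hypothetical.

```python
import re


def kwic(text, keyword, width=30):
    """Return (left, hit, right) tuples for each whole-word match of keyword."""
    text = " ".join(text.split())  # collapse line breaks and repeated spaces
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    lines = []
    for m in pattern.finditer(text):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        lines.append((left, m.group(0), right))
    return lines


def format_kwic(lines, width=30):
    """Right-align the left context so the key words line up in one column."""
    return "\n".join(f"{left:>{width}} {hit} {right}" for left, hit, right in lines)


if __name__ == "__main__":
    sample = "A hero helps. Every hero matters."
    print(format_kwic(kwic(sample, "hero", width=10), width=10))
```

Aligning the key word in a central column is what lets us scan the left and right contexts for recurring patterns.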


One of UAM CorpusTool's most valuable features is its choice of several different parsers:

Data for Daniel's demonstration: UAM CorpusTool or AntConc


Conrad's Heart of Darkness: for another demonstration

Figure 13: UAM Corpus Tool

UAM CT, a POS parse of the corpus, with tree and table views

UAMCT tree view of corpus parsing

UAMCT table parsing of corpus

Lastly, let's look at the concordancer on papyr.com. The advantage of this tool is that we can upload our own files to the server. The disadvantages are that it needs Java to run, it is limited to KWIC queries, and it searches only one file at a time.

Figure 14: papyr.com/applets/concordancer

A free KWIC concordancer with a file upload utility


Next, let's continue with the (free) corpora available to us for exploring language use.