Sandra Gollin Kies, Department of Languages and Literature, Benedictine University
Daniel Kies, Department of English, College of DuPage
Cameron Nazer Mozafari, Department of English, University of Maryland, College Park
Linguists have prepared a wide range of free corpus analysis tools. Chief among them are the resources that combine corpora with the analytical tools needed to query those corpora.
CORPORA, produced by Mark Davies at Brigham Young University, offers a host of corpora, ranging from 45 million to 1.9 billion words each, and a host of tools with which we can query the data.
Figure 2: corpus.byu.edu. Landing page, displaying the corpora available.
After we choose a corpus for investigation, an appropriate set of search tools appears:
Figure 3: corpus.byu.edu/coca. Landing page, after choosing the COCA corpus.
Below we see the results screen after choosing the COMPARE tool and searching for each token of toward and towards. Note the small column of ? icons, which provide contextualized help for each tool choice or option.
Figure 4: corpus.byu.edu/coca. Results, comparing toward and towards in COCA.
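For readers who want to try a similar comparison on their own text, here is a minimal Python sketch. It is not how the COCA interface works internally; it simply counts whole-word tokens of each variant and normalizes the counts per million tokens so that texts of different sizes can be compared. The function name `compare_variants`, the simple tokenizer, and the sample sentence are all assumptions made for illustration.

```python
import re
from collections import Counter

def compare_variants(text, variants):
    """Count each variant as a whole word (case-insensitive) and
    report the raw count plus the frequency per million tokens."""
    # Assumption: a token is a run of letters or apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    total = len(tokens)
    return {
        v: {
            "count": counts[v],
            "per_million": counts[v] / total * 1_000_000 if total else 0.0,
        }
        for v in variants
    }

# Hypothetical sample text, invented for this example.
sample = "She walked toward the door, then towards the window, then toward me."
result = compare_variants(sample, ["toward", "towards"])
for variant, stats in result.items():
    print(variant, stats["count"], round(stats["per_million"]))
```

Normalizing per million words matters once we compare corpora of different sizes, since raw counts alone would favor the larger corpus.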
MICASE (the Michigan Corpus of Academic Spoken English) is one of the few resources to provide access to spoken English data: 152 transcripts, totaling 1,848,364 words. Here we can query words or phrases used by faculty and students, undergrads and grads, non-native and native speakers of English, in a variety of contexts. Some wildcard characters are allowed. Its tools lack the robustness of CORPORA's, but its focus on the spoken English of students and faculty makes it valuable. A key word in context (KWIC) utility (with statistics) seems to be the only tool available at this time.
Figure 5: MICASE. The search facility in MICASE.
MICUSP (the Michigan Corpus of Upper-level Student Papers) offers a KWIC tool (with statistics) for exploring 829 upper-level student papers, written by senior undergraduates and by graduate students in their first three years.
Figure 6: MICUSP. The search facility in MICUSP.
The tools above provide powerful software and interesting data sets for investigation. CORPORA even lets users explore about 3,000 words of their own data with its tools and compare those results against the findings available in COCA. But what if we wish to use tools like these to explore our own, larger data sets? The following free software packages (for Windows and other systems) give us that freedom.
AntConc is a powerful, full-featured concordancing program for text analysis. Below we see a typical results screen for a search of socialism in a collection of texts.
Figure 7: AntConc. AntConc, a free, powerful concordancer.
UAM CorpusTool provides a bevy of tools for the analysis of small or large data sets.
Figure 8: UAM Corpus Tool. UAM CT, a free, powerful concordancer, parser, and statistics generator.
Below we see a UAM CT screen after we have incorporated the data files and created four layers for analysis — SFL Transitivity, SFL Mood, Stanford Part of Speech (POS) twice.
Figure 9: UAM Corpus Tool. UAM CT, selected files with 4 layers added.
Next, we see a screen showing the statistics for all words (tokens) in the corpus. Note that function words are NOT searched below, which explains why the definite and indefinite articles (the and a) are not the most frequent words.
Figure 10: UAM Corpus Tool. UAM CT, search results for all words, showing frequency.
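The effect of leaving out function words can be sketched in a few lines of Python. This is only an illustration of the idea, not UAM CorpusTool's implementation: the tiny `FUNCTION_WORDS` stoplist, the function name `content_word_frequencies`, and the sample text are all assumptions, and a real study would use a much fuller list.

```python
import re
from collections import Counter

# Assumption: a deliberately small, hypothetical stoplist of function words.
FUNCTION_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "are", "that", "it"}

def content_word_frequencies(text, stoplist=FUNCTION_WORDS):
    """Return a frequency count of tokens, skipping any word in the stoplist."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(t for t in tokens if t not in stoplist)

# Hypothetical sample text, invented for this example.
sample = "The hero is a person. A hero helps the people in the town."
freqs = content_word_frequencies(sample)
for word, count in freqs.most_common(3):
    print(word, count)
```

With the stoplist applied, content words rise to the top of the list; with an empty stoplist (`stoplist=set()`), the articles would dominate as they do in most English frequency lists.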
Below, we see the "about-ness" (the keywords) of the HEROS corpus. (This corpus is a collection of response essays, giving students' reactions to a Newsweek article about contemporary heroes, written by incoming students at the College of DuPage as part of an exercise in local norming of the college's placement exams.)
Figure 11: UAM Corpus Tool. UAM CT, search for keyness of all nouns in corpus.
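Keyness measures how unusually frequent a word is in a study corpus compared with a reference corpus. One widely used statistic is log-likelihood; whether UAM CorpusTool computes keyness exactly this way is not stated here, so the sketch below should be read as an illustration of the general idea only. The function name and the example frequencies are hypothetical.

```python
import math

def log_likelihood(freq_study, size_study, freq_ref, size_ref):
    """Log-likelihood keyness of one word.

    freq_* are the word's counts; size_* are the total token counts
    of the study corpus and the reference corpus.
    """
    total_freq = freq_study + freq_ref
    total_size = size_study + size_ref
    # Expected counts if the word were equally common in both corpora.
    expected_study = size_study * total_freq / total_size
    expected_ref = size_ref * total_freq / total_size
    ll = 0.0
    for observed, expected in ((freq_study, expected_study), (freq_ref, expected_ref)):
        if observed > 0:  # a zero count contributes nothing to the sum
            ll += observed * math.log(observed / expected)
    return 2 * ll

# Hypothetical counts: a word occurring 50 times in a 1,000-token study
# corpus and 10 times in a 1,000-token reference corpus.
score = log_likelihood(50, 1000, 10, 1000)
print(round(score, 2))
```

A score of zero means the word is equally frequent, relative to corpus size, in both corpora. By convention, a score above 3.84 is treated as significant at p < 0.05.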
Now, we see the nouns in the HEROS corpus in context. This is a typical KWIC output.
Figure 12: UAM Corpus Tool. UAM CT, concordance of all nouns in corpus.
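A KWIC display of this kind is simple to approximate. The following Python sketch searches for a single keyword (rather than all nouns, which would require a part-of-speech tagger) and lines up each hit with a fixed window of context on either side. The function name `kwic`, the `width` parameter, and the sample text are assumptions for illustration; the sketch also assumes the text has no line breaks.

```python
import re

def kwic(text, keyword, width=30):
    """Return one aligned line per whole-word match of keyword,
    with up to `width` characters of context on each side."""
    pattern = r"\b" + re.escape(keyword) + r"\b"
    lines = []
    for m in re.finditer(pattern, text, flags=re.IGNORECASE):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        # Right-align the left context so the keywords form a column.
        lines.append(f"{left:>{width}} [{m.group(0)}] {right:<{width}}")
    return lines

# Hypothetical sample text, invented for this example.
text = "My hero is my mother. A hero helps others."
lines = kwic(text, "hero", width=10)
for line in lines:
    print(line)
```

Padding the left context to a fixed width is what puts the keyword in a single column, which makes recurring patterns to its left and right easy to scan.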
One of UAM CorpusTool's most valuable features is its choice of several different parsers:
Figure 13: UAM Corpus Tool. UAM CT, a POS parse of the corpus, with tree and table views.
Lastly, let's look at the concordancer on papyr.com. The advantage of this tool is that we can upload our own files to the server. The disadvantages are that it needs Java to run, it is limited to KWIC queries, and it searches only one file at a time.
Figure 14: