Corpus research &
your presentations
Daniel Kies
Professor Emeritus College of DuPage August, 2020
To navigate these slides:
1. Click/press the left or right arrows on your screen or keyboard.
2. Swipe left or right on your touchscreen.
Introductory
To say anything sensible about language, we need some facts, some data: the real language of real people. A corpus (a word from
Latin) literally means a body, and in our case we are referring to a large body of real texts from real speakers and/or writers. Several sites have
collected many millions of words for us to investigate how language really works.
In this presentation, we will learn about two free, easy to use, professional-grade research tools for studying large corpora — Google Books Ngram Viewer
and English-corpora.org.
We will conclude this discussion by showing how we can import the information we glean from corpora into our research presentations for this course.
Google Books Ngram Viewer
ngram (or n-gram): a string of n consecutive words (two words, three words, four words, etc.). The frequency of an ngram gives us a measure of
the likelihood that a particular word will appear in such a string of words.
Google Books Ngram Viewer allows us to query Google's collection of more than 20 million digitized
books written between 1500 and 2019, giving us a corpus of billions of words.
Google Books Ngram Viewer has a simple to use interface that still enables enormously powerful searches.
First, a little more about what ngram means. The n stands for a number, of course, and the gram represents the token of study. In the
study of language, the gram could be a phoneme, a morpheme, a word, a string of words, etc. In biology, the gram could be a nucleotide in
a strand of DNA or an amino acid in a protein. In other words, the gram could be anything that occurs in a sequence.
Its significance arises from the fact that the ngram not only expresses the frequency of the token (the gram) we are studying, but also expresses the
strength of the relationship between the units in the grams. This allows us to predict the next gram. Prediction is the holy grail of science.
Why is this important? Ngrams reveal which sequences of two, three, four, or more words are treated as single units. Those revelations have significance for our understanding
of cognition, language learning, and artificial intelligence (especially speech recognition software in human-machine interactions; think talking
robots here, like Alexa, Siri, and HAL from Arthur C. Clarke's 2001: A Space Odyssey, etc.).
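To make the ideas of frequency and prediction concrete, here is a minimal sketch in Python. It is only an illustration: the toy sentence is invented, the function names are hypothetical, and real tools like Ngram Viewer work on billions of words with far more careful processing.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return every contiguous run of n tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def next_word_probabilities(tokens, context):
    """Estimate how likely each word is to follow `context`, from raw counts."""
    context = tuple(context)
    n = len(context) + 1
    # Count only the n-grams that begin with the given context.
    counts = Counter(g for g in ngrams(tokens, n) if g[:-1] == context)
    total = sum(counts.values())
    return {g[-1]: c / total for g, c in counts.items()} if total else {}

# Toy corpus, invented for illustration (not drawn from any real corpus).
text = "the cat sat on the mat and the cat slept on the sofa"
tokens = text.split()

# Frequency: how often does each two-word string (bigram) occur?
bigram_counts = Counter(ngrams(tokens, 2))
print(bigram_counts.most_common(2))

# Prediction: which word is most likely to come after "the"?
print(next_word_probabilities(tokens, ["the"]))
```

In this toy text, "cat" follows "the" in two of the four cases, so it is the best guess for the next word. That is the sense in which ngram frequencies let us predict the next gram.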
Generating Ngrams
On the screen behind the slide is the Google Books Ngram Viewer web page. When we visit the page, we notice that Google has pre-loaded a search for three
names (Albert Einstein, Sherlock Holmes, and Frankenstein) so that we can see immediately what the output of an ngram search looks like.
Using Ngram Viewer is just that easy. Place a list of items we want to compare into the search box separated by a comma, press Enter on the keyboard,
and enjoy! Try the Ngram Viewer for yourself:
Thanks to the Ngram Viewer I now know that there are some changes in usage that I, personally, find pleasing. For example, Microsoft Word always annoyed me
by telling me that the words web and internet should be capitalized. Word's spell checker started doing this from the 1990s, and it drove me to
distraction! Web and internet are not proper nouns, so why capitalize them? The web and the internet are not locations like Chicago or Naperville.
They seem to me more like things, such as radio and television. (#Keep!The!Abstract!Nouns!Common!)
So I confess I had a smirk on my face when the Ngram Viewer confirmed that Internet was still the preferred spelling
but falling dramatically in use, while internet is steadily gaining. And to my further satisfaction, web beats Web
as the preferred usage, even though Web had a huge advantage in the early years of the world wide web.
Notice, by the way, how easily I can place the Ngram result into this presentation as a link. You can do the same with your PowerPoint presentations for your
course. Just create your comparison of the features you are studying in Ngram Viewer, and then copy/paste the URL from your web browser into your PowerPoint slide.
You're done.
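If you would rather build such a link than copy it from the browser, the sketch below shows one way to do it in Python. Please treat the details as assumptions: the address and the parameter names (content, year_start, year_end, smoothing) are modeled on the URLs that appear in the browser's address bar after a search, they are not a documented interface, and Google could change them. The function name is hypothetical.

```python
from urllib.parse import urlencode

def ngram_viewer_url(phrases, year_start=1800, year_end=2019, smoothing=3):
    """Build a shareable Ngram Viewer link for a list of words or phrases.

    Assumption: the base address and parameter names mirror what the
    browser shows after a search; they are not an official API.
    """
    query = {
        "content": ",".join(phrases),  # items to compare, comma-separated
        "year_start": year_start,
        "year_end": year_end,
        "smoothing": smoothing,
    }
    return "https://books.google.com/ngrams/graph?" + urlencode(query)

url = ngram_viewer_url(["internet", "Internet"], year_start=1990)
print(url)
```

Always open the generated link once in a browser before pasting it into a slide, to confirm that it shows the comparison you intended.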
On the Google Books Ngram Viewer web page, you will notice four buttons/drop down lists to control how the Viewer functions. These help us refine our search
more precisely, improving the value of the results of our inquiries.
The first drop-down button allows us to choose the years for our search. As of September 2020, the year limits are
between 1500 and 2019.
The second drop-down button allows us to choose the corpus for our search. The significant options for us include
British vs. American English and fiction vs. non-fiction. For example, America declared political independence from England in 1776, but when did Americans stop using
British English varieties of word choice, spelling, etc.? By selecting the American English 2019 corpus (the latest and largest of the American corpora available), we can
then search for some examples of different spellings or different vocabulary between British and American English to get a sense of
when American English began to dominate over its British English beginnings. As the data show, the American spellings didn't reach equal numbers with the British
English spellings until the first three decades of the 1800s, and the complete dominance of American English spellings did not occur until about 1900. Language
change is slow. It takes generations.
The third button allows us to toggle case-insensitivity on or off for our search. The default is off,
i.e., the search is case-sensitive. In this way, a search for Web will return different results than a search for web. This button becomes valuable when you want to
know the total frequency of a word or phrase, such as however compared to but. Here a case-insensitive search is better, because that word or phrase
might be at the beginning of a sentence and include a capitalized variation.
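The arithmetic behind a case-insensitive total is simple: add together the counts for every capitalization of the same word. Here is a minimal sketch in Python. The counts are invented for illustration and the function name is hypothetical; this is not how Ngram Viewer itself is implemented.

```python
from collections import Counter

def merge_case_variants(counts):
    """Add up the counts of forms that differ only in capitalization."""
    merged = Counter()
    for form, count in counts.items():
        merged[form.lower()] += count
    return merged

# Invented counts, for illustration only (not real corpus data).
case_sensitive_counts = {"however": 120, "However": 80, "but": 900, "But": 300}

totals = merge_case_variants(case_sensitive_counts)
print(totals["however"], totals["but"])
```

A case-sensitive search would report only the lowercase figures and so miss every sentence-initial However and But.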
The fourth drop-down button allows us to control the degree of smoothing for our search. Smoothing refers to the
number of years on either side of each data point to include in a running average: with a smoothing of n (where n refers to a number again), each year's value is averaged
with the values for the n years before it and the n years after it. I can choose a smoothing of 0 to 50 years. The default is 3, unless we explicitly change the number.
To demonstrate its value, let's consider again the default search that displays on Ngram Viewer when we first visit the web site. As a default, Google uses the
search for the names Albert Einstein,Sherlock Holmes,Frankenstein from 1800 to 2019. Now let's add the name Donald Trump to that search:
Albert Einstein,Sherlock Holmes,Frankenstein,Donald Trump. Notice that Donald Trump occurs in the corpus more than any of the other three. Remember,
though, that the default smoothing is 3. That means each year in our search (1800-2019) shows an average over only a few neighboring years. So it makes
sense that Donald Trump should occur so frequently, as the graph shows: he has had a lot of press since winning the election of 2016, and a short running average
preserves that spike. However, if we change the smoothing from its default setting to 50 (a very long running average, a long view through our search's time period),
then Donald Trump ranks last among the four names. And that makes sense too, since a 50-year running average will level out the effects of short-term
spikes in the data, allowing us to recognize that, over time, people have written about the other names more often than about Donald Trump.
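The effect of smoothing can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the window is centered (each year averaged with the years on either side, as the Ngram Viewer information page describes), the window is simply shortened at the ends of the series (my assumption, not documented behavior), and the two series are invented numbers, not real Ngram data.

```python
def smooth(values, smoothing):
    """Running average: each value averaged with `smoothing` neighbors per side.

    Assumption: at the ends of the series the window is simply shortened.
    """
    smoothed = []
    for i in range(len(values)):
        start = max(0, i - smoothing)
        end = min(len(values), i + smoothing + 1)
        window = values[start:end]
        smoothed.append(sum(window) / len(window))
    return smoothed

# Invented yearly frequencies, for illustration only.
steady_name = [2, 2, 2, 2, 2, 2, 2, 2]   # written about every year
recent_spike = [0, 0, 0, 0, 0, 0, 6, 6]  # a burst of press in the last two years

# With no smoothing, the spike towers over the steady name in the final year.
print(smooth(recent_spike, 0)[-1], smooth(steady_name, 0)[-1])

# With a very wide window, the spike is averaged away and the steady name leads.
print(smooth(recent_spike, 50)[-1], smooth(steady_name, 50)[-1])
```

The wide window turns the late burst into a low long-run average, which is the same pattern we see when the smoothing is raised to 50 in the search described above.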
Ngram Viewer help and options
If you would like to learn more about Ngram Viewer and some of its powerful features, click on the icon on the Google Books Ngram
Viewer page, or click the link below:
Ngram Viewer options include:
wildcard searches
inflection searches
case-insensitive searches
part-of-speech searches
ngram combination searches
All five of those options are valuable for English language research, and we have already discussed case-sensitivity. For an example of another Ngram option, I want to ask if you
remember that prescriptivist prohibition against starting a sentence with a conjunction, like and, or, but, because, etc. That prohibition starts with Robert Lowth's
grammar (1762). How effective was that prescriptive rule at changing writers' behaviors? Well,
a part of speech search can show us the facts. By using the part of speech tags mentioned on the Ngram Viewer information page, we can create a search (_START_ _ADV_
compared to _START_ _CONJ_) highlighting occurrences of adverbs (like however) at the beginning of a sentence along with occurrences of conjunctions (like but)
at the beginning of a sentence from 1800 to 2019.
Frankly, those results amazed me. The graph clearly shows how effectively a prescriptive rule like the prohibition against sentence-initial conjunctions changed
writers' real-life practices over time. At the time Lowth wrote his grammar, writers used conjunctions far more often than adverbs sentence-initially. While the use of sentence-initial
adverbs remains relatively unchanged post-Lowth, we can see a steady decline of sentence-initial conjunctions from 1800 to 1990, a far greater decline than I would have
guessed.1
I promise that you will be richly rewarded if you do read that Ngram Viewer information and options page.
1 Please take this footnote as a cautionary tale: remember that the Google Books Ngram Viewer (mostly) contains the contents of books. As we will see in the
next slide's discussion of English-corpora.org, if we explore the texts of journalism (newspapers and news magazines), we will see that this prescriptive prohibition was not
uniform throughout the creation of all edited, written English. Register and genre choices significantly change results.
English-corpora.org tools
The most varied and comprehensive collection of corpora available
A powerful, easy to use search tool, allowing us to view our search results in greater detail and in context
Provides the possibility to search across different corpora
Allows us to narrow and compare our search by time periods, registers, genres, and modality (spoken/written English)
Using English-corpora.org
On the screen behind the slide is the English-corpora.org web page. Among all the corpora available for us to search, the one that best suits our purposes for our
course is the Corpus of Contemporary American English (COCA). When you first visit the COCA home page, you will see a toolbar at the top. One of those icons
offers a tour of the COCA site. I recommend reading that page. You won't regret the time spent.
Below the top toolbar, which holds the tour icon, we see the Search toolbar. This set of options allows us to get greater detail about the searches we do.
The links above the search box give us different options for displaying search results. Below the search box, we see more options to select the corpus we want to
search. The most important option (Sections) allows us to select or compare searches in specific genres or registers. This is a valuable option when our
research topic focuses on works of fiction or journalism or spoken English, etc. It also allows us to compare search results in our genre or register of interest against results in
another genre or register. Thus, we can compare features of spoken English against written English, or fiction against non-fiction, or academic writing against journalism,
and many more. We can also combine genres or registers or time periods, allowing us to narrow our search, for example, to blogs about family-themed movies. A very useful
search feature.
Searches can be done with words or strings of words; however, we can also use the POS options to the right of the search box to augment the search by
adding part of speech tags or punctuation.
After 10-15 queries, English-corpora.org will ask us to register with the web site.
Registration is free. I suggest that we simply register straightaway to avoid having to stop mid-search just to register.
Even after we register, from time to time, we will see a nag screen asking us to consider buying a paid
subscription to the web site. When we see that screen, we might think that our search has been canceled, and we might want to give up or to go back to try again.
However, if we just wait about 20 seconds, a link will appear, allowing us to get our search results.
The easiest way of saving our search results is to highlight the results we want to save and then to copy/paste the results into a PowerPoint or word processing document.
Using your research data
Ngram Viewer: copy and paste the URL of the search result directly into your document
English-corpora.org: highlight the results you want to save and then copy/paste the results into your document
Alternatively: take a screen capture, crop the image, and insert the image into your presentation
Background image credits
Dr. Johnson's Dictionary , Yale University Library https://drjohnsonsdictionary.library.yale.edu/
Feinberg, J. (2014). Wordle (Version 0.2) [Computer software]. Available from http://www.wordle.net/
Feinberg's Wordle is a handy, easy to use program for creating word clouds from any text you feed to Wordle. These word clouds are interesting
graphic representations of the most frequently used words in that text and make excellent graphics for a report or as background images.
Finis.
https://rhetory.com/corpus-research
kiesdan@dupage.edu
Send me an email if you would like some help using these tools or if you would like to ask a question or you would like to continue this discussion.