Bigram – An n-gram consisting of two adjacent words in a text. Examples of bigrams from the Gettysburg Address include "four score," "score and," and "and seven."
Collocation – The statistical tendency of words to appear near one another, which can reveal common phrasings within a language (e.g., "crystal clear," "middle management").
Concordance Lines – Samples of text that show the instances of a search word or phrase found in a corpus.
Corpus – A collection of texts (plural: corpora).
Corpus Linguistics – The study of language as expressed in samples (corpora) of real world text.
Culturomics – The application of high-throughput data collection and analysis to the study of human culture.
Data Visualization – The process of converting a data source into a visual representation, such as a graph or a map.
Hapax – A word that occurs only once in a text or corpus (short for hapax legomenon).
KWIC (Key Word in Context) – A common format for a concordance. A KWIC index is formed by sorting and aligning words within a corpus to allow each word to be searchable in an index.
N-gram – A contiguous sequence of n items (e.g., words) from a sample of text. Common types are unigrams (one word), bigrams (two words), and trigrams (three words); several of these terms are illustrated in the code sketch following this glossary.
Smoothing – Removes jagged, discontinuous data points by computing a moving average over the data, which can make trends easier to see. For instance, setting the smoothing to "1" averages each data point with one point on each side of it.
Word Cloud – A graphical representation of word frequency within a corpus. The larger the word in a cloud, the higher the frequency.
Word Token – An individual occurrence of a word in a text or corpus.
Word Type – A unique word within a corpus.
Words per Million – A metric that normalizes the frequency of a search term between corpora of different sizes.
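To make several of these terms concrete, here is a minimal Python sketch of tokens, types, bigrams, hapaxes, words per million, smoothing, and KWIC lines. The sample text and all function names are illustrative, not taken from any tool described in this guide.

from collections import Counter

# Sample text (opening of the Gettysburg Address); any text would do.
text = ("four score and seven years ago our fathers brought forth "
        "on this continent a new nation conceived in liberty")
tokens = text.split()                      # word tokens: every occurrence
types_ = set(tokens)                       # word types: unique words only
bigrams = list(zip(tokens, tokens[1:]))    # contiguous two-word n-grams

counts = Counter(tokens)
hapaxes = [w for w, c in counts.items() if c == 1]  # words occurring once

def per_million(count, corpus_size):
    # Words per million: normalize a raw count by corpus size.
    return count / corpus_size * 1_000_000

def smooth(series, window=1):
    # Smoothing of 1: average each point with one neighbor on each side.
    out = []
    for i in range(len(series)):
        lo, hi = max(0, i - window), min(len(series), i + window + 1)
        out.append(sum(series[lo:hi]) / (hi - lo))
    return out

def kwic(word, width=3):
    # Key Word in Context: each hit with a few words on either side.
    for i, t in enumerate(tokens):
        if t == word:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            print(f"{left:>30} | {t} | {right}")

print(len(tokens), "tokens;", len(types_), "types;", len(hapaxes), "hapaxes")
print(round(per_million(counts["and"], len(tokens))), "per million for 'and'")
kwic("and")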
Guide created by University Librarian Emeritus Colleen Seale & Associate University Librarian Cindy Craig
Bookworm is a text analyzer developed by Harvard's Cultural Observatory.
Robots Reading Vogue was developed by Lindsay King (Robert B. Haas Family Arts Library, Yale University) and Peter Leonard (Yale Digital Humanities Lab). Its corpus is the Vogue Archive (ProQuest), comprising every issue published from 1892 to 2013: over 2,700 covers, 400,000 pages, and 6 TB of data.
As of January 2019, the current dataset comprises 13.7 million digitized volumes, including books, periodicals, and government documents, covering both in-copyright and public domain (out of copyright) materials. Currently, bookworm: HathiTrust is limited to single-word searches.
Read more about it: https://guides.library.illinois.edu/c.php?g=348508&p=2347891
Other Installations of bookworm:
Bookworm: Open Library – Part of the Internet Archive. Its corpus includes books published between 1700 and 2009.
Corpus of Contemporary American English (COCA)
Other Corpora:
Corpus of Historical American English (COHA): COHA contains more than 400 million words of text from the 1810s-2000s. Bibliography: https://www.english-corpora.org/coha/
Other English Corpora (includes links to these individual corpora):
iWeb: "The iWeb corpus contains 14 billion words (about 25 times the size of COCA) in 22 million web pages. It is related to many other corpora of English that we have created, which offer unparalleled insight into variation in English. Unlike other large corpora from the web, the nearly 95,000 websites in iWeb were chosen in a systematic way."
News on the Web (NOW): "The NOW corpus (News on the Web) contains 7.9 billion words of data from web-based newspapers and magazines from 2010 to the present time. More importantly, the corpus grows by about 140-160 million words of data each month (from about 300,000 new articles), or about 1.8 billion words each year."
Global Web-Based English (GloWbE): "The corpus of Global Web-based English (GloWbE; pronounced "globe") is unique in the way that it allows you to carry out comparisons between different varieties of English. GloWbE is related to many other corpora of English that we have created, which offer unparalleled insight into variation in English. GloWbE contains about 1.9 billion words of text from twenty different countries."
Wikipedia Corpus: "This corpus contains the full text of Wikipedia, and it contains 1.9 billion words in more than 4.4 million articles. But this corpus allows you to search Wikipedia in a much more powerful way than is possible with the standard interface. You can search by word, phrase, part of speech, and synonyms. You can also find collocates (nearby words), and see re-sortable concordance lines for any word or phrase."
Hansard Corpus: "The Hansard corpus contains nearly every speech given in the British Parliament from 1803-2005, and it allows you to search these speeches (including semantically-based searches) in ways that are not possible with any other resource."
Early English Books Online: "Early English Books Online (EEBO) is a collection of texts created by the Text Creation Partnership. The "open source" version that we have at this site contains 755 million words in 25,368 texts from the 1470s to the 1690s."
The TV Corpus: "The TV Corpus contains 325 million words of data in 75,000 TV episodes from the 1950s to the current time. All of the 75,000 episodes are tied in to their IMDB entry, which means that you can create Virtual Corpora using extensive metadata -- year, country, series, rating, genre, plot summary, etc."
The Movie Corpus: "The Movie Corpus contains 200 million words of data in more than 25,000 movies from the 1930s to the current time. All of the 25,000+ movies are tied in to their IMDB entry, which means that you can create Virtual Corpora using extensive metadata -- year, country, rating, genre, plot summary, etc."
Corpus of Supreme Court Opinions: "This corpus contains approximately 130 million words in 32,000 Supreme Court decisions from the 1790s to the current time. This corpus was released in March 2017. In Sep 2018, a similar corpus (covering the same period) was released by the BYU Law School."
TIME Magazine Corpus: "The TIME corpus is based on 100 million words of text in about 275,000 articles from TIME magazine from 1923-2006, and it serves as a great resource to examine changes in American English during this time."
Corpus of American Soap Operas: "The SOAP corpus contains 100 million words of data from 22,000 transcripts from American soap operas from the early 2000s."
British National Corpus (BNC): "The British National Corpus (BNC) was originally created by Oxford University Press in the 1980s - early 1990s, and it contains 100 million words of text from a wide range of genres (e.g. spoken, fiction, magazines, newspapers, and academic)."
Strathy Corpus (Canada): "The Strathy Corpus of Canadian English is a product of the Strathy Language Unit at Queen's University. The corpus contains 50 million words from more than 1,100 spoken, fiction, magazines, newspapers, and academic texts."
CORE Corpus: "This corpus contains more than 50 million words of text from the web and this is the first large web-based corpus that is carefully categorized into many different registers."
The JSTOR corpus is freely available and includes data from scholarly journal literature (more than 12 million articles), primary sources (26,000 19th-century British pamphlets), and books. However, the DIY "create your own dataset" option does require a free registration. This option allows the analysis of metadata (with references) and word counts, bigrams, and trigrams for up to 25,000 articles at a time.
Once you've created your dataset, the raw data is packaged into a zip file and sent to your email account. The zip file must then be uploaded to a tool like Voyant for further analysis.
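As a hedged illustration, the sketch below tallies overall word counts from such a download. The zip file name and the two-column CSV layout are assumptions made for this example; inspect the actual contents of your own dataset before adapting it.

import csv
import io
import zipfile
from collections import Counter

totals = Counter()
with zipfile.ZipFile("jstor_dataset.zip") as zf:      # hypothetical file name
    for name in zf.namelist():
        if not name.endswith(".csv"):
            continue                                   # skip non-CSV members
        with zf.open(name) as f:
            for row in csv.reader(io.TextIOWrapper(f, encoding="utf-8")):
                if len(row) < 2 or not row[1].isdigit():
                    continue                           # skip headers/odd rows
                totals[row[0]] += int(row[1])          # assumed word,count rows

print(totals.most_common(20))                          # top 20 words overall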
Library DH Support:
O'Leary, M. (2014). Cool Corpora. Information Today, 31(10), 16.
Russell, J. (2017). Supporting digital humanities: the basics. Online Searcher, (2), 49-52.
Secker, J., Morrison, C. M., Stewart, N., & Horton, L. (2016). To boldly go… the librarian's role in text and data mining. CILIP Update Magazine.
Text Mining Examples:
Craig, K. (2018). Introduction to Bookworm; Robots Reading Vogue; Bookworm: HathiTrust; Bookworm: Open Library; Building a Bookworm. Journal of American History, 105(1), 244–247. https://doi.org/10.1093/jahist/jay139
Genovese, J. E. C. (2015). Interest in Astrology and Phrenology over Two Centuries: A Google Ngram Study. Psychological Reports, 117(3), 940–943. https://doi.org/10.2466/17.PR0.117c27z8
Google Ngram Viewer Shows Changing Language Around Distance Learning. (2014). Distance Education Report, 18(16), 6.
Laudun, J., & Goodwin, J. (2013). Computing folklore studies: Mapping over a century of scholarly production through topics. The Journal of American Folklore, 126(502), 455-475. [JSTOR Data for Research]
Regan, A. (2017). Mining Mind and Body: Approaches and Considerations for Using Topic Modeling to Identify Discourses in Digitized Publications. Journal of Sport History, 44(2), 160-177.
Younes, N., & Reips, U.-D. (2019). Guideline for improving the reliability of Google Ngram studies: Evidence from religious terms. PLoS ONE, 14(3), 1–17. https://doi.org/10.1371/journal.pone.0213554