Bookworm is a text analyzer developed by Harvard's Cultural Observatory.
As of January 2019, the current dataset contains 13.7 million digitized volumes. Volumes include books, periodicals, and government documents, covering both in-copyright and public domain (out-of-copyright) materials. Currently, Bookworm: HathiTrust is limited to single-word searches.
Read more about it: https://guides.library.illinois.edu/c.php?g=348508&p=2347891
Other Installations of Bookworm:
Part of the Internet Archive. Its corpus includes books published between 1700 and 2009.
Other English Corpora (includes links to these individual corpora):
iWeb: "The iWeb corpus contains 14 billion words (about 25 times the size of COCA) in 22 million web pages. It is related to many other corpora of English that we have created, which offer unparalleled insight into variation in English. Unlike other large corpora from the web, the nearly 95,000 websites in iWeb were chosen in a systematic way."
News on the Web (NOW): "The NOW corpus (News on the Web) contains 7.9 billion words of data from web-based newspapers and magazines from 2010 to the present time. More importantly, the corpus grows by about 140-160 million words of data each month (from about 300,000 new articles), or about 1.8 billion words each year."
Global Web-Based English (GloWbE): "The corpus of Global Web-based English (GloWbE; pronounced "globe") is unique in the way that it allows you to carry out comparisons between different varieties of English. GloWbE is related to many other corpora of English that we have created, which offer unparalleled insight into variation in English. GloWbE contains about 1.9 billion words of text from twenty different countries."
Wikipedia Corpus: "This corpus contains the full text of Wikipedia, and it contains 1.9 billion words in more than 4.4 million articles. But this corpus allows you to search Wikipedia in a much more powerful way than is possible with the standard interface. You can search by word, phrase, part of speech, and synonyms. You can also find collocates (nearby words), and see re-sortable concordance lines for any word or phrase."
Hansard Corpus: "The Hansard corpus contains nearly every speech given in the British Parliament from 1803-2005, and it allows you to search these speeches (including semantically-based searches) in ways that are not possible with any other resource."
Early English Books Online: "Early English Books Online (EEBO) is a collection of texts created by the Text Creation Partnership. The "open source" version that we have at this site contains 755 million words in 25,368 texts from the 1470s to the 1690s."
The TV Corpus: "The TV Corpus contains 325 million words of data in 75,000 TV episodes from the 1950s to the current time. All of the 75,000 episodes are tied in to their IMDB entry, which means that you can create Virtual Corpora using extensive metadata -- year, country, series, rating, genre, plot summary, etc."
The Movie Corpus: "The Movie Corpus contains 200 million words of data in more than 25,000 movies from the 1930s to the current time. All of the 25,000+ movies are tied in to their IMDB entry, which means that you can create Virtual Corpora using extensive metadata -- year, country, rating, genre, plot summary, etc."
Corpus of Supreme Court Opinions: "This corpus contains approximately 130 million words in 32,000 Supreme Court decisions from the 1790s to the current time. This corpus was released in March 2017. In Sep 2018, a similar corpus (covering the same period) was released by the BYU Law School."
TIME Magazine Corpus: "The TIME corpus is based on 100 million words of text in about 275,000 articles from TIME magazine from 1923-2006, and it serves as a great resource to examine changes in American English during this time."
British National Corpus (BNC): "The British National Corpus (BNC) was originally created by Oxford University press in the 1980s - early 1990s, and it contains 100 million words of text texts from a wide range of genres (e.g. spoken, fiction, magazines, newspapers, and academic)."
Strathy Corpus (Canada): "The Strathy Corpus of Canadian English is a product of the Strathy Language Unit at Queen's University. The corpus contains 50 million words from more than 1,100 spoken, fiction, magazines, newspapers, and academic texts."
The JSTOR corpus is freely available and includes data from scholarly journal literature (more than 12 million articles), primary sources (26,000 19th Century British Pamphlets), and books. However, the DIY "create your own dataset" option does require a free registration. This option allows the analysis of metadata (with references) and word counts, bigrams, and trigrams for up to 25,000 articles at a time.
Once you've created your dataset, the raw data is packaged as a zipfile and sent to your email account. The zipfile must then be uploaded to a tool like Voyant for further analysis. Custom datasets may also be created.
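For readers unfamiliar with the terms, the following is a minimal Python sketch of what word counts, bigrams, and trigrams are. It does not read a JSTOR download (the layout of those files is not described in this guide); the function name ngram_counts, the sample sentence, and the simple letters-only tokenization are assumptions made for illustration.

```python
import re
from collections import Counter


def ngram_counts(text, n):
    """Count n-grams (runs of n consecutive words) in a piece of text."""
    # Assumption: a "word" is a lowercase run of letters or apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    # Slide a window of width n across the token list.
    windows = zip(*(tokens[i:] for i in range(n)))
    return Counter(" ".join(window) for window in windows)


# Hypothetical sample text standing in for one article.
sample = "the cat sat on the mat and the cat slept"

unigrams = ngram_counts(sample, 1)  # single-word counts
bigrams = ngram_counts(sample, 2)   # two-word sequences
trigrams = ngram_counts(sample, 3)  # three-word sequences

print(unigrams.most_common(3))
print(bigrams.most_common(3))
print(trigrams.most_common(3))
```

In this sample, "the" is the most frequent single word and "the cat" is the only bigram that occurs more than once. Tools like Voyant perform this kind of counting at scale and add visualizations on top.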
Library DH Support:
O'Leary, M. (2014). Cool Corpora. Information Today, 31(10), 16.
Russell, J. (2017). Supporting digital humanities: the basics. Online Searcher, (2), 49-52.
Secker, J., Morrison, C. M., Stewart, N., & Horton, L. (2016). To boldly go… the librarian's role in text and data mining. CILIP Update Magazine.
Text Mining Examples:
Craig, K. (2018). Introduction to Bookworm; Robots Reading Vogue; Bookworm: HathiTrust; Bookworm: Open Library; Building a Bookworm. Journal of American History, 105(1), 244–247. https://doi.org/10.1093/jahist/jay139
Genovese, J. E. C. (2015). Interest in Astrology and Phrenology over Two Centuries: A Google Ngram Study. Psychological Reports, 117(3), 940–943. https://doi.org/10.2466/17.PR0.117c27z8
Google Ngram Viewer Shows Changing Language Around Distance Learning. (2014). Distance Education Report, 18(16), 6.
Laudun, J., & Goodwin, J. (2013). Computing folklore studies: Mapping over a century of scholarly production through topics. The Journal of American Folklore, 126(502), 455-475. [JSTOR Data for Research]
Regan, A. (2017). Mining Mind and Body: Approaches and Considerations for Using Topic Modeling to Identify Discourses in Digitized Publications. Journal of Sport History, 44(2), 160-177.
Younes, N., & Reips, U.-D. (2019). Guideline for improving the reliability of Google Ngram studies: Evidence from religious terms. PLoS ONE, 14(3), 1–17. https://doi.org/10.1371/journal.pone.0213554