Guides @ UF: Free Text Mining Tools to Explore Women’s and Gender Studies Literature: Home

Striking Gold

Guide created by University Librarian Emeritus Colleen Seale & Associate University Librarian Cindy Craig

Ngram Tools

Google Books Ngram Viewer: a tool developed by Google that analyzes the yearly count of words and phrases found in over 5.2 million books digitized by Google and published between 1500 and 2008.
Corpora include American English, British English, English Fiction, French, German, Hebrew, Chinese, and Russian texts.
Search options let you use wildcards, search for inflections, run case-insensitive searches, look for particular parts of speech, and add, subtract, or divide ngrams.
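To make the arithmetic on ngrams concrete, the short Python sketch below shows what dividing one yearly series by the sum of two series amounts to. The yearly frequencies are invented for illustration, and the function names combine and share are hypothetical; this is not Google's code or API, only a picture of the calculation.

```python
# Toy yearly relative frequencies (invented numbers, not real Ngram Viewer data).
feminism = {1960: 0.00010, 1970: 0.00030, 1980: 0.00050}
suffrage = {1960: 0.00030, 1970: 0.00020, 1980: 0.00010}

def combine(a, b, op):
    """Apply op year by year to two series, keeping only years present in both."""
    return {year: op(a[year], b[year]) for year in sorted(a.keys() & b.keys())}

def share(a, b):
    """Share of series a in the combined total of a and b, year by year."""
    total = combine(a, b, lambda x, y: x + y)
    return combine(a, total, lambda x, y: x / y if y else 0.0)

feminism_share = share(feminism, suffrage)
for year, value in feminism_share.items():
    print(year, round(value, 2))
```

Dividing by the combined total, rather than comparing raw counts, helps show whether one term is gaining ground on another even when the overall volume of published text changes from year to year.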

Bookworm Instances

bookworm is a text analyzer developed by Harvard's Cultural Observatory.

Robots Reading Vogue: This project was developed by Lindsay King, Yale University, Robert B. Haas Family Arts Library, and Peter Leonard, Yale Digital Humanities Lab. Its corpus is the Vogue Archive (ProQuest), which contains every issue published from 1892 to 2013: over 2,700 covers, 400,000 pages, and 6 TB of data.

Search options: Search all texts, or restrict the search by author or by genre: advertisement, article, fashion shoot, cover, table of contents, masthead, index, poem, fiction, letter to the editor, letter from the editor, contributors.
Search results provide a title and link out to the ProQuest database which provides citation information: volume, issue, date and pagination.
Display options: For the dependent variable (the vertical y-axis), choose words per million, percent of text, number of words, or number of texts. On the independent variable (the horizontal x-axis), data are plotted by year and grouped by editor-in-chief: Redding (1892-1901), Harrison (1901-1914), Chase (1914-1951), Daves (1952-1962), Vreeland (1963-1971), Mirabella (1971-1988), and Wintour (1988-present).
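As a rough illustration of how the normalized y-axis measures differ from raw counts, here is a minimal Python sketch. The totals are invented, the function names are hypothetical, and reading the percentage measure as "share of texts that contain the word" is an assumption rather than something the tool's documentation is quoted as saying here.

```python
def words_per_million(word_count, total_words):
    """Occurrences of a word per one million words in the chosen time slice."""
    return word_count * 1_000_000 / total_words

def percent_of_texts(texts_with_word, total_texts):
    """Percentage of texts in the slice that contain the word at least once."""
    return texts_with_word * 100 / total_texts

# Invented figures for a single year (not taken from the Vogue Archive).
year_total_words = 2_000_000
year_total_texts = 800

wpm = words_per_million(150, year_total_words)
pct = percent_of_texts(40, year_total_texts)
print(wpm, pct)
```

Normalized measures like these make years with very different amounts of text comparable, whereas raw word or text counts tend to rise simply because more pages were published.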

bookworm: HathiTrust

As of January 2019, the dataset contains 13.7 million digitized volumes. Volumes include books, periodicals, and government documents, covering both in-copyright and public domain (out of copyright) materials. Currently, bookworm: HathiTrust is limited to single-word searches.

Other Installations of bookworm:

bookworm: Open Library

Part of the Internet Archive. Its corpus includes books published between 1700 and 2009.

Corpus of Contemporary American English (COCA)

Created by Mark Davies at Brigham Young University, this is one of the largest freely available corpora of English-language materials, with over 560 million words (20 million per year, 1990-2017) evenly divided among five genres: spoken, fiction, popular magazines, newspapers, and academic texts.
Source materials include Rolling Stone, Cosmopolitan, Redbook, and NPR transcripts, among others, as well as nearly 100 different peer-reviewed journals selected to cover the entire range of the Library of Congress Classification System (LCCS). Search by words, phrases, and lemmas. Wildcards and other complex searches are supported. To see how two words differ in meaning and usage, use the compare search option.
Search results display either as a list of all matching strings or as horizontal and vertical charts showing how often the word or phrase occurs in each of the five genres. The charts also allow further refinement by subgenre, such as women’s magazines or movie scripts. Context-sensitive help is also provided.
Free to search but a free login is required to access some of the interactive features at this site. The default "non-researcher" status gives you 50 queries a day, or about 1,500 queries per month.

Other Corpora:

Corpus of Historical American English (COHA): COHA contains more than 400 million words of text from the 1810s-2000s. Bibliography: https://www.english-corpora.org/coha/

Other English Corpora (includes links to these individual corpora):

iWeb: "The iWeb corpus contains 14 billion words (about 25 times the size of COCA) in 22 million web pages. It is related to many other corpora of English that we have created, which offer unparalleled insight into variation in English. Unlike other large corpora from the web, the nearly 95,000 websites in iWeb were chosen in a systematic way."

News on the Web (NOW): "The NOW corpus (News on the Web) contains 7.9 billion words of data from web-based newspapers and magazines from 2010 to the present time. More importantly, the corpus grows by about 140-160 million words of data each month (from about 300,000 new articles), or about 1.8 billion words each year."

Global Web-Based English (GloWbE): "The corpus of Global Web-based English (GloWbE; pronounced "globe") is unique in the way that it allows you to carry out comparisons between different varieties of English. GloWbE is related to many other corpora of English that we have created, which offer unparalleled insight into variation in English. GloWbE contains about 1.9 billion words of text from twenty different countries."

Wikipedia Corpus: "This corpus contains the full text of Wikipedia, and it contains 1.9 billion words in more than 4.4 million articles. But this corpus allows you to search Wikipedia in a much more powerful way than is possible with the standard interface. You can search by word, phrase, part of speech, and synonyms. You can also find collocates (nearby words), and see re-sortable concordance lines for any word or phrase."

Hansard Corpus: "The Hansard corpus contains nearly every speech given in the British Parliament from 1803-2005, and it allows you to search these speeches (including semantically-based searches) in ways that are not possible with any other resource."

Early English Books Online: "Early English Books Online (EEBO) is a collection of texts created by the Text Creation Partnership. The "open source" version that we have at this site contains 755 million words in 25,368 texts from the 1470s to the 1690s."

The TV Corpus: "The TV Corpus contains 325 million words of data in 75,000 TV episodes from the 1950s to the current time. All of the 75,000 episodes are tied in to their IMDB entry, which means that you can create Virtual Corpora using extensive metadata -- year, country, series, rating, genre, plot summary, etc."

The Movie Corpus: "The Movie Corpus contains 200 million words of data in more than 25,000 movies from the 1930s to the current time. All of the 25,000+ movies are tied in to their IMDB entry, which means that you can create Virtual Corpora using extensive metadata -- year, country, rating, genre, plot summary, etc."

Corpus of Supreme Court Opinions: "This corpus contains approximately 130 million words in 32,000 Supreme Court decisions from the 1790s to the current time. This corpus was released in March 2017. In Sep 2018, a similar corpus (covering the same period) was released by the BYU Law School."

TIME Magazine Corpus: "The TIME corpus is based on 100 million words of text in about 275,000 articles from TIME magazine from 1923-2006, and it serves as a great resource to examine changes in American English during this time."

Corpus of American Soap Operas: "The SOAP corpus contains 100 million words of data from 22,000 transcripts from American soap operas from the early 2000s."

British National Corpus (BNC): "The British National Corpus (BNC) was originally created by Oxford University Press in the 1980s - early 1990s, and it contains 100 million words of text from a wide range of genres (e.g. spoken, fiction, magazines, newspapers, and academic)."

Strathy Corpus (Canada): "The Strathy Corpus of Canadian English is a product of the Strathy Language Unit at Queen's University. The corpus contains 50 million words from more than 1,100 spoken, fiction, magazines, newspapers, and academic texts."

CORE Corpus: "This corpus contains more than 50 million words of text from the web and this is the first large web-based corpus that is carefully categorized into many different registers."

JSTOR Data for Research (DfR)

The JSTOR corpus is freely available and includes data from scholarly journal literature (more than 12 million articles), primary sources (26,000 19th Century British Pamphlets), and books. However, the DIY "create your own dataset" option does require a free registration. This option allows the analysis of metadata (with references) and word counts, bi-grams, and tri-grams for up to 25,000 articles at a time.
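For readers new to the terms, word counts, bi-grams, and tri-grams are counts of single words, adjacent word pairs, and adjacent word triples. The Python sketch below shows one simple way to compute them from a sentence; the tokenizer and the function names are illustrative assumptions and are not how JSTOR prepares its data.

```python
from collections import Counter
import re

def tokenize(text):
    """Lowercase the text and keep runs of letters and apostrophes as tokens."""
    return re.findall(r"[a-z']+", text.lower())

def ngram_counts(tokens, n):
    """Count every run of n adjacent tokens, joined with single spaces."""
    return Counter(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

sample = "Women's studies programs grew as women's studies scholarship grew."
tokens = tokenize(sample)

unigrams = ngram_counts(tokens, 1)   # word counts
bigrams = ngram_counts(tokens, 2)    # adjacent pairs
trigrams = ngram_counts(tokens, 3)   # adjacent triples

print(unigrams.most_common(3))
print(bigrams.most_common(1))
```

Real datasets arrive with these counts already computed per article, so a sketch like this is mainly useful for understanding what the numbers in the download represent.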

Once you've created your dataset, the raw data is packaged as a zip file and sent to your email account. The zip file must then be uploaded to a tool like Voyant for further analysis. Custom datasets may also be created.
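If you would rather inspect the download before uploading it anywhere, a zip file can be opened with Python's standard zipfile module. The sketch below is a minimal example under stated assumptions: the member names and contents are made up, the helper read_text_members is hypothetical, and the actual layout of a DfR download may differ.

```python
import io
import zipfile

def read_text_members(zip_source, suffixes=(".txt", ".csv", ".xml")):
    """Return {member name: decoded text} for members with the given suffixes."""
    texts = {}
    with zipfile.ZipFile(zip_source) as archive:
        for name in archive.namelist():
            if name.lower().endswith(suffixes):
                texts[name] = archive.read(name).decode("utf-8", errors="replace")
    return texts

# Build a tiny stand-in archive in memory so the sketch runs without a download.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("ngram1/article-1.txt", "gender\t12\nwomen\t9\n")
    archive.writestr("README.md", "not a data file")
buffer.seek(0)

demo = read_text_members(buffer)
print(sorted(demo))
```

With a real download, pass the path to the zip file in place of the in-memory buffer, for example read_text_members("my-dataset.zip"), where the file name is whatever you saved the attachment as.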

Bibliography

Library DH Support:

O'Leary, M. (2014). Cool Corpora. Information Today, 31(10), 16.

Russell, J. (2017). Supporting digital humanities: The basics. Online Searcher, (2), 49–52.

Secker, J., Morrison, C. M., Stewart, N., & Horton, L. (2016). To boldly go… the librarian's role in text and data mining. CILIP Update Magazine.

Text Mining Examples:

Craig, K. (2018). Introduction to Bookworm; Robots Reading Vogue; Bookworm: HathiTrust; Bookworm: Open Library; Building a Bookworm. Journal of American History, 105(1), 244–247. https://doi.org/10.1093/jahist/jay139

Genovese, J. E. C. (2015). Interest in Astrology and Phrenology over Two Centuries: A Google Ngram Study. Psychological Reports, 117(3), 940–943. https://doi.org/10.2466/17.PR0.117c27z8

Google Ngram Viewer Shows Changing Language Around Distance Learning. (2014). Distance Education Report, 18(16), 6.

Laudun, J., & Goodwin, J. (2013). Computing folklore studies: Mapping over a century of scholarly production through topics. The Journal of American Folklore, 126(502), 455–475. [JSTOR Data for Research]

Regan, A. (2017). Mining Mind and Body: Approaches and Considerations for Using Topic Modeling to Identify Discourses in Digitized Publications. Journal of Sport History, 44(2), 160–177.

Younes, N., & Reips, U.-D. (2019). Guideline for improving the reliability of Google Ngram studies: Evidence from religious terms. PLoS ONE, 14(3), 1–17. https://doi.org/10.1371/journal.pone.0213554