Bag of Words (BoW): A simplifying representation of text data that describes the occurrence of words within a document, ignoring grammar and word order but keeping multiplicity.
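A minimal BoW sketch using scikit-learn's CountVectorizer (assuming scikit-learn is installed; any word-counting implementation works the same way):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat on the mat", "the dog sat on the log"]
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)  # sparse document-term matrix

print(vectorizer.get_feature_names_out())  # the learned vocabulary
print(counts.toarray())  # per-document word counts: order ignored, multiplicity kept
```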
Clustering: A machine learning technique used to group a set of objects (like documents) into clusters, such that objects in the same cluster are more similar to each other than to those in other clusters.
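A minimal sketch of clustering documents with k-means over TF-IDF vectors, assuming scikit-learn is available (the documents and cluster count are illustrative):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = [
    "stars and planets in the night sky",
    "telescopes observe distant galaxies",
    "recipes for baking sourdough bread",
    "kneading dough and proofing yeast",
]
X = TfidfVectorizer().fit_transform(docs)  # turn each document into a vector
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)  # e.g. [0 0 1 1]: astronomy texts in one cluster, baking in the other
```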
Corpus: A collection of texts in a structured, machine-readable format intended for use in text mining. The term may also refer to a general collection of written works, often by the same author or on the same topic.
Data/Text Mining: The process of deriving meaningful information from unstructured text or large data sets. It involves various techniques to transform text into data for analysis.
DOI (Digital object identifier): A unique identifying number for a document, such as a journal article, that acts as a persistent link to the original digital object. DOIs are often preferred over URLs where possible because they are resistant to issues such as link rot.
Feature Extraction: The process of transforming raw data into a set of attributes or features that can be used in machine learning models. In text mining, this often involves identifying important words or phrases.
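As a sketch, features can be as simple as hand-computed document statistics (the feature names here are illustrative; BoW and TF-IDF, defined elsewhere in this list, are more common choices):

```python
def extract_features(text: str) -> dict:
    """Turn raw text into a small set of numeric features."""
    words = text.lower().split()
    return {
        "num_words": len(words),
        "num_unique_words": len(set(words)),
        "avg_word_length": sum(len(w) for w in words) / len(words),
    }

print(extract_features("The quick brown fox jumps over the lazy dog"))
```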
Graphical User Interface (GUI): A type of user interface that allows users to interact with a device or computer through visual elements such as buttons, menus, and icons.
Named Entity Recognition (NER): A process in NLP that involves identifying and classifying key elements in text into predefined categories, such as names of people, organizations, or locations.
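A minimal NER sketch using spaCy, assuming the spacy package and its small English model (en_core_web_sm) are installed:

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # pretrained English pipeline with an NER component
doc = nlp("Ada Lovelace worked with Charles Babbage in London.")
for ent in doc.ents:
    print(ent.text, ent.label_)  # e.g. "Ada Lovelace PERSON", "London GPE"
```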
Natural Language Processing (NLP): A field of artificial intelligence that focuses on the interaction between computers and humans through natural language. It involves enabling computers to understand, interpret, and generate human language.
OCR (Optical character recognition): The use of computer technology to recognize text within a digital image and make it machine-readable. This improves the accessibility of physical texts that have been digitized.
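A minimal OCR sketch using pytesseract, a Python wrapper for the Tesseract engine (both must be installed, and scanned_page.png is a hypothetical image of a printed page):

```python
from PIL import Image
import pytesseract

# Read the text out of a scanned page image and make it machine-readable.
text = pytesseract.image_to_string(Image.open("scanned_page.png"))
print(text)
```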
Stemming: A technique used to reduce words to their root form. For example, "running" becomes "run." This helps in standardizing words for analysis.
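A minimal stemming sketch using NLTK's Porter stemmer (assuming the nltk package is installed):

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["running", "runs", "runner"]:
    print(word, "->", stemmer.stem(word))
# running -> run, runs -> run, runner -> runner (stems need not be dictionary words)
```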
Term Frequency-Inverse Document Frequency (TF-IDF): A statistical measure used to evaluate the importance of a word in a document relative to a collection (or corpus) of documents. It helps in identifying relevant words for further analysis.
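A minimal TF-IDF sketch using scikit-learn (assuming it is installed; the documents are illustrative). Common words like "the" appear in every document and score low, while distinctive words score high:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the bird flew over the house",
]
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())
print(tfidf.toarray().round(2))  # one row per document, one column per word
```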
Tokenization: The process of breaking down text into smaller units, or tokens, such as words or phrases. This is often the first step in text processing.
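A minimal tokenization sketch using a single regular-expression rule (real tokenizers, e.g. in NLTK or spaCy, handle punctuation, contractions, and other edge cases):

```python
import re

text = "Tokenization splits text into words, phrases, or other units."
tokens = re.findall(r"[A-Za-z']+", text)  # naive rule: a token is a run of letters
print(tokens)
# ['Tokenization', 'splits', 'text', 'into', 'words', 'phrases', 'or', 'other', 'units']
```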
List of terms and definitions generated with ChatGPT and edited by a human librarian, 10/31/2024.