Bag of Words (BoW): A simplifying representation of text data that describes the occurrence of words within a document, ignoring grammar and word order but keeping multiplicity.
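A minimal BoW sketch using scikit-learn's CountVectorizer (assuming scikit-learn is installed; any word-counting implementation works the same way):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat on the mat", "the dog sat on the log"]
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)  # sparse document-term matrix

print(vectorizer.get_feature_names_out())  # the learned vocabulary
print(counts.toarray())  # per-document word counts: order ignored, multiplicity kept
```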
Clustering: A machine learning technique used to group a set of objects (like documents) into clusters, such that objects in the same cluster are more similar to each other than to those in other clusters.
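A minimal sketch of clustering documents with k-means over TF-IDF vectors, assuming scikit-learn is available (the documents and cluster count are illustrative):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = [
    "stars and planets in the night sky",
    "telescopes observe distant galaxies",
    "recipes for baking sourdough bread",
    "kneading dough and proofing yeast",
]
X = TfidfVectorizer().fit_transform(docs)  # turn each document into a vector
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)  # e.g. [0 0 1 1]: astronomy texts in one cluster, baking in the other
```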
Corpus: A collection of texts in a structured, machine-readable format intended for use in text mining. The term may also refer to a general collection of written works, often by the same author or on the same topic.
Data/Text Mining: The process of deriving meaningful information from unstructured text or large data sets. It involves various techniques to transform text into data for analysis.
DOI (Digital object identifier): A unique identifying number for a document, such as a journal article, that acts as a persistent link to the original digital object. DOIs are often preferred over URLs where possible because they are resistant to issues such as link rot.
Feature Extraction: The process of transforming raw data into a set of attributes or features that can be used in machine learning models. In text mining, this often involves identifying important words or phrases.
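As a sketch, features can be as simple as hand-computed document statistics (the feature names here are illustrative; BoW and TF-IDF, defined elsewhere in this list, are more common choices):

```python
def extract_features(text: str) -> dict:
    """Turn raw text into a small set of numeric features."""
    words = text.lower().split()
    return {
        "num_words": len(words),
        "num_unique_words": len(set(words)),
        "avg_word_length": sum(len(w) for w in words) / len(words),
    }

print(extract_features("The quick brown fox jumps over the lazy dog"))
```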
Graphical User Interface (GUI): A type of user interface that allows users to interact with a device or computer through visual elements such as buttons, menus, and icons.
Named Entity Recognition (NER): A process in NLP that involves identifying and classifying key elements in text into predefined categories, such as names of people, organizations, or locations.
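A minimal NER sketch using spaCy, assuming the spacy package and its small English model (en_core_web_sm) are installed:

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # pretrained English pipeline with an NER component
doc = nlp("Ada Lovelace worked with Charles Babbage in London.")
for ent in doc.ents:
    print(ent.text, ent.label_)  # e.g. "Ada Lovelace PERSON", "London GPE"
```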
Natural Language Processing (NLP): A field of artificial intelligence that focuses on the interaction between computers and humans through natural language. It involves enabling computers to understand, interpret, and generate human language.
OCR (Optical character recognition): The use of computer technology to recognize text within a digital image and make it machine-readable. This improves the accessibility of physical texts that have been digitized.
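A minimal OCR sketch using pytesseract, a Python wrapper for the Tesseract engine (both must be installed, and scanned_page.png is a hypothetical image of a printed page):

```python
from PIL import Image
import pytesseract

# Read the text out of a scanned page image and make it machine-readable.
text = pytesseract.image_to_string(Image.open("scanned_page.png"))
print(text)
```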
Stemming: A technique used to reduce words to their root form. For example, "running" becomes "run." This helps in standardizing words for analysis.
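A minimal stemming sketch using NLTK's Porter stemmer (assuming the nltk package is installed):

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["running", "runs", "runner"]:
    print(word, "->", stemmer.stem(word))
# running -> run, runs -> run, runner -> runner (stems need not be dictionary words)
```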
Term Frequency-Inverse Document Frequency (TF-IDF): A statistical measure used to evaluate the importance of a word in a document relative to a collection (or corpus) of documents. It helps in identifying relevant words for further analysis.
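A minimal TF-IDF sketch using scikit-learn (assuming it is installed; the documents are illustrative). Common words like "the" appear in every document and score low, while distinctive words score high:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the bird flew over the house",
]
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())
print(tfidf.toarray().round(2))  # one row per document, one column per word
```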
Tokenization: The process of breaking down text into smaller units, or tokens, such as words or phrases. This is often the first step in text processing.
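A minimal tokenization sketch using a single regular-expression rule (real tokenizers, e.g. in NLTK or spaCy, handle punctuation, contractions, and other edge cases):

```python
import re

text = "Tokenization splits text into words, phrases, or other units."
tokens = re.findall(r"[A-Za-z']+", text)  # naive rule: a token is a run of letters
print(tokens)
# ['Tokenization', 'splits', 'text', 'into', 'words', 'phrases', 'or', 'other', 'units']
```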
List of terms and definitions generated with ChatGPT and edited by a human librarian, 10/31/2024.