Guides @ UF: Text and Data Mining: Tools for Text and Data Mining

Tools

Altair RapidMiner
Altair RapidMiner is a suite of data analytics and machine learning tools that helps users get more value from their data through data preparation, machine learning, text analytics, big data integration, predictive analytics, and model deployment.
Orange
Orange is an open-source toolkit for machine learning and data visualization.
Python
Python is a general-purpose, open-source programming language used in a variety of applications, including web and software development, data science, and machine learning.
R
R is a free software environment for statistical computing and graphics. It compiles and runs on a wide variety of UNIX platforms, Windows, and macOS.
Rattle
Rattle is a popular Graphical User Interface (GUI) for data mining using R.
Research Data Management at UF: Tools
Research guide covering data management tools and policies at UF

Methods

Classification: A data mining method that assigns data points to classes (the values of a qualitative, or categorical, variable), typically using algorithms trained on examples whose classes are already known.
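
As a rough illustration of classification, the sketch below labels a new data point with the class of its closest known example (a nearest-neighbour rule, chosen here only because it is short; the guide does not name a specific algorithm). The training data, the feature meanings, and the function names are all hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def classify(train, point):
    """Return the class label of the training example closest to point."""
    _, nearest_label = min(train, key=lambda example: sq_dist(example[0], point))
    return nearest_label

# Hypothetical training data: (exclamation marks, links) per message,
# each paired with a class that is already known.
train = [
    ((0, 0), "ham"),
    ((1, 0), "ham"),
    ((0, 1), "ham"),
    ((5, 3), "spam"),
    ((6, 4), "spam"),
    ((4, 5), "spam"),
]

predictions = [classify(train, p) for p in [(1, 1), (5, 4)]]
print(predictions)
```

In practice a library such as scikit-learn would normally be used instead of hand-written distance code; this version only shows the idea of assigning points to predefined classes.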

Clustering analysis: A method in both text and data mining in which words or data points are grouped based on similarity. These groups are called "clusters", and each contains words or points that are more similar to one another than to anything in other clusters.
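
To make the idea of grouping by similarity concrete, here is a minimal sketch of k-means, one common clustering algorithm (the guide does not prescribe a particular one). The sample points and the starting centroids are made up for illustration, and similarity is assumed to mean closeness in two-dimensional space.

```python
def kmeans(points, centroids, iterations=10):
    """Plain k-means on 2-D points, starting from the given centroids."""
    for _ in range(iterations):
        # Assignment step: put each point in the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            distances = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [
            (sum(p[0] for p in cl) / len(cl), sum(p[1] for p in cl) / len(cl)) if cl else c
            for cl, c in zip(clusters, centroids)
        ]
    return clusters, centroids

# Hypothetical data: two visibly separate groups of points.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
clusters, centroids = kmeans(points, centroids=[(0, 0), (10, 10)])
print(clusters)
```

Unlike classification, no class labels are supplied here: the groups emerge from the distances between the points alone.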

Collocation analysis: A text mining method highlighting words or terms that frequently appear together, or are "associated" with one another. A group of words that can be characterized in this way is referred to as a collocation. This can provide insight into the meanings associated with particular words throughout a corpus.
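
One way to score how strongly two adjacent words are associated is pointwise mutual information (PMI), which compares how often a pair occurs with how often it would occur if the two words were independent. The sketch below assumes PMI as the association measure and uses an invented sentence; the guide itself does not specify a measure, and the function name and `min_count` parameter are hypothetical.

```python
import math
from collections import Counter

def bigram_pmi(tokens, min_count=2):
    """Score adjacent word pairs by pointwise mutual information (log base 2)."""
    word_counts = Counter(tokens)
    pair_counts = Counter(zip(tokens, tokens[1:]))
    n_words = len(tokens)
    n_pairs = len(tokens) - 1
    scores = {}
    for (a, b), count in pair_counts.items():
        if count < min_count:
            continue  # rare pairs give unreliable scores, so skip them
        p_pair = count / n_pairs
        p_a = word_counts[a] / n_words
        p_b = word_counts[b] / n_words
        scores[(a, b)] = math.log2(p_pair / (p_a * p_b))
    return scores

tokens = "new york is big new york is old the new car is big".split()
scores = bigram_pmi(tokens)
for pair, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(pair, round(score, 2))
```

A positive score means the pair appears together more often than chance would predict, which is the sense of "associated" used above.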

Concordance analysis: Also known as keyword in context (KWIC), this is a text mining method that lists every occurrence of a given word along with the context in which it appears. Each occurrence is presented with a set number of words before and after the keyword.
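
A concordance can be sketched in a few lines. The function below is a hypothetical helper, not part of any particular tool: it assumes the text is already split into tokens and returns the words on either side of each match.

```python
def kwic(tokens, keyword, window=2):
    """Return (left context, keyword, right context) for each occurrence."""
    rows = []
    for i, token in enumerate(tokens):
        if token.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            rows.append((left, token, right))
    return rows

tokens = "the cat sat on the mat and the cat slept".split()
concordance = kwic(tokens, "cat", window=2)
for left, word, right in concordance:
    # Right-align the left context so the keywords line up in a column.
    print(f"{left:>12} [{word}] {right}")
```

The `window` parameter controls how many words of context are shown on each side of the keyword.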

n-gram: A contiguous sequence of n items, usually words, but also numbers, symbols, or punctuation, taken from any given sequence of text. For example, a 2-gram (bigram) is a pair of adjacent words. Counting n-grams reveals recurring patterns in a text, and n-grams are usually analyzed using term frequency methods.
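
Extracting n-grams amounts to sliding a window of length n over the tokens. This is a minimal sketch using an invented sentence and whitespace tokenization, which is a simplifying assumption; real projects usually tokenize more carefully.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the cat sat on the mat and the cat slept".split()
bigrams = ngrams(tokens, 2)

# Counting the n-grams is the term frequency step applied to word pairs.
bigram_counts = Counter(bigrams)
print(bigram_counts.most_common(1))
```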

Named entity recognition (NER): Also referred to as automatic name recognition, this is a method of text mining in which names of people, places, and things are identified and organized into pre-defined categories.

Part-of-speech tagging: A text mining method that labels each word in a corpus with its part of speech, such as noun or verb, and may also record grammatical characteristics such as tense, number, and case.

Sentiment analysis: Also known as opinion mining, this text mining method involves the identification of opinions and emotions in a text using a scoring system in order to determine the overall tone.
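
The scoring idea can be illustrated with a tiny word list. The lexicon below is a hypothetical toy example with invented weights; real analyses rely on published lexicons or trained models, and this sketch ignores negation, sarcasm, and context.

```python
# Hypothetical toy lexicon: positive words score above zero, negative below.
LEXICON = {"good": 1, "great": 2, "happy": 1, "bad": -1, "terrible": -2, "sad": -1}

def sentiment_score(text):
    """Sum the lexicon scores of the words in text (unknown words count as 0)."""
    tokens = [t.strip(".,!?;:").lower() for t in text.split()]
    return sum(LEXICON.get(t, 0) for t in tokens)

def tone(score):
    """Map a numeric score to an overall tone."""
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

review = "The food was great, but the service was bad."
score = sentiment_score(review)
print(score, tone(score))
```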

Term frequency: A text mining method that counts the number of times a particular word or phrase appears within a document or corpus, also known as frequency distribution. It may be combined with inverse document frequency, in which the weight of a frequently occurring word is offset by the number of documents or texts in the corpus containing that word. The combination is referred to as TF-IDF and is helpful for identifying words distinctive of particular texts in a corpus, as opposed to words that appear frequently throughout the entire corpus.
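
The calculation can be sketched as follows. This version assumes one common formulation, term frequency as a share of the document's words multiplied by the natural log of (number of documents / documents containing the term); libraries often use smoothed variants, so their numbers will differ. The three sample documents are invented.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of token lists. Returns one {term: score} dict per document."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        scores.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores

docs = [
    "the whale swam in the sea".split(),
    "the ship sailed the sea".split(),
    "the captain hunted the whale".split(),
]
weights = tf_idf(docs)
print(sorted(weights[0].items(), key=lambda item: -item[1]))
```

Because "the" occurs in every document, its inverse document frequency is log(1) = 0 and its score drops to zero, while a word found in only one document, such as "swam", scores highest in that document.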

Topic modelling: A text mining method in which the topics of documents or texts are inferred from words that tend to appear together. There are several algorithms used to perform this kind of analysis, with the most popular being Latent Dirichlet Allocation (LDA). "Topic modeling made simple enough" by tedunderwood provides a more in-depth explanation of this method.

Source: https://libguides.adelaide.edu.au/c.php?g=964459&p=7007613
