Classification: A data mining method that sorts data points into different classes, using algorithms that relate each point's attributes to a qualitative (categorical) variable.
Clustering analysis: A method in both text and data mining in which words or data points are grouped based on similarity. These groups are "clusters" and contain words or points that are more similar to one another than to anything appearing in other clusters.
Collocation analysis: A text mining method highlighting words or terms that frequently appear together, or are "associated" with one another. A group of words that can be characterized in this way is referred to as a collocation. This can provide insight into the meanings associated with particular words throughout a corpus.
Concordance analysis: Also known as keyword in context (KWIC), this is a text mining method which lists every appearance of a given word along with the context in which it appears. The context is presented as a set number of words before and after the keyword for each of its appearances.
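n-gram: A contiguous sequence of n successive items, usually words, but also numbers, symbols, or punctuation, taken from any given sequence of text; a bigram (n = 2), for example, is a pair of adjacent words. N-grams are usually analyzed using term frequency methods, and their frequencies can indicate patterns, including changes in usage over time.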
Named entity recognition (NER): Also referred to as automatic name recognition, this is a method of text mining in which names of people, places, and things are identified and organized into pre-defined categories.
Part-of-speech tagging: A text mining method that labels the words in a corpus with their parts of speech, such as noun or verb, and with grammatical characteristics such as tense, number, and case.
Sentiment analysis: Also known as opinion mining, this text mining method involves the identification of opinions and emotions in a text using a scoring system in order to determine the overall tone.
Term frequency: A text mining method, also known as frequency distribution, that counts the number of times a particular word or phrase appears within a document or corpus. This method may be combined with inverse document frequency, in which the weight of a frequently occurring word is reduced according to the number of documents or texts in the corpus that contain it. The combination is referred to as TF-IDF and is helpful for identifying words distinctive to particular texts in a corpus, as opposed to words that appear frequently throughout the entire corpus.
Topic modelling: A text mining method in which the topics of documents or texts are inferred from words that tend to appear together. There are several algorithms used to perform this kind of analysis, with the most popular being Latent Dirichlet Allocation (LDA). "Topic modeling made simple enough" by Ted Underwood provides a more in-depth explanation of this method.
Source: https://libguides.adelaide.edu.au/c.php?g=964459&p=7007613
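Several of the methods defined above (term frequency, n-grams, concordance analysis, and TF-IDF) are simple enough to sketch with the Python standard library alone. The example below is a minimal illustration, not a reference implementation: the function names, the three-sentence corpus, and the naive whitespace tokenizer are all assumptions made for this sketch, and the TF-IDF weighting shown (raw count multiplied by log(N / document frequency)) is only one of several common variants.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase, split on whitespace, and strip surrounding punctuation."""
    words = (w.strip(".,;:!?\"'()") for w in text.lower().split())
    return [w for w in words if w]


def ngrams(tokens, n):
    """Return every run of n successive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def kwic(tokens, keyword, window=2):
    """Concordance: each occurrence of keyword with `window` words either side."""
    rows = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            rows.append((" ".join(left), tok, " ".join(right)))
    return rows


def tf_idf(docs):
    """Score each word per document as raw count * log(N / document frequency)."""
    tokenized = [tokenize(d) for d in docs]
    # Document frequency: in how many documents does each word occur?
    doc_freq = Counter(word for toks in tokenized for word in set(toks))
    n_docs = len(docs)
    return [
        {w: count * math.log(n_docs / doc_freq[w])
         for w, count in Counter(toks).items()}
        for toks in tokenized
    ]


# Made-up three-document corpus, for illustration only.
corpus = [
    "The cat sat on the mat.",
    "The dog sat on the log.",
    "The cat chased the dog.",
]

tokens = tokenize(corpus[0])
print(Counter(tokens).most_common(1))  # term frequency: the most common word
print(ngrams(tokens, 2))               # bigrams
print(kwic(tokens, "sat"))             # keyword in context
print(tf_idf(corpus)[0])               # TF-IDF scores for the first document
```

In this toy corpus "the" occurs in every document, so its TF-IDF score is zero even though it is the most frequent word, while "mat", which occurs in only one document, scores highest in the first. For real projects, established libraries handle tokenization and weighting far more robustly than this sketch.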