
Artificial Intelligence (AI): Natural Language Processing

ARCS-AI (Academic Research Consulting and Services) is a library group that offers short, introductory-level AI tutorials and projects for the UF community.

What is Natural Language Processing (NLP)?

NLP is a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to process and analyze large amounts of natural language data.

Common NLP tasks:

  • Part-Of-Speech tagging (POS): labeling each word with a unique tag that indicates its linguistic role.
  • Chunking (CHUNK): also called shallow parsing, used to identify parts of speech and short phrases present in a given sentence.
  • Named Entity Recognition (NER): labeling elements in a sentence with categories such as person, location, or organization.
  • Language Identification: identifying the language of a text or speech recording.
  • Information Extraction: extracting entities, relations, events, sentiments, etc. from unstructured text to form structured information.
  • Text Generation: summarization, machine translation, grammar/spelling correction, question answering, chatbots, etc.
  • Recommendation system: analyzing past text reviews to make future recommendations on buying, liking, watching, etc.
  • Sentiment analysis: identifying the emotional tone (e.g., positive, negative) behind a body of text.
  • AI Chatbot: task-based agents that help humans by taking over repetitive and time-consuming communications.
  • Document Similarity: measuring how similar documents or groups of texts are to each other.
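
As a small illustration of the last task in the list above, document similarity, here is a minimal sketch using only the Python standard library. It compares texts by the cosine similarity of their word-count vectors. The tokenizer, the function names, and the sample sentences are all assumptions made for this example; real projects usually rely on a library and a richer text representation such as TF-IDF or embeddings.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def cosine_similarity(text_a, text_b):
    """Cosine similarity between the word-count vectors of two texts."""
    counts_a = Counter(tokenize(text_a))
    counts_b = Counter(tokenize(text_b))
    # Dot product over the words that appear in both texts.
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[w] * counts_b[w] for w in shared)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text shares nothing with any other text
    return dot / (norm_a * norm_b)


# Hypothetical sample documents, written for this example only.
docs = [
    "The cat sat on the mat.",
    "A cat sat on a mat.",
    "Stock prices fell sharply on Monday.",
]

print(round(cosine_similarity(docs[0], docs[1]), 3))  # near-paraphrases
print(round(cosine_similarity(docs[0], docs[2]), 3))  # unrelated topics
```

A score close to 1 means the two texts use nearly the same words in similar proportions, and a score close to 0 means they share almost no vocabulary. Word counts ignore word order and meaning, which is the main limitation of this simple approach.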

NLP Tools:

 

HuggingFace
Natural Language Toolkit (NLTK)
  • A leading platform for building Python programs to work with natural language data.
  • Over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, plus wrappers for industrial-strength NLP libraries.
  • Tutorials
PyTorch          
  • A Python library for building deep learning (one kind of machine learning) projects.
  • PyTorch is installed on HiPerGator.
TensorFlow                 
  • A collection of workflows for developing and training models using Python, JavaScript, and other languages.
  • The tf.data API is for building complex input pipelines from simple, reusable pieces.
Keras       
  • It offers simpler APIs that minimize the number of user actions required for common use cases.
  • Tutorials
Orange
  • Friendly user interface
  • Official Video Tutorial (Tasks: Word embeddings, sentiment analysis, text classification, text preprocessing, clustering)

CLAMP (Clinical Language Annotation, Modeling, and Processing Toolkit)

CLAMP is a comprehensive clinical Natural Language Processing (NLP) software package that enables recognition and automatic encoding of clinical information in narrative patient reports.

ConvoKit (Cornell Conversational Analysis Toolkit)

  • It contains tools to extract conversational features and analyze social phenomena in conversations. 
  • Demo Link

 

Corpora:

Did you know UF HiPerGator stores NLP tools/datasets which you can use right away? Learn more: https://help.rc.ufl.edu/doc/NLP

The Pile Corpus from eleuther.ai (/data/ai/text/data/thepile) for language modeling (paper)

AI-news collection (/data/ai/text/data/news) from the Bing, DuckDuckGo, and Google search engines. This set is updated weekly; contact Eric at UFRC (ericeric@ufl.edu) for more information.

UF Baldwin Library digital collections (/data/ai/text/data/baldwin_library)


Fact-checking datasets: the open-source FEVER (Fact Extraction and VERification) datasets contain claims labeled as "Supported", "Refuted", or "NotEnoughInfo". Three datasets, from 2018, 2019, and 2021, are available (checked June 2023).

Common sense Q and A: the NLP-Progress repository tracks progress in Natural Language Processing (NLP), including the datasets and the current state of the art for the most common NLP tasks.

Kaggle NLP Datasets: Explore, analyze, and share quality data. 

The open parallel corpus (OPUS)
  • A growing collection of translated texts from the web. 
  • OPUS is based on open source products and the corpus is also delivered as an open content package.
  • All pre-processing is done automatically. No manual corrections have been carried out.
 
Penn Treebank (PTB)
  • The project selected 2,499 stories from a three-year Wall Street Journal (WSJ) collection of 98,732 stories for syntactic annotation. 
  • Suitable for tasks such as POS tagging and syntactic annotation.
 
CoNLL2003 
  • It concentrates on four types of named entities: persons, locations, organizations, and names of miscellaneous entities that do not belong to the previous three groups.
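
Files in this style are commonly distributed as plain text with one token per line, whitespace-separated columns (token, POS tag, chunk tag, NER tag), and a blank line between sentences. The sketch below reads that layout and groups tagged tokens into entity spans. The sample text, the function names, and the PER/LOC/ORG/MISC tag spellings with B-/I- prefixes are assumptions for illustration; check the copy of the data you actually download, since tagging schemes differ between distributions.

```python
# Hypothetical sample written for this example (not taken from the dataset).
# Columns: token, POS tag, chunk tag, NER tag.
SAMPLE = """\
Alice NNP B-NP B-PER
Smith NNP I-NP I-PER
joined VBD B-VP O
Acme NNP B-NP B-ORG
in IN B-PP O
Paris NNP B-NP B-LOC
. . O O

The DT B-NP O
Dutch JJ I-NP B-MISC
team NN I-NP O
won VBD B-VP O
. . O O
"""


def read_sentences(text):
    """Yield each sentence as a list of (token, ner_tag) pairs."""
    sentence = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        columns = line.split()
        sentence.append((columns[0], columns[-1]))
    if sentence:
        yield sentence


def extract_entities(sentence):
    """Group B-/I- tagged tokens into (entity_text, entity_type) spans."""
    entities, tokens, current = [], [], None
    for token, tag in sentence:
        prefix, _, label = tag.partition("-")
        if prefix == "I" and label == current:
            tokens.append(token)  # continue the open entity
            continue
        if tokens:
            entities.append((" ".join(tokens), current))
        if prefix in ("B", "I"):
            tokens, current = [token], label  # start a new entity
        else:
            tokens, current = [], None  # "O" means outside any entity
    if tokens:
        entities.append((" ".join(tokens), current))
    return entities


for sentence in read_sentences(SAMPLE):
    print(extract_entities(sentence))
```

Each printed list holds the entity spans found in one sentence, paired with one of the four entity types described above.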
 
The Stanford Sentiment Treebank (SST)
  • 9,645 sentences, with sentiment rated on a scale from 1 to 25, where 1 is the most negative and 25 is the most positive.
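
For classification experiments, a fine-grained rating like this is often collapsed into a handful of classes. The sketch below shows one way to do that: it rescales a 1 to 25 rating linearly to the range 0 to 1 and assigns one of five equal-width bins. The linear rescaling, the bin edges, the label names, and the function names are all assumptions for illustration, not the dataset's official preprocessing.

```python
LABELS = ["very negative", "negative", "neutral", "positive", "very positive"]
CUTOFFS = (0.2, 0.4, 0.6, 0.8, 1.0)  # assumed equal-width bin upper bounds


def rating_to_unit_interval(rating, low=1, high=25):
    """Rescale a rating on [low, high] linearly to [0, 1]."""
    if not low <= rating <= high:
        raise ValueError(f"rating must be between {low} and {high}")
    return (rating - low) / (high - low)


def five_class_label(rating):
    """Map a 1-25 rating to one of five sentiment classes."""
    score = rating_to_unit_interval(rating)
    for cutoff, label in zip(CUTOFFS, LABELS):
        if score <= cutoff:
            return label
    return LABELS[-1]


for rating in (1, 7, 13, 19, 25):
    print(rating, five_class_label(rating))
```

With these assumed bins, a rating of 13 sits at the midpoint of the scale and lands in the neutral class, while 1 and 25 land in the two extreme classes.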

NLP News and other resources

NLP Podcast
Sebastian Ruder's Blog 
Snyk.io gives a health score to a package based on its popularity, maintenance, security, and community.

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.