
Artificial Intelligence (AI): Natural Language Processing

ARCS (Academic Search Consulting and Services Department) is a library group that offers short, introductory-level AI tutorials and projects for the UF community.

Trends in NLP? ChatGPT!

>> The GPT models! ChatGPT is a powerful tool, but it "lies" sometimes.

>> How can something be unreliable and powerful at the same time?

>> It depends on what we use it for and how we use it.

>> UFIT tutorial: The Role of ChatGPT in Assessment of Student Learning (April 11, 2023) -- Click to access the slides

>> How does UF regulate the use of ChatGPT?

>> From UF IRM (Integrated Risk Management):

Any data classified as sensitive or restricted should not be used. This includes, but is not limited to, the following data types:

  • Social Security Numbers
  • Education Records
  • Employee Data
  • Credit Card Numbers
  • Protected Health Information
  • Human Subject Research Data
  • Unpublished Research Data
  • Personally Identifiable Information

>> How can ChatGPT be helpful for librarians?

>> Share your thoughts with us! A thought from me: ChatGPT, an Opportunity to Understand More About Language Models

Natural Language Processing Basics

What is Natural Language Processing (NLP)?

A subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to process and analyze large amounts of natural language data.

 

Common NLP tasks:

  • Part-Of-Speech tagging (POS): labeling each word with a unique tag that indicates its linguistic role.
  • Chunking (CHUNK): also called shallow parsing, used to identify parts of speech and short phrases present in a given sentence.
  • Named Entity Recognition (NER): labeling elements in a sentence with entity categories such as persons, locations, and organizations.
  • Language Identification: identifying the language of a text or of speech.
  • Information Extraction: extracting entities, relations, events, sentiments, etc., from unstructured text to form structured information.
  • Text Generation: summarization, machine translation, grammar/spelling correction, question answering, chatbots, etc.
  • Recommendation systems: analyzing past text reviews to make recommendations on what to buy, like, watch, etc.
  • Sentiment analysis: identifying the emotional tone (e.g., positive, negative) behind a body of text.
  • AI Chatbots: task-based agents that help humans by taking over repetitive and time-consuming communications.
  • Document Similarity: measuring how similar documents or groups of texts are to each other (a small code sketch follows this list).
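To make one of these tasks concrete, here is a minimal, dependency-free sketch of document similarity using bag-of-words vectors and cosine similarity. The toy documents and the whitespace tokenizer are illustrative assumptions only; real projects would use the tools listed below.

```python
from collections import Counter
import math

def bag_of_words(text):
    """Toy tokenizer: lowercase, split on whitespace, count word frequencies."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

doc1 = "the library offers short NLP tutorials for the UF community"
doc2 = "the UF library offers tutorials on natural language processing"
doc3 = "credit card numbers are restricted data"

bow1, bow2, bow3 = map(bag_of_words, (doc1, doc2, doc3))
print(cosine_similarity(bow1, bow2))  # relatively high: shared vocabulary
print(cosine_similarity(bow1, bow3))  # near zero: little overlap
```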

 

NLP Tools: 

 

HuggingFace
  • An open-source ecosystem built around the transformers library and the Hugging Face Hub, which hosts pretrained models and datasets for many NLP tasks (a quick pipeline sketch follows).
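As a quick, hedged illustration (a minimal sketch, not an official Hugging Face example), the transformers pipeline API can run a pretrained sentiment-analysis model in a few lines; the default model is downloaded on first use, so an internet connection is assumed.

```python
# Assumes: pip install transformers torch
from transformers import pipeline

# Load a default pretrained sentiment-analysis pipeline (model downloads on first use).
classifier = pipeline("sentiment-analysis")

results = classifier([
    "ChatGPT is a powerful tool.",
    "Its answers are sometimes unreliable.",
])
for result in results:
    print(result)  # e.g. {'label': 'POSITIVE', 'score': 0.99...}
```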

       

 

Natural Language Toolkit (NLTK)
  • A leading platform for building Python programs to work with natural language data.
  • Over 50 corpora and lexical resources such as WordNet, along with a suite of text-processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, plus wrappers for industrial-strength NLP libraries (see the sketch below).
  • Tutorials
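A minimal sketch of two of the tasks above (tokenization and POS tagging) with NLTK. The nltk.download calls fetch the required resources; resource names have shifted slightly across NLTK releases, so follow the error message if a download is reported missing.

```python
# Assumes: pip install nltk
import nltk

# One-time downloads of the tokenizer and tagger models.
nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")

sentence = "The library offers short NLP tutorials for the UF community."

tokens = nltk.word_tokenize(sentence)  # tokenization
tags = nltk.pos_tag(tokens)            # part-of-speech tagging
print(tags)  # e.g. [('The', 'DT'), ('library', 'NN'), ...]
```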

 

PyTorch          
  • A Python library for building deep learning (one kind of machine learning) projects.
  • PyTorch is installed on HiPerGator.
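A minimal sketch (not taken from the PyTorch documentation) of what a text model looks like at the API level: a toy bag-of-words classifier built from EmbeddingBag and a linear layer. The token ids, vocabulary size, and class count are made-up assumptions for illustration.

```python
import torch
import torch.nn as nn

class BowTextClassifier(nn.Module):
    """Toy text classifier: averaged word embeddings followed by a linear layer."""
    def __init__(self, vocab_size=1000, embed_dim=64, num_classes=2):
        super().__init__()
        self.embedding = nn.EmbeddingBag(vocab_size, embed_dim)  # mean of token embeddings
        self.fc = nn.Linear(embed_dim, num_classes)

    def forward(self, token_ids, offsets):
        return self.fc(self.embedding(token_ids, offsets))

model = BowTextClassifier()

# Two "documents" packed into one flat tensor of made-up token ids.
token_ids = torch.tensor([4, 8, 15, 16, 23, 42], dtype=torch.long)
offsets = torch.tensor([0, 3], dtype=torch.long)  # where each document starts
logits = model(token_ids, offsets)
print(logits.shape)  # torch.Size([2, 2]): two documents, two classes
```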
TensorFlow                              
  • A collection of workflows to develop and train models using Python or JavaScript, etc.
  • The tf.data API is for building complex input pipelines from simple, reusable pieces (a small sketch follows).
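The sketch below shows that idea with made-up sentences and labels: tf.data slices the data, then shuffles, batches, and prefetches it before it is fed to a model.

```python
import tensorflow as tf

sentences = ["great tutorial", "hard to follow", "very helpful", "too short"]
labels = [1, 0, 1, 0]  # made-up sentiment labels

# Compose a simple input pipeline from reusable pieces: slice -> shuffle -> batch -> prefetch.
dataset = (
    tf.data.Dataset.from_tensor_slices((sentences, labels))
    .shuffle(buffer_size=4)
    .batch(2)
    .prefetch(tf.data.AUTOTUNE)
)

for text_batch, label_batch in dataset:
    print(text_batch.numpy(), label_batch.numpy())
```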
Keras                               
  • It offers simpler APIs that minimize the number of user actions required for common use cases.
  • Tutorials
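To illustrate the "fewer user actions" point, here is a hedged sketch of a tiny Keras text classifier assembled, compiled, and trained in a handful of calls; the toy texts, vocabulary size, and layer sizes are assumptions for illustration only.

```python
import tensorflow as tf
from tensorflow import keras

# Toy data; a real project would adapt the vectorizer on its own corpus.
texts = tf.constant(["great tutorial", "hard to follow", "very helpful", "too short"])
labels = tf.constant([1, 0, 1, 0])

vectorizer = keras.layers.TextVectorization(max_tokens=1000, output_sequence_length=8)
vectorizer.adapt(texts)

model = keras.Sequential([
    vectorizer,                                   # raw strings -> integer token ids
    keras.layers.Embedding(1000, 16),             # token ids -> dense vectors
    keras.layers.GlobalAveragePooling1D(),        # average over the sequence
    keras.layers.Dense(1, activation="sigmoid"),  # binary sentiment score
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(texts, labels, epochs=2, verbose=0)
print(model.predict(tf.constant(["very helpful tutorial"]), verbose=0))
```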

 

Orange
  • User-friendly drag-and-drop interface for building text mining workflows without writing code
  • Official Video Tutorial (Tasks: Word embeddings, sentiment analysis, text classification, text preprocessing, clustering)
 

 

Datasets/Corpora

Datasets: 

Did you know UF HiPerGator stores NLP tools/datasets which you can use right away? Learn more: https://help.rc.ufl.edu/doc/NLP

The Pile Corpus from eleuther.ai (/data/ai/text/data/thepile) for language modeling (paper)

AI-news collection (/data/ai/text/data/news) from the Bing, DuckDuckGo, and Google search engines (this set is updated weekly; contact Eric at UFRC, ericeric@ufl.edu, for more information)

UF Baldwin Library digital collections (/data/ai/text/data/baldwin_library)


Fact-checking datasets: the open-source FEVER (Fact Extraction and VERification) dataset contains claims labeled as "Supported", "Refuted", or "NotEnoughInfo". Three releases are available (checked June 2023), from 2018, 2019, and 2021.

Common-sense Q&A: the NLP-Progress repository tracks progress in Natural Language Processing (NLP), including datasets and the current state of the art for the most common NLP tasks.

Kaggle NLP Datasets: Explore, analyze, and share quality data. 

The open parallel corpus (OPUS)
  • A growing collection of translated texts from the web. 
  • OPUS is based on open source products and the corpus is also delivered as an open content package.
  • All pre-processing is done automatically. No manual corrections have been carried out.
 
Penn Treebank (PTB)
  • The project selected 2,499 stories from a three-year Wall Street Journal (WSJ) collection of 98,732 stories for syntactic annotation. 
  • Suitable for tasks such as POS tagging and syntactic parsing.
 
CoNLL2003 
  • It concentrates on four types of named entities: persons, locations, organizations, and names of miscellaneous entities that do not belong to the previous three groups.
 
The Stanford Sentiment Treebank (SST)
  • 9,645 sentences with sentiments rated on a scale from 1 to 25, where 1 is the most negative and 25 the most positive (a loading sketch follows).
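Several of the corpora above are also mirrored on the Hugging Face Hub, so, as a hedged sketch, they can be pulled with the datasets library. Dataset identifiers on the Hub change occasionally, so treat the names below as assumptions to verify; note that GLUE's SST-2 is the binarized version of the Stanford Sentiment Treebank.

```python
# Assumes: pip install datasets
from datasets import load_dataset

# Binary Stanford Sentiment Treebank (SST-2) via the GLUE benchmark collection.
sst2 = load_dataset("glue", "sst2")
print(sst2["train"][0])  # {'sentence': ..., 'label': 0 or 1, 'idx': ...}

# CoNLL-2003 named entities (persons, locations, organizations, miscellaneous).
conll = load_dataset("conll2003")
example = conll["train"][0]
print(example["tokens"][:5], example["ner_tags"][:5])
```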

NLP News and other resources

    NLP Podcast 
Sebastian Ruder's Blog 
Snyk.io gives a health score to a package based on its popularity, maintenance, security, and community.
