
NeMo Natural Language Processing models

Description: NeMo Natural Language Processing models contain models for punctuation and capitalization, named entity recognition, and text classification, as well as base pretrained models.
Publisher: NVIDIA
Latest Version: 1.0.0a5
Modified: April 4, 2023
Size: 4.43 GB

Overview

The NVIDIA NeMo toolkit supports multiple Natural Language Processing (NLP) tasks, from text classification and language modelling all the way to GLUE benchmarking. The NLP field has experienced a huge leap in recent years thanks to transfer learning enabled by pretrained language models. BERT, RoBERTa, Megatron-LM, and many other language models achieve state-of-the-art results on NLP tasks such as question answering, sentiment analysis, and named entity recognition.

In NeMo, most NLP models consist of a pretrained language model followed by a Token Classification layer, a Sequence Classification layer, or a combination of both. By changing the language model, you can improve the performance of your final model on the specific downstream task you are solving. With NeMo, you can either pretrain a BERT model on your own data or use a pretrained language model from the HuggingFace Transformers or Megatron-LM libraries.

All NLP models require text tokenization as a data preprocessing step. The available tokenizers can be found in nemo.collections.common.tokenizers and include the WordPiece tokenizer, the SentencePiece tokenizer, and simpler tokenizers such as the word tokenizer.

Language Modelling - Assigns a probability distribution over a sequence of words. Models can be either generative, e.g. a left-to-right Transformer, or BERT-style trained with a masked language model loss.
Text Classification - Classifies an entire text into predefined categories such as news, finance, or science. These models are BERT-based and can be used for applications such as sentiment analysis and relationship extraction.
Token Classification - Classifies each input token separately. Models are based on BERT. Applications include named entity recognition, punctuation and capitalization, and more.
Intent Slot Classification - Used for joint recognition of intents and slots (entities) when building conversational assistants.
Question Answering - Currently only SQuAD is supported. The model takes a question and a passage as input and predicts a span in the passage from which the answer is extracted.
GLUE Benchmark - A benchmark of nine sentence- or sentence-pair language understanding tasks.

Use the Jupyter notebooks to quickly get started with the pretrained checkpoints or with pretraining BERT.

Usage

You can instantiate all of these models directly from NGC. To do so, start your script with:

import nemo
import nemo.collections.nlp as nemo_nlp

Then choose the type of model you would like to instantiate; see the table below for the list of model base classes. Next, use the base_class.from_pretrained(...) method. For example:

pretrained_ner_model = nemo_nlp.models.TokenClassificationModel.from_pretrained(model_name="NERModel")
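Once instantiated, the checkpoint can be used for inference. The following is a minimal sketch that assumes a NeMo 1.x release, where TokenClassificationModel exposes an add_predictions helper; the exact inference API may differ between versions.

# Annotate a few example sentences with the predicted entity tags
# (add_predictions is assumed from NeMo 1.x; the helper may differ in other releases)
queries = ["We are visiting Paris next week.", "NVIDIA is headquartered in Santa Clara."]
for annotated in pretrained_ner_model.add_predictions(queries):
    print(annotated)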

Note that you can also list all available models through the API by calling the base_class.list_available_models(...) method.
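As a minimal sketch (assuming a NeMo 1.x release, where list_available_models() returns PretrainedModelInfo entries), you can print every registered checkpoint for a base class like this:

# Print the name of every pretrained checkpoint registered for this base class
for model_info in nemo_nlp.models.TokenClassificationModel.list_available_models():
    print(model_info.pretrained_model_name)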

You can also download the models' ".nemo" files from the "File Browser" tab and then instantiate them with the base_class.restore_from(PATH_TO_DOTNEMO_FILE) method. In this case, make sure the NeMo version you are running matches the version the model was saved with.
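For instance, a short sketch of restoring the NER checkpoint from a locally downloaded file (the path below is a placeholder for wherever you saved the .nemo file):

# Restore a model from a local .nemo checkpoint (the path is a placeholder)
pretrained_ner_model = nemo_nlp.models.TokenClassificationModel.restore_from("path/to/NERModel.nemo")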

Here is a list of currently available models together with their base classes and short descriptions.

| Model name | Model Base Class | Description |
| --- | --- | --- |
| NERModel | TokenClassificationModel | Trained on the GMB (Groningen Meaning Bank) corpus for entity recognition; achieves a 74.61 macro F1 score. |
| Punctuation_Capitalization_with_BERT | TokenClassificationModel | Trained with the NeMo BERT base uncased checkpoint on a subset of data from the following sources: Tatoeba sentences, books from Project Gutenberg, and Fisher transcripts. |
| Punctuation_Capitalization_with_DistilBERT | TokenClassificationModel | Trained with the DistilBERT base uncased checkpoint from HuggingFace on a subset of data from the following sources: Tatoeba sentences, books from Project Gutenberg, and Fisher transcripts. |
| BERTBaseUncasedSQuADv1.1 | QAModel | Question answering model finetuned from NeMo BERT Base Uncased on the SQuAD v1.1 dataset; obtains an exact match (EM) score of 82.43% and an F1 score of 89.59%. |
| BERTBaseUncasedSQuADv2.0 | QAModel | Question answering model finetuned from NeMo BERT Base Uncased on the SQuAD v2.0 dataset; obtains an exact match (EM) score of 73.35% and an F1 score of 76.44%. |
| BERTLargeUncasedSQuADv1.1 | QAModel | Question answering model finetuned from NeMo BERT Large Uncased on the SQuAD v1.1 dataset; obtains an exact match (EM) score of 85.47% and an F1 score of 92.10%. |
| BERTLargeUncasedSQuADv2.0 | QAModel | Question answering model finetuned from NeMo BERT Large Uncased on the SQuAD v2.0 dataset; obtains an exact match (EM) score of 78.8% and an F1 score of 81.85%. |
| Joint_Intent_Slot_Assistant | IntentSlotClassificationModel | Trained on the https://github.com/xliuhw/NLU-Evaluation-Data dataset, which includes 64 intents and 55 slots. Final intent accuracy is about 87%; slot accuracy is about 89%. |
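Following the usage pattern above, any model in the table can be loaded by passing its name to the from_pretrained method of its base class, for example:

# Load one of the question answering checkpoints listed in the table
qa_model = nemo_nlp.models.QAModel.from_pretrained(model_name="BERTBaseUncasedSQuADv1.1")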