
English Tagger-based Inverse Text Normalization

Description
English single-pass tagger-based model for inverse text normalization based on bert-base-uncased, trained on 2 million sentences from the Google Text Normalization Dataset. It achieves 3.75% WER on the Google default test set.
Publisher
NVIDIA
Latest Version
1.9.0
Modified
April 4, 2023
Size
424.78 MB

Model Overview

This model is a single-pass tagger-based model for inverse text normalization based on bert-base-uncased, trained on 2 million sentences from the Google Text Normalization Dataset.

It converts text from the spoken domain into its written form:

Input: "on may third we paid one hundred and twenty three dollars"
Output: "on may 3 we paid $123"

Model Architecture

Thutmose Tagger is a single-pass tagging model. It uses a BERT backbone encoder (bert-base-uncased) followed by two classification heads: one is trained to predict written fragments as replacement tags, the other to predict tags representing semiotic classes such as DATE, CARDINAL, etc. The final tags have a one-to-one correspondence with the input words.
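
The tag-to-output mapping can be illustrated with a minimal sketch (the tag names and fragments below are hypothetical, not the model's actual label map; the real model also merges and reorders fragments during post-processing):

# Minimal sketch of one-to-one tag application (hypothetical tag names,
# not the model's actual label map or post-processing).
def apply_tags(words, tags):
    out = []
    for word, tag in zip(words, tags):
        if tag == "<SELF>":        # keep the spoken word unchanged
            out.append(word)
        elif tag == "<DELETE>":    # drop the word entirely
            continue
        else:                      # replace the word with a written fragment
            out.append(tag)
    return " ".join(out)

words = "on may third we paid twenty dollars".split()
tags = ["<SELF>", "<SELF>", "3", "<SELF>", "<SELF>", "$20", "<DELETE>"]
print(apply_tags(words, tags))  # on may 3 we paid $20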

Training

The NeMo toolkit [1] was used to train the model. The training corpus consists of 2 million sentences.

This model is trained with the NeMo example script and base config:

python [NEMO_GIT_FOLDER]/examples/nlp/text_normalization_as_tagging/normalization_as_tagging_train.py \
  lang=en \
  data.validation_ds.data_path=${DATA_PATH}/valid.tsv \
  data.train_ds.data_path=${DATA_PATH}/train.tsv \
  data.train_ds.batch_size=128 \
  model.language_model.pretrained_model_name=bert-base-uncased \
  model.label_map=${DATA_PATH}/label_map.txt \
  model.semiotic_classes=${DATA_PATH}/semiotic_classes.txt \
  trainer.max_epochs=5

Datasets

The initial dataset is the Google Text Normalization Dataset. The dataset preparation is described in the example script and includes running the GIZA++ automatic alignment tool to find granular alignments between spoken words and written fragments.
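
As a rough illustration, such a granular alignment pairs each spoken word with the written fragment it produces (the layout below is a sketch, not the exact TSV format generated by the preparation scripts):

# Illustrative spoken-word / written-fragment alignment (hypothetical layout,
# not the exact TSV format produced by the dataset preparation scripts).
alignment = [
    ("on", "on"),
    ("may", "may"),
    ("third", "3"),
    ("we", "we"),
    ("paid", "paid"),
    ("one", "1"),
    ("hundred", ""),   # absorbed into the preceding fragment
    ("and", ""),
    ("twenty", "2"),
    ("three", "3"),
    ("dollars", "$"),  # moved in front of the number during post-processing
]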

Performance

The performance of ITN models can be measured using Word Error Rate (WER) and Sentence Accuracy. We measure Sentence Accuracy with respect to a multi-variant reference and subdivide the errors into "digit" and "other".
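
As a reference point, here is a minimal sketch of how these two metrics can be computed: standard word-level edit distance for WER, and a sentence counted as correct if its output matches any acceptable reference variant. This is not the exact NeMo evaluation code.

# Minimal sketch of the evaluation metrics, not the exact NeMo evaluation code.
def wer(ref_words, hyp_words):
    # Word-level Levenshtein distance divided by the reference length.
    d = [[0] * (len(hyp_words) + 1) for _ in range(len(ref_words) + 1)]
    for i in range(len(ref_words) + 1):
        d[i][0] = i
    for j in range(len(hyp_words) + 1):
        d[0][j] = j
    for i in range(1, len(ref_words) + 1):
        for j in range(1, len(hyp_words) + 1):
            cost = 0 if ref_words[i - 1] == hyp_words[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[-1][-1] / max(len(ref_words), 1)

def sentence_accuracy(multi_variant_refs, hypotheses):
    # A hypothesis is correct if it matches any acceptable reference variant.
    correct = sum(hyp in variants for variants, hyp in zip(multi_variant_refs, hypotheses))
    return correct / len(hypotheses)

print(wer("on may 3 we paid $123".split(), "on may 3rd we paid $123".split()))
print(sentence_accuracy([["on may 3 we paid $123", "on may 3rd we paid $123"]],
                        ["on may 3rd we paid $123"]))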

The model obtains the following scores on the evaluation datasets:

Default test set
    WER:  3.75%
    Sentence accuracy:  97.47%
        digit errors:    0.35%
        other errors:    2.18%


Hard test set
    WER:  8.98%
    Sentence accuracy:  87.88%
        digit errors:    3.12%
        other errors:    9.00%

Note that the reference files were taken from the test part of the Google TN Dataset, which is not considered 100% correct because of its synthetic nature. These scores are therefore not fully indicative of the final inverse text normalization quality, but they serve as a useful proxy.

How to Use this Model

The model is available for use in the NeMo toolkit [1], and can be used as a pre-trained checkpoint for inference.

Automatically load the model from NGC

import nemo.collections.nlp as nemo_nlp
nlp_model = nemo_nlp.models.ThutmoseTaggerModel.from_pretrained(model_name="itn_en_thutmose_bert")

Running inference with this model

An example of inference and evaluation is provided in this script: https://github.com/NVIDIA/NeMo/blob/main/examples/nlp/text_normalization_as_tagging/run_infer.sh

python [NEMO_GIT_FOLDER]/examples/nlp/text_normalization_as_tagging/normalization_as_tagging_infer.py \
  pretrained_model="itn_en_thutmose_bert" \
  inference.from_file="<INPUT_FILE>" \
  inference.out_file="<OUTPUT_FILE>" \
  model.max_sequence_len=1024 \
  inference.batch_size=128

Input

This model expects the input file to be plain text without punctuation, similar to ASR text output.
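
For example, an input file might contain lines like the following (purely illustrative; lowercase, unpunctuated, one utterance per line):

on may third we paid one hundred and twenty three dollars
the train leaves at six forty five in the evening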

Output

This model produces an output file with exactly the same number of lines as the input. Each line consists of 5 columns (a minimal parsing sketch follows the list):

  1. Final output text.
  2. Input text.
  3. Sequence of predicted tags.
  4. Sequence of tags after post-processing (some swaps may be applied).
  5. Sequence of predicted semiotic classes - one class for each input word.
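
A minimal sketch for reading such an output file (assuming the five columns are tab-separated; verify the delimiter against your NeMo version):

# Minimal sketch for reading the inference output file.
# Assumes tab-separated columns; verify against your NeMo version.
with open("<OUTPUT_FILE>", encoding="utf-8") as f:
    for line in f:
        final_text, input_text, tags, tags_post, semiotic = line.rstrip("\n").split("\t")[:5]
        print(final_text)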

Limitations

Since this model was trained on synthetic data, its performance might degrade on constructions that were systematically biased in the initial corpus.

References

[1] NVIDIA NeMo Toolkit

License

License to use this model is covered by the NGC TERMS OF USE unless another License/Terms Of Use/EULA is clearly specified. By downloading the public and release version of the model, you accept the terms and conditions of the NGC TERMS OF USE.