
Speech to Text English Jasper

Description: Speech to Text Jasper model for English.
Publisher: NVIDIA
Latest Version: deployable_v1.2
Modified: April 4, 2023
Size: 1.15 GB

Speech Recognition: Jasper Model Card
=====================================

Model Overview
--------------

Jasper models are end-to-end neural automatic speech recognition (ASR) models that transcribe segments of audio to text.

Model Architecture
------------------

These models are based on the Jasper architecture. They are acoustic, end-to-end neural speech recognition models trained with CTC loss.

Jasper models take in audio segments and transcribe them to letter, byte pair, or word piece sequences. The pretrained models here can be used immediately for fine-tuning or dataset evaluation.

Intended Use
------------

The primary use case intended for these models is automatic speech recognition.

Input: Single-channel audio files (WAV) with a 16kHz sample rate

Output: Transcripts, which are sequences of valid vocabulary labels as given by the specification file

How to Use This Model
---------------------

Jasper is an end-to-end architecture that is trained using CTC loss. These model checkpoints are intended to be used with the Transfer Learning Toolkit (TLT). To use these checkpoints, you need a specification file (.yaml) that defines the hyperparameters, the datasets for training and evaluation, and any other information needed for the experiment. For more information on the experiment spec files for each use case, refer to the Transfer Learning Toolkit User Guide.

The model is encrypted and will only operate with the model encryption key tlt_encode.

To fine-tune from a model checkpoint (.tlt), use the following command:

!tlt speech-to-text finetune -e <experiment_spec> \
                             -m <model_checkpoint> \
                             -g <num_gpus>

Here, the <experiment_spec> parameter should be a valid path to the file that specifies the fine-tuning hyperparameters, the dataset to fine-tune on, the dataset to evaluate on, and whether a change of vocabulary from the default (lowercase English letters, space, and apostrophe) is needed.
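
For illustration only, a minimal fine-tuning spec might look like the sketch below. The key names and paths shown here are assumptions for the purpose of the example, not the authoritative schema; refer to the Transfer Learning Toolkit User Guide for the actual spec file structure for each task.

# finetune.yaml -- illustrative sketch only; the real key names are defined
# in the Transfer Learning Toolkit User Guide.
trainer:
  max_epochs: 10          # fine-tuning hyperparameter (assumed key)
finetuning_ds:
  manifest_filepath: /data/train_manifest.json   # dataset to fine-tune on (assumed key/path)
validation_ds:
  manifest_filepath: /data/val_manifest.json     # dataset to evaluate on (assumed key/path)
change_vocabulary: false  # keep the default vocabulary (lowercase letters, space, apostrophe)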

To evaluate an existing dataset using a model checkpoint (.tlt), use the following command:

!tlt speech-to-text evaluate -e <experiment_spec> \
                             -m <model_checkpoint> \
                             -g <num_gpus>

The <experiment_spec> parameter should be a valid path to the file that specifies the dataset that is being evaluated.
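
As a concrete usage example, an evaluation run might look like the following. The file names are placeholders, and the -k flag for supplying the tlt_encode encryption key is an assumption based on common TLT command-line conventions; check the Transfer Learning Toolkit User Guide for the exact way to pass the key.

!tlt speech-to-text evaluate -e evaluate.yaml \
                             -m speechtotext_english_jasper.tlt \
                             -g 1 \
                             -k tlt_encode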

Training Information
--------------------

This Jasper model was trained on a combination of seven datasets of English speech, totaling 7,133 hours of audio. Samples were limited to a minimum duration of 0.1s and a maximum duration of 16.7s. The model was trained for 600 epochs with Apex/Amp optimization level O1.

The model achieves a Word Error Rate (WER) of 3.74% on LibriSpeech dev-clean, and a WER of 10.21% on LibriSpeech dev-other.

The model has been fine-tuned with Room Impulse Response (RIR) and noise augmentation to make it more robust to noise.

The datasets included in training are detailed in the table below.

| Dataset                        | Duration (h) |
| ------------------------------ | ------------ |
| LibriSpeech                    | 961          |
| Wall Street Journal            | 81           |
| Fisher English Training Speech | 1,906        |
| Switchboard                    | 316          |
| Mozilla Common Voice*          | 1,039        |
| Appen English Speech           | 972          |
| NSC Singapore English (Part 1) | 1,857        |

* Only non-dev and non-test validated clips from Mozilla Common Voice version en_1488h_2019-12-10.

Limitations
-----------

Currently, TLT Jasper models only support training and inference on .wav audio files. All models included here were trained and evaluated on audio files with a sample rate of 16kHz, so for best performance you may need to upsample or downsample audio files to 16kHz.
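
For example, an arbitrary recording can be converted to a single-channel 16kHz WAV with a standard tool such as ffmpeg (the file names below are placeholders):

# Convert to mono, 16kHz, 16-bit PCM WAV
ffmpeg -i input_audio.mp3 -ac 1 -ar 16000 -c:a pcm_s16le output_audio.wav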

In addition, the model performs best on audio samples longer than 0.1 seconds. For training and fine-tuning Jasper models, it is recommended that samples be capped at a maximum length of around 15 seconds, depending on the amount of memory available to you. You do not need to impose a maximum length limitation for evaluation.
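
If your training data is listed in a NeMo-style JSON-lines manifest with a duration field for each sample (an assumption about your setup; adapt this to your actual data format), over-long clips can be filtered out with jq, for example:

# Keep only samples between 0.1 and 15 seconds
# (assumes one JSON object per line with a "duration" field)
jq -c 'select(.duration >= 0.1 and .duration <= 15.0)' train_manifest.json > train_manifest_filtered.json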

License
-------

By downloading and using the models and resources packaged with TLT Conversational AI, you accept the terms of the Jarvis license.

Suggested Reading
-----------------

Ethical AI
----------

NVIDIA’s platforms and application frameworks enable developers to build a wide array of AI applications. Consider potential algorithmic bias when choosing or creating the models being deployed. Work with the model’s developer to ensure that it meets the requirements for the relevant industry and use case; that the necessary instruction and documentation are provided to understand error rates, confidence intervals, and results; and that the model is being used under the conditions and in the manner intended.