SLU Conformer-Transformer-Large SLURP

Description

This model performs joint intent classification and slot filling, directly from audio input. The model treats the problem as an audio-to-text problem, where the output text is the flattened string representation of the semantics annotation.
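
As a rough illustration of the "flattened string" idea, the sketch below serializes the semantics of one utterance into a single target string. The field names (scenario, action, entities, type, filler) and the Python-literal layout are assumptions made for this example; the exact format is defined by the NeMo data preparation scripts and is not specified on this page.

```python
def flatten_semantics(scenario, action, entities):
    """Serialize intent and slots into one target string (assumed layout)."""
    semantics = {
        "scenario": scenario,
        "action": action,
        # each entity is a (slot type, slot value) pair
        "entities": [{"type": t, "filler": f} for t, f in entities],
    }
    # str() of a dict yields a single-line text form that a decoder can emit
    return str(semantics)


target = flatten_semantics("alarm", "set", [("time", "six am")])
print(target)
# {'scenario': 'alarm', 'action': 'set', 'entities': [{'type': 'time', 'filler': 'six am'}]}
```

With a target of this kind, intent classification and slot filling reduce to generating one text sequence per audio clip.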

Publisher

NVIDIA

Latest Version

1.13.0

Modified

April 4, 2023

Size

489.12 MB

Model Overview

Model Architecture

The model has an encoder-decoder architecture, where the encoder is a Conformer-Large model [2] and the decoder is a three-layer Transformer decoder [3]. We use a Conformer encoder pretrained on NeMo ASR-Set (details here), while the decoder is trained from scratch. Start-of-sentence (BOS) and end-of-sentence (EOS) tokens are added to each sentence. The model is trained end-to-end by minimizing the negative log-likelihood loss with teacher forcing. During inference, the prediction is generated by beam search, where a BOS token triggers the generation process.
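
The inference procedure described above can be sketched with a toy beam search. This is a minimal illustration, not the NeMo implementation: toy_log_probs is a hypothetical stand-in for the Transformer decoder, the token names are invented, and no length normalization or temperature is applied.

```python
import math

BOS, EOS = "<s>", "</s>"


def toy_log_probs(prefix):
    """Hypothetical stand-in for the decoder: next-token log-probabilities."""
    table = {
        (BOS,): {"a": 0.6, "b": 0.4},
        (BOS, "a"): {"a": 0.1, "b": 0.2, EOS: 0.7},
        (BOS, "b"): {"a": 0.3, EOS: 0.7},
    }
    # any prefix not listed above is forced to end
    probs = table.get(tuple(prefix), {EOS: 1.0})
    return {tok: math.log(p) for tok, p in probs.items()}


def beam_search(step_fn, beam_size=2, max_len=5):
    """Keep the beam_size best partial hypotheses at every decoding step."""
    beams = [([BOS], 0.0)]  # generation starts from the BOS token
    finished = []
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            for tok, lp in step_fn(tokens).items():
                candidates.append((tokens + [tok], score + lp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates[:beam_size]:
            # hypotheses ending in EOS are complete; the rest keep growing
            (finished if tokens[-1] == EOS else beams).append((tokens, score))
        if not beams:
            break
    finished.extend(beams)  # hypotheses cut off at max_len
    return max(finished, key=lambda c: c[1])


best_tokens, best_score = beam_search(toy_log_probs)
print(best_tokens)
```

In the real model, the step function would run the decoder over the current prefix and the Conformer encoder output, and the resulting token sequence would be detokenized into the semantics string.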

Training

The NeMo toolkit [4] was used to train the models for around 100 epochs. These models are trained with this example script and this base config.

The tokenizers for these models were built using the semantics annotations of the train set with this script. We use a vocabulary size of 58, including the BOS, EOS and padding tokens.

Datasets

The model is trained on the combined real and synthetic training sets of the SLURP dataset.

Performance

Version	Model	Params (M)	Pretrained	Intent (Scenario_Action) Accuracy	Entity Precision	Entity Recall	Entity F1	SLURP Precision	SLURP Recall	SLURP F1
1.13.0	Conformer-Transformer-Large	127	NeMo ASR-Set 3.0	90.14	78.95	74.93	76.89	84.31	80.33	82.27
Baseline	Conformer-Transformer-Large	127	None	72.56	43.19	43.50	43.34	53.59	53.92	53.76
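
For readers checking the table, F1 is the harmonic mean of precision and recall. The short sketch below recomputes F1 for the 1.13.0 row from its precision and recall columns; the function name is chosen for this example, and small differences from reported values can arise because the table entries are already rounded.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (same units as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


entity_f1 = f1_score(78.95, 74.93)  # Entity columns, version 1.13.0
slurp_f1 = f1_score(84.31, 80.33)   # SLURP Metrics columns, version 1.13.0
print(round(entity_f1, 2), round(slurp_f1, 2))
```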

Note: During inference, we use a beam size of 32 and a temperature of 1.25.
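
To give a feel for what a temperature above 1 does, the sketch below applies a softmax with temperature to a few made-up scores. It assumes the common convention of dividing the logits by the temperature before normalizing; this page does not state how NeMo applies the setting internally, and the logits are hypothetical.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Convert logits to probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # hypothetical decoder scores for three tokens
sharp = softmax_with_temperature(logits, temperature=1.0)
smooth = softmax_with_temperature(logits, temperature=1.25)
# The top token keeps its rank but receives less probability mass at T = 1.25
print(sharp[0] > smooth[0])
```

Under this convention, a flatter distribution makes the score gap between competing hypotheses smaller, which can let a wide beam retain more diverse candidates.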

How to Use this Model

The model is available for use in the NeMo toolkit [4] and can be used on another dataset with the same annotation format.

Automatically load the model from NGC

import nemo.collections.asr as nemo_asr
asr_model = nemo_asr.models.SLUIntentSlotBPEModel.from_pretrained(model_name="slu_conformer_transformer_large_slurp")

Predict intents and slots with this model

python [NEMO_GIT_FOLDER]/examples/slu/speech_intent_slot/eval_utils/inference.py \
 pretrained_name="slu_conformer_transformer_large_slurp" \
 audio_dir="<DIRECTORY CONTAINING AUDIO FILES>" \
 sequence_generator.type="<'beam' OR 'greedy' FOR BEAM/GREEDY SEARCH>" \
 sequence_generator.beam_size="<SIZE OF BEAM>" \
 sequence_generator.temperature="<TEMPERATURE FOR BEAM SEARCH>"

Input

This model accepts 16 kHz mono-channel audio (WAV files) as input.
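
Before running inference, it can help to confirm that a file matches this input format. The following is a minimal sketch using only the Python standard library; the helper name is_valid_input and the demo file are illustrative, and the wave module reads only uncompressed PCM WAV files.

```python
import os
import tempfile
import wave


def is_valid_input(path, expected_rate=16000, expected_channels=1):
    """Return True if the WAV file is mono and sampled at 16 kHz."""
    with wave.open(path, "rb") as wav:
        return (wav.getframerate() == expected_rate
                and wav.getnchannels() == expected_channels)


# Demo: write one second of 16 kHz mono silence, then check it
demo_path = os.path.join(tempfile.mkdtemp(), "silence.wav")
with wave.open(demo_path, "wb") as wav:
    wav.setnchannels(1)
    wav.setsampwidth(2)  # 16-bit PCM
    wav.setframerate(16000)
    wav.writeframes(b"\x00\x00" * 16000)

print(is_valid_input(demo_path))  # True
```

Files that fail the check would need to be resampled or downmixed with an audio tool of your choice before being passed to the model.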

Output

This model provides the intent and slot annotations as a string for a given audio sample.
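
Downstream code usually needs the intent and slots as structured values rather than a raw string. The sketch below shows one way to parse the output, assuming it is a Python-literal dictionary with scenario, action, and entities fields; that layout and the helper name parse_prediction are assumptions for illustration. Because a generated string can be malformed, the parser returns None instead of raising.

```python
import ast


def parse_prediction(text):
    """Parse a predicted semantics string; return None if it is malformed."""
    try:
        parsed = ast.literal_eval(text)  # safe evaluation of literals only
    except (ValueError, SyntaxError):
        return None
    if not isinstance(parsed, dict):
        return None
    # the intent is reported as Scenario_Action
    intent = f"{parsed.get('scenario')}_{parsed.get('action')}"
    slots = [
        (e.get("type"), e.get("filler"))
        for e in parsed.get("entities", [])
        if isinstance(e, dict)
    ]
    return intent, slots


prediction = "{'scenario': 'alarm', 'action': 'set', 'entities': [{'type': 'time', 'filler': 'six am'}]}"
print(parse_prediction(prediction))
# ('alarm_set', [('time', 'six am')])
```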

Limitations

Since this model was trained on only the SLURP dataset [1], the performance of this model might degrade on other datasets.

References

[1] SLURP: A Spoken Language Understanding Resource Package

[2] Conformer: Convolution-augmented Transformer for Speech Recognition

[3] Attention Is All You Need

[4] NVIDIA NeMo Toolkit

License

License to use this model is covered by the NGC TERMS OF USE unless another License/Terms Of Use/EULA is clearly specified. By downloading the public and release version of the model, you accept the terms and conditions of the NGC TERMS OF USE.