
SSL En Conformer Large

Description
Self-Supervised Learning (SSL) checkpoints for the Conformer Large model. These are similar to the w2v-Conformer model and can be fine-tuned for Automatic Speech Recognition (ASR).
Publisher
NVIDIA
Latest Version
1.10.1
Modified
April 4, 2023
Size
616.77 MB

Model Overview

This collection contains Self-Supervised Learning (SSL) checkpoints for the large-size version of the Conformer model (around 120M parameters). The models are trained on unlabeled English audio with a contrastive loss. They are similar to w2v-Conformer [3, 4] and can be fine-tuned for Automatic Speech Recognition (ASR).

Model Architecture

For details about the Conformer architecture, refer to [1].

Training

The NeMo toolkit [2] was used for training the models. These models are trained with this example script and this base config.
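
As a rough illustration (the config path, trainer settings, manifest paths, and hyperparameter values below are assumptions rather than the contents of the referenced script), pre-training can be launched from Python roughly as follows:

import pytorch_lightning as pl
import nemo.collections.asr as nemo_asr
from omegaconf import OmegaConf

# load a Conformer SSL base config (path is an assumption) and point it at your manifests
cfg = OmegaConf.load('conformer_ssl.yaml')
cfg.model.train_ds.manifest_filepath = '/data/unlabeled_train_manifest.json'
cfg.model.validation_ds.manifest_filepath = '/data/unlabeled_val_manifest.json'

# build the trainer and the SSL model, then pre-train with the contrastive objective
trainer = pl.Trainer(devices=1, accelerator='gpu', max_steps=100000)
ssl_model = nemo_asr.models.ssl_models.SpeechEncDecSelfSupervisedModel(cfg=cfg.model, trainer=trainer)
trainer.fit(ssl_model)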

Datasets

All the models in this collection are trained using the LibriLight corpus (~56k hours of unlabeled English speech).

How to Use this Model

The pre-trained checkpoint is available in the NeMo toolkit [2] and has to be fine-tuned on a labeled dataset for ASR.

To load the checkpoint from NGC:

import nemo.collections.asr as nemo_asr
ssl_model = nemo_asr.models.ssl_models.SpeechEncDecSelfSupervisedModel.from_pretrained(model_name='ssl_en_conformer_large')

To continue SSL training on your own dataset, set init_from_pretrained_model and optim appropriately in the config and use the same training script.
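
For example, assuming the same base config as above (the exact field names and values here are illustrative assumptions), the overrides might look like:

from omegaconf import OmegaConf

cfg = OmegaConf.load('conformer_ssl.yaml')  # base SSL config (path is an assumption)

# warm-start continued SSL training from the NGC checkpoint
cfg.init_from_pretrained_model = 'ssl_en_conformer_large'

# adjust the optimizer for continued training on your own data (illustrative values)
cfg.model.optim.lr = 1e-4
cfg.model.optim.sched.warmup_steps = 5000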

Fine-tune

To fine-tune using a labeled dataset, refer to this example script for transducer loss and to this example script for CTC loss.

Briefly, you can load the pre-trained checkpoint into the fine-tuning model as shown below:

# define the fine-tuning (ASR) model
asr_model = nemo_asr.models.EncDecRNNTBPEModel(cfg=cfg.model, trainer=trainer)

# load the SSL checkpoint weights; strict=False so that only the matching (encoder)
# weights are loaded, while the randomly initialized decoder/joint modules are left as-is
asr_model.load_state_dict(ssl_model.state_dict(), strict=False)

del ssl_model
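
For context, the cfg and trainer objects used above come from the fine-tuning config and PyTorch Lightning; a minimal sketch (the config path, manifest paths, and tokenizer directory are assumptions) might be:

import pytorch_lightning as pl
from omegaconf import OmegaConf

# load a Conformer-Transducer BPE fine-tuning config (path is an assumption)
cfg = OmegaConf.load('conformer_transducer_bpe.yaml')
cfg.model.train_ds.manifest_filepath = '/data/librispeech_train_manifest.json'
cfg.model.validation_ds.manifest_filepath = '/data/librispeech_dev_manifest.json'
cfg.model.tokenizer.dir = '/data/tokenizer_bpe_128'  # 128-token BPE vocabulary, as in the table below
cfg.model.tokenizer.type = 'bpe'

trainer = pl.Trainer(devices=1, accelerator='gpu', max_epochs=100)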

Performance

The list of available models in this collection is shown in the following table. The performance of ASR models fine-tuned from these checkpoints is reported in terms of Word Error Rate (WER%) with greedy decoding on the LibriSpeech (LS) dev and test sets.

| Version | SSL Loss | Fine-tune Dataset | Fine-tune Model | Vocabulary Size | LS dev-clean | LS dev-other | LS test-clean | LS test-other |
|---------|----------|-------------------|-----------------|-----------------|--------------|--------------|---------------|---------------|
| 1.10.0 | Contrastive | LS 100h | Conformer-Transducer | 128 | 3.25 | 6.64 | 3.19 | 6.39 |
| 1.10.0 | Contrastive | LS 960h | Conformer-Transducer | 128 | 1.92 | 4.38 | 2.02 | 4.50 |
| 1.10.1 | Contrastive+MLM | LS 100h | Conformer-Transducer | 128 | 3.04 | 6.08 | 3.17 | 6.17 |
| 1.10.1 | Contrastive+MLM | LS 960h | Conformer-Transducer | 128 | 1.91 | 4.33 | 1.96 | 4.20 |

Limitations

Since this model was trained on publicly available speech datasets, its performance might degrade for speech that includes technical terms, or vernacular that the model has not been trained on. The model might also perform worse for accented speech.

References

[1] Conformer: Convolution-augmented Transformer for Speech Recognition

[2] NVIDIA NeMo Toolkit

[3] Pushing the Limits of SSL for ASR

[4] W2V-BERT: Combining Contrastive Learning and Masked Language Modeling for Self-Supervised Speech Pre-training

License

License to use this model is covered by the NGC TERMS OF USE unless another License/Terms Of Use/EULA is clearly specified. By downloading the public and release version of the model, you accept the terms and conditions of the NGC TERMS OF USE.