VAD - MatchboxNet 3x1x1

Description: Checkpoint of MatchboxNet 3x1x1 trained on the Google Speech Commands v2 (speech) and Freesound (background) datasets.
Publisher: NVIDIA
Latest Version: 1
Modified: April 4, 2023
Size: 603.01 KB

Voice activity detection (VAD) is the task of distinguishing human speech segments from background noise in an audio stream.

VAD is an important pre-processing stage of an ASR system: it decides when to start recognition and when to close the microphone. VAD models need to be small and efficient so that they can be deployed on devices, and they must run with low latency.
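As an illustration of how per-segment VAD decisions can drive "start ASR" and "close the microphone" events, here is a minimal sketch. It is not the logic shipped with this checkpoint: the function name `find_speech_regions`, the 0.5 threshold, and the `hangover` parameter are assumptions chosen for the example, and the input is assumed to be a list of per-segment speech probabilities produced by some VAD model.

```python
from typing import List, Tuple


def find_speech_regions(
    probs: List[float],
    threshold: float = 0.5,  # assumed decision threshold, not from the model card
    hangover: int = 2,       # assumed number of non-speech segments tolerated
) -> List[Tuple[int, int]]:
    """Return (start, end) segment indices of speech regions, end exclusive.

    A region opens at the first segment whose probability reaches the
    threshold and closes once more than `hangover` consecutive segments
    fall below it.
    """
    regions: List[Tuple[int, int]] = []
    start = None   # index where the current region opened, if any
    silence = 0    # consecutive below-threshold segments inside a region
    for i, p in enumerate(probs):
        if p >= threshold:
            if start is None:
                start = i  # "start ASR" event
            silence = 0
        elif start is not None:
            silence += 1
            if silence > hangover:
                # "close microphone" event; end just after the last speech segment
                regions.append((start, i - silence + 1))
                start = None
                silence = 0
    if start is not None:
        # Stream ended while a region was open; drop trailing silence.
        regions.append((start, len(probs) - silence))
    return regions


if __name__ == "__main__":
    example = [0.1, 0.9, 0.8, 0.2, 0.9, 0.1, 0.1, 0.1, 0.95]
    print(find_speech_regions(example))
```

The hangover keeps short pauses inside an utterance from closing the microphone too early, at the cost of a slightly later end-of-speech decision.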

This VAD tutorial is based on the MatchboxNet model with a modified decoder head to suit classification tasks. MatchboxNet performs well at classifying each segment (utterance-level, in units of seconds) as speech or non-speech (99.78% F1 score).
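Segment-level classification presupposes that the audio stream has been cut into fixed-length pieces. The sketch below shows one common way to do that with overlapping windows. The helper name `split_into_segments` and the window and hop lengths are hypothetical values for illustration; the model card does not state the segment length used in training.

```python
from typing import List, Sequence


def split_into_segments(
    samples: Sequence[float],
    sample_rate: int,
    window_s: float = 0.5,  # assumed segment length in seconds
    hop_s: float = 0.25,    # assumed shift between consecutive segments
) -> List[Sequence[float]]:
    """Cut a 1-D sample sequence into overlapping fixed-length segments.

    Trailing samples that do not fill a whole window are dropped.
    """
    window = int(round(window_s * sample_rate))
    hop = int(round(hop_s * sample_rate))
    if window <= 0 or hop <= 0:
        raise ValueError("window_s and hop_s must give at least one sample")
    return [
        samples[start:start + window]
        for start in range(0, len(samples) - window + 1, hop)
    ]


if __name__ == "__main__":
    one_second = [0.0] * 16000  # one second of silence at 16 kHz
    segments = split_into_segments(one_second, sample_rate=16000)
    print(len(segments), len(segments[0]))
```

Each segment would then be passed to the classifier, which returns a speech or non-speech label for it.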

A Jupyter Notebook containing all the steps to download the dataset, train a model and evaluate its results is available at: VAD Using Nemo

Model Results

Accuracy: 0.9971

F1 score: 0.9975
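For reference, the sketch below shows how accuracy and F1 are conventionally computed for a binary speech / non-speech task. It assumes speech is the positive class (label 1) and that F1 is the plain binary F1; the model card does not say which averaging was used for the figures above, and `accuracy_and_f1` is a name made up for this example.

```python
from typing import Sequence, Tuple


def accuracy_and_f1(
    y_true: Sequence[int],
    y_pred: Sequence[int],
    positive: int = 1,  # assumed: 1 = speech, 0 = non-speech
) -> Tuple[float, float]:
    """Return (accuracy, F1) for binary labels."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    tp = fp = fn = tn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive and t == positive:
            tp += 1
        elif p == positive:
            fp += 1
        elif t == positive:
            fn += 1
        else:
            tn += 1
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, f1


if __name__ == "__main__":
    truth = [1, 1, 1, 0, 0, 1]
    preds = [1, 0, 1, 0, 1, 1]
    print(accuracy_and_f1(truth, preds))
```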