Voice activity detection (VAD) is the task of distinguishing human speech segments from background noise in an audio stream.

VAD is an important pre-processing stage of an automatic speech recognition (ASR) system: it decides when to start recognition and when to close the microphone.
Because VAD models are meant to be deployed on devices, they need to be small and efficient, and they must run with low latency.
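
To illustrate how a VAD output can drive the "start ASR" and "close the microphone" decisions, here is a minimal sketch that turns per-frame speech probabilities into speech segments. It is not taken from the tutorial or from NeMo: the function name `speech_segments`, the `onset`/`offset` thresholds, and the frame duration are all assumptions chosen for illustration.

```python
from typing import List, Tuple


def speech_segments(
    probs: List[float],
    frame_sec: float = 0.01,   # assumed duration of one frame, in seconds
    onset: float = 0.8,        # assumed threshold to open a speech segment
    offset: float = 0.5,       # assumed threshold to close a speech segment
) -> List[Tuple[float, float]]:
    """Convert per-frame speech probabilities into (start, end) times.

    A segment opens when the probability reaches `onset` and closes when it
    drops below `offset`. Using two thresholds (hysteresis) avoids rapid
    toggling when the probability hovers around a single cut-off.
    """
    segments = []
    start = None  # index of the frame where the current segment opened
    for i, p in enumerate(probs):
        if start is None and p >= onset:
            start = i
        elif start is not None and p < offset:
            segments.append((start * frame_sec, i * frame_sec))
            start = None
    if start is not None:
        # The stream ended while speech was still active.
        segments.append((start * frame_sec, len(probs) * frame_sec))
    return segments


# Toy probabilities, one value per 0.5 s frame (made-up numbers).
probs = [0.1, 0.2, 0.9, 0.95, 0.6, 0.4, 0.1, 0.85, 0.9]
print(speech_segments(probs, frame_sec=0.5))  # [(1.0, 2.5), (3.5, 4.5)]
```

In a sketch like this, the start of each segment would correspond to starting ASR and the end to closing the microphone; a real deployment would usually add smoothing and minimum-duration rules on top.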

This VAD tutorial is based on the MatchboxNet model with a modified decoder head to suit classification tasks. MatchboxNet performs well at classifying a segment (at the utterance level, in units of seconds) as speech or non-speech, reaching a 99.78% F1 score.

A Jupyter notebook covering all the steps to download the dataset, train a model, and evaluate its results is available at: [VAD Using NeMo](https://github.com/NVIDIA/NeMo/blob/master/examples/asr/notebooks/6_VAD_using_NeMo.ipynb)


# Model Results
Accuracy: 0.9971

F1 score: 0.9975
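
For readers unfamiliar with these metrics, the sketch below shows how accuracy and F1 are commonly computed for a binary speech / non-speech classifier. It assumes speech is the positive class (label `1`); the card does not state how its F1 score was averaged, so this is one plausible reading rather than the evaluation actually used. The function name and the toy labels are made up for illustration.

```python
from typing import Sequence, Tuple


def accuracy_and_f1(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[float, float]:
    """Return (accuracy, F1) for binary labels, with 1 = speech as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # speech found
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # background found
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarm
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # missed speech

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return accuracy, f1


# Toy labels for eight segments (not real evaluation data).
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
acc, f1 = accuracy_and_f1(y_true, y_pred)
print(f"accuracy={acc:.4f}  f1={f1:.4f}")
```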

vad_matchboxnet_3x1x1

Checkpoint of MatchboxNet 3x1x1 trained on the Google Speech Commands v2 (speech) and Freesound (background) datasets.
