
Google Speech Commands v2 - MatchboxNet 3x2x1

Description
Checkpoint of MatchboxNet 3x2x1 trained on the Google Speech Commands v2 (35-class) dataset
Publisher
NVIDIA
Latest Version
1
Modified
April 4, 2023
Size
766.9 KB

Speech Commands (v2 dataset)

Speech Command Recognition is the task of classifying an input audio pattern into a discrete set of classes. It is a subset of Automatic Speech Recognition, sometimes referred to as keyword spotting, in which a model constantly analyzes incoming audio to detect a set of "command" classes; when one of these commands is detected, the system can take a specific action. Command recognition models are typically designed to be small and efficient so that they can be deployed on low-power sensors and remain active for long periods of time.

This speech command recognition tutorial is based on the QuartzNet model with a modified decoder head suited to classification tasks. Instead of predicting a token for each time step of the input, the model predicts a single label for the entire duration of the audio signal. This is accomplished by a decoder head that performs global max/average pooling across all time steps prior to classification, after which the model can be trained with standard categorical cross-entropy loss. The pipeline consists of three steps:
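The pooled-decoder idea can be sketched in a few lines. This is a hypothetical pure-Python stand-in for illustration, not the actual NeMo implementation: per-time-step class logits of shape [T, C] are reduced to a single [C] vector by pooling over time, and the utterance-level label is the argmax of the pooled scores.

```python
# Illustrative sketch of a pooled decoder head (not the NeMo code).
# `logits` is a list of T time steps, each a list of C class scores.

def global_avg_pool(logits):
    """Average per-time-step class logits over all T time steps."""
    T, C = len(logits), len(logits[0])
    return [sum(step[c] for step in logits) / T for c in range(C)]

def global_max_pool(logits):
    """Take the per-class maximum over all time steps."""
    C = len(logits[0])
    return [max(step[c] for step in logits) for c in range(C)]

def classify(logits):
    """Predict one label for the whole utterance from average-pooled logits."""
    pooled = global_avg_pool(logits)
    return max(range(len(pooled)), key=pooled.__getitem__)

# Toy example: 3 time steps, 2 classes
frame_logits = [[0.2, 0.9], [0.1, 0.8], [0.3, 0.4]]
print(classify(frame_logits))  # class 1 has the higher average score
```

During training, the pooled [C] vector (rather than per-frame predictions) is what feeds the categorical cross-entropy loss, which is why a single label per clip suffices.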

  1. Audio preprocessing (feature extraction): signal normalization, windowing, and (log) spectrogram, mel-scale spectrogram, or MFCC features.
  2. Data augmentation with SpecAugment to increase the effective number of training samples.
  3. A small neural classification model that can be trained efficiently.
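Step 1 above can be illustrated with a minimal framing-plus-windowing sketch. This is a simplified, assumption-laden stand-in (frame/hop lengths and the naive O(n²) DFT are illustrative; real pipelines use an FFT and mel filterbanks):

```python
import math

def frame_signal(signal, frame_len=400, hop=160):
    """Split a 1-D signal into overlapping frames (25 ms / 10 ms at 16 kHz)."""
    return [signal[start:start + frame_len]
            for start in range(0, len(signal) - frame_len + 1, hop)]

def hann_window(n):
    """Hann window coefficients of length n."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]

def log_power_spectrum(frame):
    """Window one frame and compute its log power spectrum via a naive DFT
    (illustrative only; real feature extractors use an FFT)."""
    n = len(frame)
    x = [s * w for s, w in zip(frame, hann_window(n))]
    spec = []
    for k in range(n // 2 + 1):
        re = sum(x[i] * math.cos(2 * math.pi * k * i / n) for i in range(n))
        im = -sum(x[i] * math.sin(2 * math.pi * k * i / n) for i in range(n))
        spec.append(math.log(re * re + im * im + 1e-12))
    return spec
```

Stacking `log_power_spectrum` over all frames yields the 2-D time-frequency representation (spectrogram) that the classifier consumes.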

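Step 2, SpecAugment, masks random time and frequency bands of the spectrogram so the model cannot rely on any single region. A simplified sketch (one mask of each kind; widths are illustrative, not the paper's exact policy):

```python
import random

def spec_augment(spec, max_freq_mask=2, max_time_mask=2, rng=None):
    """Simplified SpecAugment: zero out one random frequency band (rows)
    and one random time band (columns) of a [freq][time] spectrogram."""
    rng = rng or random.Random()
    n_freq, n_time = len(spec), len(spec[0])
    out = [row[:] for row in spec]  # copy so the input is untouched

    # Frequency mask: zero a contiguous band of rows
    f = rng.randint(0, max_freq_mask)
    if f:
        f0 = rng.randint(0, n_freq - f)
        for i in range(f0, f0 + f):
            out[i] = [0.0] * n_time

    # Time mask: zero a contiguous band of columns
    t = rng.randint(0, max_time_mask)
    if t:
        t0 = rng.randint(0, n_time - t)
        for row in out:
            for j in range(t0, t0 + t):
                row[j] = 0.0
    return out

spec = [[1.0] * 8 for _ in range(4)]  # toy 4-bin x 8-frame spectrogram
aug = spec_augment(spec, rng=random.Random(0))
```

Because each training epoch sees differently masked copies of the same clip, the augmentation effectively multiplies the number of distinct training samples.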
A Jupyter notebook containing all the steps to download the dataset, train the model, and evaluate its results is available at: Speech Commands Using NeMo

Model Results

MatchboxNet 3x2x1

  • Parameter count: 93K
  • Accuracy: 97.2921%