
SpeakerDiarization Speakernet

Description
SpeakerNet-M model for Speaker Diarization inference
Publisher
NVIDIA
Latest Version
1.0.0rc1
Modified
April 4, 2023
Size
24.09 MB

Model Overview

Speaker Diarization (SD) is the task of segmenting audio recordings by speaker label, that is, answering the question "who spoke when?"

A diarization system consists of a Voice Activity Detection (VAD) model, which produces time stamps for the speech regions of the audio while ignoring background noise, and a speaker embeddings model, which extracts speaker embeddings from the speech segments defined by those time stamps. The embeddings are then clustered into groups according to the number of speakers present in the recording.
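The clustering step described above can be illustrated with a minimal, self-contained sketch. This is a simplified greedy stand-in for the actual clustering used in NeMo, and the threshold value and toy embeddings are illustrative assumptions:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def greedy_cluster(embeddings, threshold=0.7):
    """Assign each embedding to the most similar existing cluster centroid
    (if similarity >= threshold), else start a new cluster.
    Returns one cluster label per segment."""
    centroids, labels = [], []
    for emb in embeddings:
        best, best_sim = None, threshold
        for idx, c in enumerate(centroids):
            sim = cosine_similarity(emb, c)
            if sim >= best_sim:
                best, best_sim = idx, sim
        if best is None:
            centroids.append(list(emb))
            labels.append(len(centroids) - 1)
        else:
            # Update the centroid as a simple elementwise average.
            centroids[best] = [(x + y) / 2 for x, y in zip(centroids[best], emb)]
            labels.append(best)
    return labels

# Two well-separated toy "speakers" in a 3-dimensional embedding space.
segs = [(1.0, 0.1, 0.0), (0.9, 0.2, 0.0), (0.0, 0.1, 1.0), (0.1, 0.0, 0.9)]
print(greedy_cluster(segs))  # → [0, 0, 1, 1]
```

In practice, real speaker embeddings are higher-dimensional, and NeMo's diarization pipeline uses a more robust clustering method than this greedy pass.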

Model Architecture

Speaker diarization in NeMo [3] currently supports inference only, using pretrained SpeakerNet [1] and VAD [2] models. This model can be used for speaker diarization inference either combined with any VAD model, or without a VAD model (using ground-truth speech time stamps) for oracle evaluation.

Training

We train the MarbleNet and SpeakerNet models separately on a composite dataset comprising several thousand hours of speech, compiled from various publicly available sources. The NeMo toolkit [3] was used to train this model for a few hundred epochs on multiple GPUs.

Datasets

The following datasets were used to train the SpeakerNet model.

Performance

  • This model (speakerdiarization_speakernet) achieves a Speaker Error Rate (SER) of 5.4% on the CH109 set.
  • The speakerverification_speakernet model achieves a Speaker Error Rate (SER) of 4.1% on the AMI Lapel test set.

How to use this model

Steps for loading the NeMo model for speaker embedding extraction, in order to perform oracle or non-oracle speaker diarization, are explained in this Notebook.
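NeMo's diarization inference is driven by a line-delimited JSON manifest that points at the audio to process. The sketch below writes one manifest entry; the field names follow the NeMo diarization manifest convention as an assumption, and the file path is hypothetical:

```python
import json

# One manifest entry per audio file. Field names here are assumed from the
# NeMo diarization manifest convention; adjust to match your NeMo version.
# Set "rttm_filepath" to a reference RTTM for oracle evaluation, or leave
# it null for non-oracle inference.
entry = {
    "audio_filepath": "/data/audio/session1.wav",  # hypothetical path
    "offset": 0,
    "duration": None,        # None = process the full file
    "label": "infer",
    "text": "-",
    "num_speakers": 2,       # set to None if the count is unknown
    "rttm_filepath": None,
    "uem_filepath": None,
}

with open("input_manifest.json", "w") as f:
    f.write(json.dumps(entry) + "\n")  # one JSON object per line
```

The linked Notebook shows how to point the diarization config at a manifest like this and run inference end to end.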

Input

This model accepts 16000 Hz (16 kHz) mono-channel audio (WAV files) as input.
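A quick standard-library check that an input file matches the expected format can save a debugging round trip. The file written below is only a stand-in to demonstrate the check:

```python
import struct
import wave

def check_input_format(path, expected_rate=16000, expected_channels=1):
    """Return True if the WAV file is 16 kHz mono, as the model expects."""
    with wave.open(path, "rb") as w:
        return (w.getframerate() == expected_rate
                and w.getnchannels() == expected_channels)

# Write one second of silent 16 kHz mono audio just to demonstrate the check.
with wave.open("example.wav", "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)        # 16-bit PCM
    w.setframerate(16000)
    w.writeframes(struct.pack("<16000h", *([0] * 16000)))

print(check_input_format("example.wav"))  # → True
```

Audio at other sample rates or with multiple channels should be resampled and downmixed (for example with ffmpeg or sox) before inference.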

Output

This model outputs an RTTM file with speaker labels and their time stamps.
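RTTM is a plain-text format with one `SPEAKER` line per segment, carrying the start time, duration, and speaker label. A minimal parser for such output might look like this (the session name and time stamps below are made up for illustration):

```python
def parse_rttm(lines):
    """Parse SPEAKER lines of an RTTM file into (speaker, start, end) tuples.

    RTTM SPEAKER fields: type, file id, channel, onset, duration,
    ortho, speaker type, speaker name, confidence, lookahead.
    """
    segments = []
    for line in lines:
        fields = line.split()
        if not fields or fields[0] != "SPEAKER":
            continue
        start = float(fields[3])
        duration = float(fields[4])
        speaker = fields[7]
        segments.append((speaker, start, round(start + duration, 3)))
    return segments

rttm = [
    "SPEAKER session1 1 0.15 3.20 <NA> <NA> speaker_0 <NA> <NA>",
    "SPEAKER session1 1 3.35 2.05 <NA> <NA> speaker_1 <NA> <NA>",
]
print(parse_rttm(rttm))
# → [('speaker_0', 0.15, 3.35), ('speaker_1', 3.35, 5.4)]
```

The resulting (speaker, start, end) tuples can be compared against reference annotations to compute diarization error metrics.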

Limitations

This model is trained on telephonic speech from the VoxCeleb, Fisher, and Switchboard datasets, and hence may not work as well on non-telephonic speech. For other speech domains, consider fine-tuning the model on that domain or try using the speakerverification model.

References

[1] SpeakerNet: 1D Depth-wise Separable Convolutional Network for Text-Independent Speaker Recognition and Verification
[2] MarbleNet: Deep 1D Time-Channel Separable Convolutional Neural Network for Voice Activity Detection
[3] NVIDIA NeMo Toolkit

License

License to use this model is covered by the license of the NeMo Toolkit [3]. By downloading the public and release version of the model, you accept the terms and conditions of this license.