sslsv

sslsv is a PyTorch-based deep learning framework providing a collection of Self-Supervised Learning (SSL) methods for learning speaker representations, applicable to various speaker-related downstream tasks, notably Speaker Verification (SV).

Our aim is twofold: (1) provide state-of-the-art self-supervised methods by porting algorithms from the computer vision domain; and (2) evaluate them in a comparable environment.

Our training framework is depicted in the figure below.


News

  • April 2024 – Introduction of various new methods and a complete refactoring (v2.0).
  • June 2022 – 🌠 First release of sslsv (v1.0).

Features

General

  • Data:
    • Supervised and Self-supervised datasets (siamese and DINO sampling)
    • Audio augmentation (noise and reverberation)
  • Training:
    • CPU, GPU and multi-GPUs (DataParallel and DistributedDataParallel)
    • Checkpointing, resuming, early stopping and logging
    • Tensorboard and wandb
  • Evaluation:
    • Speaker verification
      • Backend: Cosine scoring and PLDA
      • Metrics: EER, minDCF, actDCF, Cllr, AvgRPrec (see the sketch after this list)
    • Classification (emotion, language, ...)
  • Notebooks: DET curve, scores distribution, t-SNE on embeddings, ...
  • Misc: scalable config, typing, documentation and tests
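
For illustration, the sketch below shows how speaker verification trials can be scored with the cosine backend and how EER is derived from the resulting scores. It is a minimal example built on scikit-learn (a listed dependency), not sslsv's actual evaluation code, and the helper names are hypothetical.

    # Minimal sketch of cosine scoring and EER computation.
    # Hypothetical helpers, not part of sslsv's API.
    import numpy as np
    from sklearn.metrics import roc_curve

    def cosine_score(a: np.ndarray, b: np.ndarray) -> float:
        # Cosine similarity between two speaker embeddings.
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    def compute_eer(labels: np.ndarray, scores: np.ndarray) -> float:
        # EER is the operating point where the false positive rate
        # equals the false negative rate (1 - TPR) on the ROC curve.
        fpr, tpr, _ = roc_curve(labels, scores)
        fnr = 1 - tpr
        idx = np.nanargmin(np.abs(fnr - fpr))
        return float((fpr[idx] + fnr[idx]) / 2)
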
Encoders
  • TDNN (sslsv.encoders.TDNN)
    X-vectors: Robust dnn embeddings for speaker recognition (PDF)
    David Snyder, Daniel Garcia-Romero, Gregory Sell, Daniel Povey, Sanjeev Khudanpur

  • Simple Audio CNN (sslsv.encoders.SimpleAudioCNN)
    Representation Learning with Contrastive Predictive Coding (arXiv)
    Aaron van den Oord, Yazhe Li, Oriol Vinyals

  • ResNet-34 (sslsv.encoders.ResNet34)
    VoxCeleb2: Deep Speaker Recognition (arXiv)
    Joon Son Chung, Arsha Nagrani, Andrew Zisserman

  • ECAPA-TDNN (sslsv.encoders.ECAPATDNN)
    ECAPA-TDNN: Emphasized Channel Attention, Propagation and Aggregation in TDNN Based Speaker Verification (arXiv)
    Brecht Desplanques, Jenthe Thienpondt, Kris Demuynck
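
All encoders map batches of audio to fixed-size speaker embeddings and live under sslsv.encoders. The sketch below only illustrates this contract; the exact import paths and constructor arguments are assumptions, not the documented API.

    # Hypothetical usage sketch: constructor arguments and exact import
    # paths are assumptions; check the sslsv.encoders modules.
    import torch
    from sslsv.encoders.ResNet34 import ResNet34

    encoder = ResNet34()           # real configuration may differ
    audio = torch.randn(8, 32000)  # batch of 2 s of 16 kHz audio
    embeddings = encoder(audio)    # one embedding per utterance
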

Methods
  • LIM (sslsv.methods.LIM)
    Learning Speaker Representations with Mutual Information (arXiv)
    Mirco Ravanelli, Yoshua Bengio

  • CPC (sslsv.methods.CPC)
    Representation Learning with Contrastive Predictive Coding (arXiv)
    Aaron van den Oord, Yazhe Li, Oriol Vinyals

  • SimCLR (sslsv.methods.SimCLR)
    A Simple Framework for Contrastive Learning of Visual Representations (arXiv)
    Ting Chen, Simon Kornblith, Mohammad Norouzi, Geoffrey Hinton

  • MoCo v2+ (sslsv.methods.MoCo)
    Improved Baselines with Momentum Contrastive Learning (arXiv)
    Xinlei Chen, Haoqi Fan, Ross Girshick, Kaiming He

  • DeepCluster v2 (sslsv.methods.DeepCluster)
    Deep Clustering for Unsupervised Learning of Visual Features (arXiv)
    Mathilde Caron, Piotr Bojanowski, Armand Joulin, Matthijs Douze

  • SwAV (sslsv.methods.SwAV)
    Unsupervised Learning of Visual Features by Contrasting Cluster Assignments (arXiv)
    Mathilde Caron, Ishan Misra, Julien Mairal, Priya Goyal, Piotr Bojanowski, Armand Joulin

  • W-MSE (sslsv.methods.WMSE)
    Whitening for Self-Supervised Representation Learning (arXiv)
    Aleksandr Ermolov, Aliaksandr Siarohin, Enver Sangineto, Nicu Sebe

  • Barlow Twins (sslsv.methods.BarlowTwins)
    Barlow Twins: Self-Supervised Learning via Redundancy Reduction (arXiv)
    Jure Zbontar, Li Jing, Ishan Misra, Yann LeCun, Stéphane Deny

  • VICReg (sslsv.methods.VICReg)
    VICReg: Variance-Invariance-Covariance Regularization for Self-Supervised Learning (arXiv)
    Adrien Bardes, Jean Ponce, Yann LeCun

  • VIbCReg (sslsv.methods.VIbCReg)
    Computer Vision Self-supervised Learning Methods on Time Series (arXiv)
    Daesoo Lee, Erlend Aune

  • BYOL (sslsv.methods.BYOL)
    Bootstrap Your Own Latent: A New Approach to Self-Supervised Learning (arXiv)
    Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre H. Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Daniel Guo, Mohammad Gheshlaghi Azar, Bilal Piot, Koray Kavukcuoglu, Rémi Munos, Michal Valko

  • SimSiam (sslsv.methods.SimSiam)
    Exploring Simple Siamese Representation Learning (arXiv)
    Xinlei Chen, Kaiming He

  • DINO (sslsv.methods.DINO)
    Emerging Properties in Self-Supervised Vision Transformers (arXiv)
    Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, Armand Joulin

Methods (ours)
  • Combiner (sslsv.methods.Combiner)
    Label-Efficient Self-Supervised Speaker Verification With Information Maximization and Contrastive Learning (arXiv)
    Theo Lepage, Reda Dehak

  • SimCLR Margins (sslsv.methods.SimCLRMargins)
    Additive Margin in Contrastive Self-Supervised Frameworks to Learn Discriminative Speaker Representations (arXiv)
    Theo Lepage, Reda Dehak

  • MoCo Margins (sslsv.methods.MoCoMargins)
    Additive Margin in Contrastive Self-Supervised Frameworks to Learn Discriminative Speaker Representations (arXiv)
    Theo Lepage, Reda Dehak

  • SSPS (sslsv.methods._SSPS)
    Self-Supervised Frameworks for Speaker Verification via Bootstrapped Positive Sampling (arXiv)
    Theo Lepage, Reda Dehak


Requirements

sslsv runs on Python 3.8 with the following dependencies.

Module         Version
torch          >= 1.11.0
torchaudio     >= 0.11.0
numpy          *
pandas         *
soundfile      *
scikit-learn   *
speechbrain    *
tensorboard    *
wandb          *
ruamel.yaml    *
dacite         *
prettyprinter  *
tqdm           *

Note: developers will also need pytest, pre-commit and twine to work on this project.


Datasets

Speaker recognition: VoxCeleb1, VoxCeleb2

Language recognition:

Emotion recognition:

Data-augmentation: MUSAN, Simulated Room Impulse Responses (RIRs)

The data used for the main experiments (conducted on VoxCeleb1 and VoxCeleb2, with data-augmentation) can be automatically downloaded, extracted, and prepared using the following scripts.

python tools/prepare_data/prepare_voxceleb.py data/
python tools/prepare_data/prepare_augmentation.py data/

The resulting data folder should have the structure presented below.

data
β”œβ”€β”€ musan_split/
β”œβ”€β”€ simulated_rirs/
β”œβ”€β”€ voxceleb1/
β”œβ”€β”€ voxceleb2/
β”œβ”€β”€ voxceleb1_test_O
β”œβ”€β”€ voxceleb1_test_H
β”œβ”€β”€ voxceleb1_test_E
β”œβ”€β”€ voxsrc2021_val
β”œβ”€β”€ voxceleb1_train.csv
└── voxceleb2_train.csv

Other datasets have to be manually downloaded and extracted, but their train and trials files can be created using the corresponding scripts from the tools/prepare_data/ folder.

  • Example format of a train file (voxceleb1_train.csv)

    File,Speaker
    voxceleb1/id10001/1zcIwhmdeo4/00001.wav,id10001
    ...
    voxceleb1/id11251/s4R4hvqrhFw/00009.wav,id11251
    
  • Example format of a trials file (voxceleb1_test_O)

    1 voxceleb1/id10270/x6uYqmx31kE/00001.wav voxceleb1/id10270/8jEAjG6SegY/00008.wav
    ...
    0 voxceleb1/id10309/0cYFdtyWVds/00005.wav voxceleb1/id10296/Y-qKARMSO7k/00001.wav
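
These files can be consumed with a few lines of code. The helpers below are a sketch based on the formats shown above (pandas is a listed dependency); the function names are hypothetical and not part of sslsv's API.

    # Hypothetical helpers for reading train and trials files.
    import pandas as pd

    def load_train_file(path: str) -> pd.DataFrame:
        # Train file: CSV with 'File' and 'Speaker' columns.
        return pd.read_csv(path)

    def load_trials_file(path: str) -> list:
        # Trials file: '<label> <enrollment_wav> <test_wav>' per line,
        # where label is 1 (same speaker) or 0 (different speakers).
        trials = []
        with open(path) as f:
            for line in f:
                label, enrollment, test = line.split()
                trials.append((int(label), enrollment, test))
        return trials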
    

Installation

  1. Clone this repository: git clone https://github.com/theolepage/sslsv.git.
  2. Install dependencies: pip install -r requirements.txt.

Note: sslsv can also be installed as a standalone package via pip, either with pip install sslsv (latest release) or with pip install . from the project root folder (latest version).


Usage

  • Start a training (2 GPUs): ./train_ddp.sh 2 <config_path>.
  • Evaluate your model (2 GPUs): ./evaluate_ddp.sh 2 <config_path>.

Note: use sslsv/bin/train.py and sslsv/bin/evaluate.py for non-distributed mode to run with a CPU, a single GPU or multiple GPUs (DataParallel).
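
Experiments are described by YAML config files, which sslsv parses with ruamel.yaml and maps to typed config objects with dacite (both listed dependencies). The sketch below illustrates this pattern only; the dataclass fields are hypothetical, not sslsv's actual config schema.

    # Minimal sketch of the YAML -> typed dataclass config pattern
    # (ruamel.yaml + dacite); field names are hypothetical.
    from dataclasses import dataclass

    import dacite
    from ruamel.yaml import YAML

    @dataclass
    class TrainingConfig:
        epochs: int = 100
        batch_size: int = 256

    def load_config(path: str) -> TrainingConfig:
        with open(path) as f:
            data = YAML(typ="safe").load(f)
        return dacite.from_dict(data_class=TrainingConfig, data=data)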

Tensorboard

You can visualize your experiments with tensorboard --logdir models/your_model/.

wandb

To log your experiments with wandb, first provide your API key with wandb login API_KEY. Then, use wandb online and wandb offline to toggle wandb logging.


Documentation

Documentation is currently under development.


Results

SOTA

  • Train set: VoxCeleb2
  • Evaluation: VoxCeleb1-O (Original)
  • Encoder: ECAPA-TDNN (C=1024)
Method      Model                                             EER (%)  minDCF (p=0.01)  Checkpoint
SimCLR      ssl/voxceleb2/simclr/simclr_e-ecapa-1024          6.41     0.5160           🔗
MoCo        ssl/voxceleb2/moco/moco_e-ecapa-1024              6.38     0.5384           🔗
SwAV        ssl/voxceleb2/swav/swav_e-ecapa-1024              8.33     0.6120           🔗
VICReg      ssl/voxceleb2/vicreg/vicreg_e-ecapa-1024          7.85     0.6004           🔗
DINO        ssl/voxceleb2/dino/dino+_e-ecapa-1024             2.92     0.3523           🔗
Supervised  ssl/voxceleb2/supervised/supervised_e-ecapa-1024  1.34     0.1521           🔗

Acknowledgements

sslsv contains third-party components and code adapted from other open-source projects, including: voxceleb_trainer, voxceleb_unsupervised and solo-learn.


Citations

If you use sslsv, please consider starring this repository on GitHub and citing one of the following papers.

@Article{lepage2025SSLSVBootstrappedPositiveSampling,
  title     = {Self-Supervised Frameworks for Speaker Verification via Bootstrapped Positive Sampling},
  author    = {Lepage, Theo and Dehak, Reda},
  year      = {2025},
  journal   = {arXiv preprint},
  url       = {https://arxiv.org/abs/2501.17772},
}

@InProceedings{lepage2024AdditiveMarginSSLSV,
  title     = {Additive Margin in Contrastive Self-Supervised Frameworks to Learn Discriminative Speaker Representations},
  author    = {Lepage, Theo and Dehak, Reda},
  year      = {2024},
  booktitle = {The Speaker and Language Recognition Workshop (Odyssey 2024)},
  pages     = {38--42},
  doi       = {10.21437/odyssey.2024-6},
  url       = {https://www.isca-archive.org/odyssey_2024/lepage24_odyssey.html},
}

@InProceedings{lepage2023ExperimentingAdditiveMarginsSSLSV,
  title     = {Experimenting with Additive Margins for Contrastive Self-Supervised Speaker Verification},
  author    = {Lepage, Theo and Dehak, Reda},
  year      = {2023},
  booktitle = {Interspeech 2023},
  pages     = {4708--4712},
  doi       = {10.21437/Interspeech.2023-1479},
  url       = {https://www.isca-speech.org/archive/interspeech_2023/lepage23_interspeech.html},
}

@InProceedings{lepage2022LabelEfficientSSLSV,
  title     = {Label-Efficient Self-Supervised Speaker Verification With Information Maximization and Contrastive Learning},
  author    = {Lepage, Theo and Dehak, Reda},
  year      = {2022},
  booktitle = {Interspeech 2022},
  pages     = {4018--4022},
  doi       = {10.21437/Interspeech.2022-802},
  url       = {https://www.isca-speech.org/archive/interspeech_2022/lepage22_interspeech.html},
}

License

This project is released under the MIT License.