sslsv is a PyTorch-based deep learning framework that implements a collection of Self-Supervised Learning (SSL) methods for learning speaker representations, applicable to various speaker-related downstream tasks, notably Speaker Verification (SV).
Our aims are to: (1) provide state-of-the-art (SOTA) self-supervised methods by porting algorithms from the computer vision domain; and (2) evaluate them in a comparable environment.
Our training framework is depicted by the figure below.
- April 2024: Introduction of various new methods and a complete refactoring (v2.0).
- June 2022: First release of sslsv (v1.0).
## General

- Data:
  - Supervised and self-supervised datasets (siamese and DINO sampling)
  - Audio augmentation (noise and reverberation)
- Training:
  - CPU, GPU and multi-GPU (DataParallel and DistributedDataParallel)
  - Checkpointing, resuming, early stopping and logging
  - Tensorboard and wandb
- Evaluation:
  - Speaker verification
    - Backends: cosine scoring and PLDA
    - Metrics: EER, minDCF, actDCF, Cllr, AvgRPrec
  - Classification (emotion, language, ...)
  - Notebooks: DET curve, score distributions, t-SNE on embeddings, ...
- Misc: scalable config, typing, documentation and tests
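The verification metrics above operate on trial scores and binary labels. As a rough, self-contained sketch (not sslsv's actual implementation), EER can be computed as the operating point where false acceptances and false rejections balance:

```python
def compute_eer(scores, labels):
    """Equal Error Rate: the operating point where the false acceptance
    rate (FAR) and the false rejection rate (FRR) are closest.

    scores -- similarity score per trial (higher = more likely same speaker)
    labels -- 1 for target (same-speaker) trials, 0 for non-target trials
    """
    targets = [s for s, l in zip(scores, labels) if l == 1]
    nontargets = [s for s, l in zip(scores, labels) if l == 0]
    best_gap, eer = float("inf"), None
    for t in sorted(set(scores)):  # every score is a candidate threshold
        far = sum(s >= t for s in nontargets) / len(nontargets)
        frr = sum(s < t for s in targets) / len(targets)
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer

scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]
labels = [1, 1, 0, 1, 0, 0]
print(f"EER = {compute_eer(scores, labels):.2%}")  # -> EER = 33.33%
```

minDCF and actDCF instead weight the two error types by application-dependent priors and costs rather than equating them.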
## Encoders

- TDNN (`sslsv.encoders.TDNN`)
  X-vectors: Robust DNN Embeddings for Speaker Recognition (PDF)
  David Snyder, Daniel Garcia-Romero, Gregory Sell, Daniel Povey, Sanjeev Khudanpur
- Simple Audio CNN (`sslsv.encoders.SimpleAudioCNN`)
  Representation Learning with Contrastive Predictive Coding (arXiv)
  Aaron van den Oord, Yazhe Li, Oriol Vinyals
- ResNet-34 (`sslsv.encoders.ResNet34`)
  VoxCeleb2: Deep Speaker Recognition (arXiv)
  Joon Son Chung, Arsha Nagrani, Andrew Zisserman
- ECAPA-TDNN (`sslsv.encoders.ECAPATDNN`)
  ECAPA-TDNN: Emphasized Channel Attention, Propagation and Aggregation in TDNN Based Speaker Verification (arXiv)
  Brecht Desplanques, Jenthe Thienpondt, Kris Demuynck
## Methods

- LIM (`sslsv.methods.LIM`)
  Learning Speaker Representations with Mutual Information (arXiv)
  Mirco Ravanelli, Yoshua Bengio
- CPC (`sslsv.methods.CPC`)
  Representation Learning with Contrastive Predictive Coding (arXiv)
  Aaron van den Oord, Yazhe Li, Oriol Vinyals
- SimCLR (`sslsv.methods.SimCLR`)
  A Simple Framework for Contrastive Learning of Visual Representations (arXiv)
  Ting Chen, Simon Kornblith, Mohammad Norouzi, Geoffrey Hinton
- MoCo v2+ (`sslsv.methods.MoCo`)
  Improved Baselines with Momentum Contrastive Learning (arXiv)
  Xinlei Chen, Haoqi Fan, Ross Girshick, Kaiming He
- DeepCluster v2 (`sslsv.methods.DeepCluster`)
  Deep Clustering for Unsupervised Learning of Visual Features (arXiv)
  Mathilde Caron, Piotr Bojanowski, Armand Joulin, Matthijs Douze
- SwAV (`sslsv.methods.SwAV`)
  Unsupervised Learning of Visual Features by Contrasting Cluster Assignments (arXiv)
  Mathilde Caron, Ishan Misra, Julien Mairal, Priya Goyal, Piotr Bojanowski, Armand Joulin
- W-MSE (`sslsv.methods.WMSE`)
  Whitening for Self-Supervised Representation Learning (arXiv)
  Aleksandr Ermolov, Aliaksandr Siarohin, Enver Sangineto, Nicu Sebe
- Barlow Twins (`sslsv.methods.BarlowTwins`)
  Barlow Twins: Self-Supervised Learning via Redundancy Reduction (arXiv)
  Jure Zbontar, Li Jing, Ishan Misra, Yann LeCun, Stéphane Deny
- VICReg (`sslsv.methods.VICReg`)
  VICReg: Variance-Invariance-Covariance Regularization for Self-Supervised Learning (arXiv)
  Adrien Bardes, Jean Ponce, Yann LeCun
- VIbCReg (`sslsv.methods.VIbCReg`)
  Computer Vision Self-supervised Learning Methods on Time Series (arXiv)
  Daesoo Lee, Erlend Aune
- BYOL (`sslsv.methods.BYOL`)
  Bootstrap Your Own Latent: A New Approach to Self-Supervised Learning (arXiv)
  Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre H. Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Daniel Guo, Mohammad Gheshlaghi Azar, Bilal Piot, Koray Kavukcuoglu, Rémi Munos, Michal Valko
- SimSiam (`sslsv.methods.SimSiam`)
  Exploring Simple Siamese Representation Learning (arXiv)
  Xinlei Chen, Kaiming He
- DINO (`sslsv.methods.DINO`)
  Emerging Properties in Self-Supervised Vision Transformers (arXiv)
  Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, Armand Joulin
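Most contrastive methods above (SimCLR, MoCo variants) share the same objective shape: two augmented views of the same utterance form a positive pair, while views of the other utterances in the batch act as negatives. A minimal NumPy sketch of the NT-Xent loss used by SimCLR, for illustration only (the actual implementation lives in `sslsv.methods.SimCLR`):

```python
import numpy as np

def nt_xent_loss(z1, z2, temperature=0.5):
    """NT-Xent (normalized temperature-scaled cross-entropy) loss.

    z1, z2 -- (N, D) embeddings of two augmented views of N utterances;
    row i of z1 and row i of z2 form the positive pair.
    """
    z = np.concatenate([z1, z2], axis=0)              # (2N, D)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)  # L2-normalize rows
    sim = z @ z.T / temperature                       # scaled cosine similarities
    np.fill_diagonal(sim, -np.inf)                    # a view is not its own pair
    n = len(z1)
    pos = np.concatenate([np.arange(n, 2 * n), np.arange(n)])  # positive index per row
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -log_prob[np.arange(2 * n), pos].mean()

# Perfectly aligned views yield a lower loss than mismatched ones.
z = np.eye(4, 8)
print(nt_xent_loss(z, z.copy()), nt_xent_loss(z, np.roll(z, 1, axis=0)))
```

Non-contrastive methods (BYOL, SimSiam, DINO, ...) drop the explicit negatives and instead rely on predictors, stop-gradients or distillation to avoid collapse.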
## Methods (ours)

- Combiner (`sslsv.methods.Combiner`)
  Label-Efficient Self-Supervised Speaker Verification With Information Maximization and Contrastive Learning (arXiv)
  Theo Lepage, Reda Dehak
- SimCLR Margins (`sslsv.methods.SimCLRMargins`)
  Additive Margin in Contrastive Self-Supervised Frameworks to Learn Discriminative Speaker Representations (arXiv)
  Theo Lepage, Reda Dehak
- MoCo Margins (`sslsv.methods.MoCoMargins`)
  Additive Margin in Contrastive Self-Supervised Frameworks to Learn Discriminative Speaker Representations (arXiv)
  Theo Lepage, Reda Dehak
- SSPS (`sslsv.methods._SSPS`)
  Self-Supervised Frameworks for Speaker Verification via Bootstrapped Positive Sampling (arXiv)
  Theo Lepage, Reda Dehak
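As the titles above suggest, the Margins variants subtract an additive margin from the positive similarity before the softmax (AM-Softmax style), which makes the objective strictly harder and tightens the decision boundary. A simplified single-anchor sketch, not the exact formulation used in sslsv (see the papers for details):

```python
import numpy as np

def margin_contrastive_loss(pos_sim, neg_sims, margin=0.1, temperature=0.1):
    """Contrastive cross-entropy for one anchor with an additive margin
    subtracted from the positive cosine similarity.

    pos_sim  -- cosine similarity with the positive (same speaker)
    neg_sims -- cosine similarities with the negatives (other speakers)
    """
    logits = np.concatenate(([pos_sim - margin], neg_sims)) / temperature
    log_prob = logits - np.log(np.exp(logits).sum())
    return -log_prob[0]  # the positive sits at index 0

# The margin increases the loss for the same scores, enforcing a stricter boundary.
print(margin_contrastive_loss(0.8, [0.1, 0.2], margin=0.0))
print(margin_contrastive_loss(0.8, [0.1, 0.2], margin=0.2))
```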
sslsv runs on Python 3.8 with the following dependencies.

| Module | Version |
|---|---|
| torch | >= 1.11.0 |
| torchaudio | >= 0.11.0 |
| numpy | * |
| pandas | * |
| soundfile | * |
| scikit-learn | * |
| speechbrain | * |
| tensorboard | * |
| wandb | * |
| ruamel.yaml | * |
| dacite | * |
| prettyprinter | * |
| tqdm | * |

Note: developers will also need `pytest`, `pre-commit` and `twine` to work on this project.
Speaker recognition:
Language recognition:
Emotion recognition:
Data-augmentation:
Data used for the main experiments (conducted on VoxCeleb1 and VoxCeleb2 with data-augmentation) can be automatically downloaded, extracted and prepared using the following scripts.

```bash
python tools/prepare_data/prepare_voxceleb.py data/
python tools/prepare_data/prepare_augmentation.py data/
```

The resulting `data` folder should have the structure presented below.
```
data
├── musan_split/
├── simulated_rirs/
├── voxceleb1/
├── voxceleb2/
├── voxceleb1_test_O
├── voxceleb1_test_H
├── voxceleb1_test_E
├── voxsrc2021_val
├── voxceleb1_train.csv
└── voxceleb2_train.csv
```
Other datasets have to be downloaded and extracted manually, but their train and trials files can be created using the corresponding scripts from the `tools/prepare_data/` folder.
- Example format of a train file (`voxceleb1_train.csv`):

  ```
  File,Speaker
  voxceleb1/id10001/1zcIwhmdeo4/00001.wav,id10001
  ...
  voxceleb1/id11251/s4R4hvqrhFw/00009.wav,id11251
  ```

- Example format of a trials file (`voxceleb1_test_O`):

  ```
  1 voxceleb1/id10270/x6uYqmx31kE/00001.wav voxceleb1/id10270/8jEAjG6SegY/00008.wav
  ...
  0 voxceleb1/id10309/0cYFdtyWVds/00005.wav voxceleb1/id10296/Y-qKARMSO7k/00001.wav
  ```
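Given the trials format above (one trial per line: label, enrollment utterance, test utterance, space-separated), such a file can be parsed in a few lines of Python. `parse_trials` is a hypothetical helper shown for illustration, not part of sslsv:

```python
def parse_trials(lines):
    """Parse speaker-verification trials of the form
    '<label> <enrollment_wav> <test_wav>', where label is
    1 (target, same speaker) or 0 (non-target).
    Returns a list of (label, enrollment, test) tuples."""
    trials = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        label, enroll, test = line.split()
        trials.append((int(label), enroll, test))
    return trials

trials = parse_trials([
    "1 voxceleb1/id10270/x6uYqmx31kE/00001.wav voxceleb1/id10270/8jEAjG6SegY/00008.wav",
    "0 voxceleb1/id10309/0cYFdtyWVds/00005.wav voxceleb1/id10296/Y-qKARMSO7k/00001.wav",
])
print(trials[0][0], trials[1][0])  # -> 1 0
```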
- Clone this repository: `git clone https://github.com/theolepage/sslsv.git`.
- Install dependencies: `pip install -r requirements.txt`.

Note: sslsv can also be installed as a standalone package via pip with `pip install sslsv` or with `pip install .` (in the project root folder) to get the latest version.
- Start a training (2 GPUs): `./train_ddp.sh 2 <config_path>`.
- Evaluate your model (2 GPUs): `./evaluate_ddp.sh 2 <config_path>`.

Note: use `sslsv/bin/train.py` and `sslsv/bin/evaluate.py` to run in non-distributed mode with a CPU, a single GPU or multiple GPUs (DataParallel).

You can visualize your experiments with `tensorboard --logdir models/your_model/`.

Use `wandb online` and `wandb offline` to toggle wandb. To log your experiments, you first need to provide your API key with `wandb login API_KEY`.
Documentation is currently being developed...
- Train set: VoxCeleb2
- Evaluation: VoxCeleb1-O (Original)
- Encoder: ECAPA-TDNN (C=1024)
| Method | Model | EER (%) | minDCF (p=0.01) | Checkpoint |
|---|---|---|---|---|
| SimCLR | `ssl/voxceleb2/simclr/simclr_e-ecapa-1024` | 6.41 | 0.5160 | 🔗 |
| MoCo | `ssl/voxceleb2/moco/moco_e-ecapa-1024` | 6.38 | 0.5384 | 🔗 |
| SwAV | `ssl/voxceleb2/swav/swav_e-ecapa-1024` | 8.33 | 0.6120 | 🔗 |
| VICReg | `ssl/voxceleb2/vicreg/vicreg_e-ecapa-1024` | 7.85 | 0.6004 | 🔗 |
| DINO | `ssl/voxceleb2/dino/dino+_e-ecapa-1024` | 2.92 | 0.3523 | 🔗 |
| Supervised | `ssl/voxceleb2/supervised/supervised_e-ecapa-1024` | 1.34 | 0.1521 | 🔗 |
sslsv contains third-party components and code adapted from other open-source projects, including voxceleb_trainer, voxceleb_unsupervised and solo-learn.
If you use sslsv, please consider starring this repository on GitHub and citing one of the following papers.
@Article{lepage2025SSLSVBootstrappedPositiveSampling,
title = {Self-Supervised Frameworks for Speaker Verification via Bootstrapped Positive Sampling},
author = {Lepage, Theo and Dehak, Reda},
year = {2025},
journal = {arXiv preprint arXiv:2501.17772},
url = {https://arxiv.org/abs/2501.17772},
}
@InProceedings{lepage2024AdditiveMarginSSLSV,
title = {Additive Margin in Contrastive Self-Supervised Frameworks to Learn Discriminative Speaker Representations},
author = {Lepage, Theo and Dehak, Reda},
year = {2024},
booktitle = {The Speaker and Language Recognition Workshop (Odyssey 2024)},
pages = {38--42},
doi = {10.21437/odyssey.2024-6},
url = {https://www.isca-archive.org/odyssey_2024/lepage24_odyssey.html},
}
@InProceedings{lepage2023ExperimentingAdditiveMarginsSSLSV,
title = {Experimenting with Additive Margins for Contrastive Self-Supervised Speaker Verification},
author = {Lepage, Theo and Dehak, Reda},
year = {2023},
booktitle = {Interspeech 2023},
pages = {4708--4712},
doi = {10.21437/Interspeech.2023-1479},
url = {https://www.isca-speech.org/archive/interspeech_2023/lepage23_interspeech.html},
}
@InProceedings{lepage2022LabelEfficientSSLSV,
title = {Label-Efficient Self-Supervised Speaker Verification With Information Maximization and Contrastive Learning},
author = {Lepage, Theo and Dehak, Reda},
year = {2022},
booktitle = {Interspeech 2022},
pages = {4018--4022},
doi = {10.21437/Interspeech.2022-802},
url = {https://www.isca-speech.org/archive/interspeech_2022/lepage22_interspeech.html},
}
This project is released under the MIT License.