This repository contains the official demo code for DART, accepted at the Audio Imagination Workshop, NeurIPS 2024.
DART disentangles speaker identity and accent representation in multispeaker TTS using a structured latent framework.
Train on L2-ARCTIC:
```
CUDA_VISIBLE_DEVICES=0 python train.py --dataset L2ARCTIC
```
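To resume training from a saved checkpoint, the Comprehensive Transformer TTS trainer this repo builds on accepts a `--restore_step` argument; assuming DART keeps that interface, resuming looks like:

```
CUDA_VISIBLE_DEVICES=0 python train.py --dataset L2ARCTIC --restore_step 704000
```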
Two synthesis scripts are provided:

- `synthesize_converted.py`: generates speech across combinations of speakers, accents, and sentences.
- `synthesize_stats_valset.py`: generates speech from a metadata `.txt` file.
Before inference, extract embeddings:
```
python extract_stats.py
```
This saves MLVAE-based embeddings for speakers and accents.
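As a rough illustration of how the saved statistics can be used, the sketch below loads hypothetical output files and pairs a speaker embedding with a mismatched accent embedding, which is the accent-conversion idea behind DART. The file names, dictionary keys, and the concatenation step are assumptions for illustration only; check `extract_stats.py` and the synthesis scripts for the actual format.

```python
# Minimal sketch: inspect the saved MLVAE embedding statistics and pair
# a speaker with a different accent. File names and keys are assumed
# for illustration; see extract_stats.py for the real output layout.
import numpy as np

speaker_embs = np.load("stats/speaker_embeddings.npy", allow_pickle=True).item()
accent_embs = np.load("stats/accent_embeddings.npy", allow_pickle=True).item()

spk = speaker_embs["ABA"]    # an L2-ARCTIC speaker (Arabic L1)
acc = accent_embs["Korean"]  # a different accent class
# Because speaker and accent are disentangled, the two vectors can be
# combined freely at synthesis time (shown here as a simple concatenation).
conditioning = np.concatenate([spk, acc])
print(conditioning.shape)
```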
Then run synthesis from a trained checkpoint:

```
CUDA_VISIBLE_DEVICES=0 python synthesize_converted.py --dataset L2ARCTIC --restore_step 704000
```
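`synthesize_stats_valset.py` can presumably be invoked the same way; assuming it shares the `--dataset` and `--restore_step` arguments:

```
CUDA_VISIBLE_DEVICES=0 python synthesize_stats_valset.py --dataset L2ARCTIC --restore_step 704000
```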
If you find this model useful, please cite our paper:
```
@inproceedings{melechovsky2024dart,
  title={DART: Disentanglement of Accent and Speaker Representation in Multispeaker Text-to-Speech},
  author={Melechovsky, J. and Mehrish, A. and Sisman, B. and Herremans, D.},
  booktitle={Audio Imagination Workshop, NeurIPS},
  year={2024}
}
```
Based on Comprehensive Transformer TTS by Keon Lee et al.
Open an issue for questions or collaboration.