Maintaining speaker IDs globally across audio file #question #126
Replies: 3 comments
-
Hi @petersnias,
In theory, speaker labels should be maintained throughout the whole conversation, but this is not guaranteed: it depends on the clustering module working well (for which good hyper-parameters are key). In general, re-identification is shakier at the beginning, when limited information is available, and improves as the conversation progresses. Once the centroids converge, it is reasonable to expect good re-identification performance. Given the pre-training objective of the embedding model, it's safer to expect an embedding from speaker 1 to be close to that speaker's true centroid than to any other individual embedding from speaker 1.
This is the reason behind incremental clustering: its job is to unify labels from different predictions using pre-trained speaker embeddings. Currently, pyannote does not support online diarization.
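To make the idea concrete, here is a minimal, hypothetical sketch of centroid-based incremental clustering. This is not diart's actual implementation; the class name, threshold value, and running-mean update are assumptions for illustration. Each new speaker embedding is matched to the closest known centroid by cosine similarity; if no centroid is close enough, a new speaker is created. Because centroids are running means, they stabilize as more embeddings arrive, which is why re-identification tends to improve over the course of a conversation.

```python
import math

def cosine(u, v):
    # Cosine similarity between two vectors
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

class IncrementalClustering:
    def __init__(self, threshold=0.7):
        self.threshold = threshold  # assumed similarity cutoff
        self.centroids = []         # one running-mean centroid per speaker
        self.counts = []            # embeddings seen per speaker

    def assign(self, embedding):
        # Find the most similar existing centroid
        best, best_sim = None, -1.0
        for i, c in enumerate(self.centroids):
            sim = cosine(embedding, c)
            if sim > best_sim:
                best, best_sim = i, sim
        if best is not None and best_sim >= self.threshold:
            # Known speaker: refine its centroid with a running mean
            n = self.counts[best]
            self.centroids[best] = [
                (c * n + e) / (n + 1)
                for c, e in zip(self.centroids[best], embedding)
            ]
            self.counts[best] = n + 1
            return best
        # No centroid is close enough: create a new speaker label
        self.centroids.append(list(embedding))
        self.counts.append(1)
        return len(self.centroids) - 1
```

With this scheme, labels stay globally consistent as long as each speaker's embeddings remain closer to their own centroid than the threshold; a too-low threshold merges speakers, a too-high one splits them, which is why hyper-parameters matter so much here.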
I hope this clarifies things, let me know if you have any other questions.
-
Thanks for creating this!
What is the best way to set this up (datasets, hyper-parameters)?
Are there any enhancements you would suggest to prevent that 16-second window from dropping the different speakers? Are there any suggested caches or efficient ways to do so? If not, I would appreciate pointers to any similar libraries you know that do! Thanks again, and let me know if any additional information is needed!
-
Hi @Adawg4,
I would suggest using your own data for this; otherwise, the pre-trained models provided by pyannote (included by default in diart) are good general-purpose solutions.
Yes, there are many ways in which diart can be improved! For example, you could fine-tune the segmentation and embedding models on your data, or you could contribute to making the clustering module better. If you're interested, I'm always eager to discuss ways in which diart can be improved! The idea behind this library is to provide the best possible streaming diarization performance at the lowest possible cost.
-
#question
How long can consistent speaker labels be maintained? For instance, if a speaker is labeled speaker 1 from 0-2 sec and then doesn't speak again until 300-305 sec, will this speaker's label be maintained?
Where could I find the code that performs the operations to ensure speaker labels remain consistent across audio chunks on longer audio files (i.e. 1-3 hours)?
I ask because we have an engine that processes audio in windows of:
0-16 sec
2-18 sec
4-20 sec
i.e., 16-sec windows advancing in 2-sec steps (so consecutive windows overlap by 14 sec). When we pass the audio to pyannote, it seems to reset the speaker labels based on the current 16-sec analysis window. What piece of code aids in maintaining speaker labels across overlapping chunks instead of resetting the labels every time?
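For readers hitting the same issue: one way such label unification could work in principle is to remap each new window's local labels onto global labels by majority vote over the frames the two windows share. This is a hypothetical sketch, not pyannote's or diart's actual code; the function name and frame-level label representation are assumptions.

```python
from collections import Counter

def unify_labels(prev_global, curr_local, overlap):
    """Remap a window's local speaker labels to global labels.

    prev_global: global labels for the previous window's frames.
    curr_local: local labels for the current window's frames.
    overlap: number of frames shared between the two windows.
    """
    # The last `overlap` frames of the previous window align with
    # the first `overlap` frames of the current one.
    shared_prev = prev_global[-overlap:]
    shared_curr = curr_local[:overlap]

    # For each local label seen in the overlap, vote for the global
    # label it most often co-occurs with.
    mapping = {}
    for local in set(shared_curr):
        votes = Counter(
            g for g, l in zip(shared_prev, shared_curr) if l == local
        )
        mapping[local] = votes.most_common(1)[0][0]

    # Local labels unseen in the overlap are genuinely new speakers.
    next_id = max(prev_global, default=-1) + 1
    out = []
    for l in curr_local:
        if l not in mapping:
            mapping[l] = next_id
            next_id += 1
        out.append(mapping[l])
    return out
```

For example, if the previous window ended with global labels `[0, 0, 1, 1]` and the next window's local prediction is `[1, 1, 0, 0]` with a 2-frame overlap, local label 1 is remapped to global 1 and local label 0 (absent from the overlap) becomes a new global speaker 2, yielding `[1, 1, 2, 2]`. Clustering on speaker embeddings (as diart does) is more robust than this frame-voting sketch, but the remapping idea is the same.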
Please let me know if you need further clarification on this question. Thank you