Maintaining speaker IDs globally across audio file #question #126
Replies: 3 comments
-
Hi @petersnias,
In theory, speaker labels should be maintained throughout the whole conversation, but this is not guaranteed: it depends on the clustering module working well (for which good hyper-parameters are key). In general, re-identification is shakier at the beginning, when limited information is available, and improves as the conversation progresses. Once the centroids converge, it is reasonable to expect good re-identification performance. Given the pre-training objective of the embedding model, it's safer to expect an embedding from speaker 1 to be close to that speaker's true centroid than to any other individual embedding from speaker 1.
This is the reason behind incremental clustering: its job is to unify labels from different predictions using pre-trained speaker embeddings. Currently, pyannote does not support online diarization.
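To make the idea concrete, here is a minimal, hypothetical sketch of centroid-based incremental clustering. This is not diart's actual implementation; the class name, threshold value, and running-mean update are assumptions for illustration. Each new speaker embedding is matched to the closest known centroid by cosine similarity; if no centroid is close enough, a new speaker is created. Because centroids are running means, they stabilize as more embeddings arrive, which is why re-identification tends to improve over the course of a conversation.

```python
import math

def cosine(u, v):
    # Cosine similarity between two vectors
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

class IncrementalClustering:
    def __init__(self, threshold=0.7):
        self.threshold = threshold  # assumed similarity cutoff
        self.centroids = []         # one running-mean centroid per speaker
        self.counts = []            # embeddings seen per speaker

    def assign(self, embedding):
        # Find the most similar existing centroid
        best, best_sim = None, -1.0
        for i, c in enumerate(self.centroids):
            sim = cosine(embedding, c)
            if sim > best_sim:
                best, best_sim = i, sim
        if best is not None and best_sim >= self.threshold:
            # Known speaker: refine its centroid with a running mean
            n = self.counts[best]
            self.centroids[best] = [
                (c * n + e) / (n + 1)
                for c, e in zip(self.centroids[best], embedding)
            ]
            self.counts[best] = n + 1
            return best
        # No centroid is close enough: create a new speaker label
        self.centroids.append(list(embedding))
        self.counts.append(1)
        return len(self.centroids) - 1
```

With this scheme, labels stay globally consistent as long as each speaker's embeddings remain closer to their own centroid than the threshold; a too-low threshold merges speakers, a too-high one splits them, which is why hyper-parameters matter so much here.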
I hope this clarifies things, let me know if you have any other questions.
-
Thanks for creating this!
What is the best way to set this up (datasets, hyper-parameters)?
Are there any enhancements you would suggest to prevent that 16-second window from dropping the different speakers? Are there any suggested caches or efficient ways to do so? If not, I would appreciate pointers to any similar libraries you know that do! Thanks again, and let me know if any additional information is needed!
-
Hi @Adawg4,
I would suggest using your own data for this; otherwise, the pre-trained models provided by pyannote (included by default in diart) are good general-purpose solutions.
Yes, there are many ways in which diart can be improved! For example, you could fine-tune the segmentation and embedding models on your data, or you could contribute to making the clustering module better. If you're interested, I'm always eager to discuss ways in which diart can be improved! The idea behind this library is to provide the best possible streaming diarization performance at the lowest possible cost.
-
#question
How long can consistent speaker labels be maintained? For instance, if a speaker is labeled speaker 1 from 0-2 sec and then doesn't speak again until 300-305 sec, will this speaker's label be maintained?
Where could I find the code that performs the operations to ensure speaker labels remain consistent across audio chunks on longer audio files (i.e. 1-3 hours)?
I ask because we have an engine that processes audio in windows of:
0-16 sec
2-18 sec
4-20 sec
i.e., 16-sec windows advancing in 2-sec steps (so consecutive windows overlap by 14 sec). When we pass the audio to pyannote, it seems to reset the speaker labels based on the current 16-sec analysis window. What piece of code aids in maintaining speaker labels across overlapping chunks instead of resetting the labels every time?
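For readers hitting the same issue: one way such label unification could work in principle is to remap each new window's local labels onto global labels by majority vote over the frames the two windows share. This is a hypothetical sketch, not pyannote's or diart's actual code; the function name and frame-level label representation are assumptions.

```python
from collections import Counter

def unify_labels(prev_global, curr_local, overlap):
    """Remap a window's local speaker labels to global labels.

    prev_global: global labels for the previous window's frames.
    curr_local: local labels for the current window's frames.
    overlap: number of frames shared between the two windows.
    """
    # The last `overlap` frames of the previous window align with
    # the first `overlap` frames of the current one.
    shared_prev = prev_global[-overlap:]
    shared_curr = curr_local[:overlap]

    # For each local label seen in the overlap, vote for the global
    # label it most often co-occurs with.
    mapping = {}
    for local in set(shared_curr):
        votes = Counter(
            g for g, l in zip(shared_prev, shared_curr) if l == local
        )
        mapping[local] = votes.most_common(1)[0][0]

    # Local labels unseen in the overlap are genuinely new speakers.
    next_id = max(prev_global, default=-1) + 1
    out = []
    for l in curr_local:
        if l not in mapping:
            mapping[l] = next_id
            next_id += 1
        out.append(mapping[l])
    return out
```

For example, if the previous window ended with global labels `[0, 0, 1, 1]` and the next window's local prediction is `[1, 1, 0, 0]` with a 2-frame overlap, local label 1 is remapped to global 1 and local label 0 (absent from the overlap) becomes a new global speaker 2, yielding `[1, 1, 2, 2]`. Clustering on speaker embeddings (as diart does) is more robust than this frame-voting sketch, but the remapping idea is the same.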
Please let me know if you need further clarification on this question. Thank you