This repository was archived by the owner on Jul 17, 2025. It is now read-only.
TL;DR: we have collected 30k hours of annotation in Russian with close to zero investment in manual annotation, and we are doing the same in English / German / Spanish. My personal goal is to collect 10-20k hours in English and 10k in German + Spanish. We chose these languages (apart from English ofc) because they are popular, we speak them (at least I can read them), and their phonetics are really simple and similar to Russian.
I wonder if FAIR (please correct me if FAIR and facebookresearch are not the same entity) would be interested in a win-win collaboration or in sponsoring our efforts to fully open-source our models and datasets.
Libri-Light offers 60k+ hours of unlabelled speech, a small training set for limited supervision (10h, 1h or 10 minutes of labelled speech), and a common set of metrics to evaluate three settings:
You can build almost fully supervised datasets from Librivox (granted, there will be some noise in the data ofc). I wonder why you did not do / share this. This is such low-hanging fruit!