This repository was archived by the owner on Jul 17, 2025. It is now read-only.

Making Supervised Large Datasets for English / German / Spanish #35

@snakers4

Hi,

I could not find any contact details in the press release or in the paper (please correct me if I am wrong), so I decided to open an issue here to reach out.

My name is Alexander. I am the main author of Open STT and of these recent articles from The Gradient:

TL;DR: we have collected 30k hours of annotated data in Russian with close to zero investment in manual annotation, and we are doing the same for English / German / Spanish. My personal goal is to collect 10-20k hours in English and 10k in German + Spanish. We chose these languages (apart from English, of course) because they are popular, we speak them (at least I can read them), and their phonetics are simple and similar to Russian.

On the Russian data we have built production-grade models and have even deployed some high-load services into production (if you speak Russian, please follow these links: http://silero.ai/, https://mobile-demo.silero.ai/, https://habr.com/ru/post/494006/).

I wonder if FAIR (please correct me if FAIR and facebookresearch are not the same entity) would be interested in a win-win collaboration or in sponsoring our efforts to fully open-source our models and datasets.

Libri-Light offers 60k+ hours of unlabelled speech, a small training set for limited supervision (10h, 1h or 10 minutes of labelled speech), and a common set of metrics to evaluate three settings:

You can build almost fully supervised datasets from Librivox (granted, there will be some noise in the data, of course). I wonder why you did not do / share this. This is such a low-hanging fruit!

Best,
Alexander
