Making Supervised Large Datasets for English / German / Spanish 

Hi,

I have not found any contact details in the press release or in the paper (please correct me if I am wrong), so I decided to open an issue here to reach out.

My name is Alexander, I am the main author of [Open STT](https://github.com/snakers4/open_stt) and these recent articles from The Gradient:
- https://thegradient.pub/towards-an-imagenet-moment-for-speech-to-text/
- https://thegradient.pub/a-speech-to-text-practitioners-criticisms-of-industry-and-academia/

TLDR - we have collected 30k hours of annotated audio in Russian with close to zero investment in manual annotation, and we are doing the same for English / German / Spanish. My personal goal is to collect 10-20k hours in English and 10k in German + Spanish. We have chosen these languages (apart from English ofc) because they are popular, we speak them (at least I can read them), and their phonetics are fairly simple and similar to Russian.

On the Russian data we have built production-grade models and have even deployed some high-load services to production (if you speak Russian, please follow these links: http://silero.ai/, https://mobile-demo.silero.ai/, https://habr.com/ru/post/494006/).

I wonder if FAIR (please correct me if FAIR and facebookresearch are not the same entity) would be interested in a win-win collaboration or in sponsoring our efforts to fully open-source our models and datasets.

> Libri-Light offers 60+ k hours of unlabelled speech, a small training set for limited supervision (10h, 1h or 10 minutes of labelled speech), and a common set of metrics to evaluate three settings:

You can build almost fully supervised datasets from Librivox (granted, there will be some noise in the data ofc). I wonder why you did not do or share this. It is such low-hanging fruit!

Best,
Alexander
