Skip to content

Commit

Permalink
Update README.md
Browse files Browse the repository at this point in the history
  • Loading branch information
goodmike31 authored Mar 10, 2024
1 parent 8b97102 commit f4f95d2
Showing 1 changed file with 2 additions and 2 deletions.
4 changes: 2 additions & 2 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -104,8 +104,8 @@ lastchecked = {2024-03-10}<br>
- [TSV](https://github.com/goodmike31/pl-asr-speech-data-survey/blob/main/snapshots/pl-asr-speech-datasets-taxonomy-latest.tsv)

## Limitations
The metadata collected for the catalog is primarily based on information available online in language repositories and scientific articles. When feasible, it was further verified through manual inspection of dataset content and discussion with dataset authors. Despite our best efforts, some essential metadata values may be missing. Naturally, if the metadata source contained errors, the resulting values in the catalog, metrics, and derived observations may be biased or inaccurate. Additionally, despite our best efforts to manually validate all collected data before publication, the catalog may contain inaccuracies resulting from curation-related errors.
We hope that making the catalog publicly available to the community will enable collective curation of the catalog and taxonomy, particularly by ad- dressing any errors in the catalog metadata that have a practical impact on the community. Engaging in discussions is also crucial for filling in missing values, establishing the availability of datasets with unknown status, and determining licensing and re-usage of existing datasets for Polish ASR benchmarking purposes. Lastly, all datasets available online are subject to automatic analysis to draw new insights from the data and verify values reported by the authors in the documentation. This process, however, may introduce additional limitations in terms of the accuracy and reliability of the resulting insights.
The metadata collected for the catalog is primarily based on information available online in language repositories and scientific articles. When feasible, it was further verified through manual inspection of dataset content and discussion with dataset authors. Despite author's best efforts, some essential metadata values may be missing. Naturally, if the metadata source contained errors, the resulting values in the catalog, metrics, and derived observations may be biased or inaccurate. Additionally, despite best efforts to manually validate all collected data before publication, the catalog may contain inaccuracies resulting from curation-related errors.
Hopefully making the catalog publicly available to the community will enable collective curation of the catalog and taxonomy, particularly by addressing any errors in the catalog metadata that have a practical impact. Engaging in discussions is also crucial for filling in missing values, establishing the availability of datasets with unknown status, and determining licensing and re-usage of existing datasets for Polish ASR benchmarking purposes. Lastly, all datasets available online are subject to automatic analysis to draw new insights from the data and verify values reported by the authors in the documentation. This process, however, may introduce additional limitations in terms of the accuracy and reliability of the resulting insights.

## Changelog

Expand Down

0 comments on commit f4f95d2

Please sign in to comment.