Add analysis scripts to flores200 dataset #705
Each script now has a common `stats_json` argument, which emits the number of tokens not transcribed (those that will be held for byte tokenization). Added an `espeak2ipa.py` script that can target any of the espeak languages; it defaults to Shan for now.
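A minimal sketch of how a shared `--stats_json` option could work, assuming a transcriber that cannot handle every token. The function names, the JSON keys, and the toy lexicon are illustrative assumptions, not the code in this PR.

```python
import argparse
import json


def transcribe_tokens(tokens, lexicon):
    """Transcribe tokens via a lookup table and count the misses.

    `lexicon` is a hypothetical stand-in for a real IPA backend.
    Tokens without an entry pass through unchanged, so a later
    byte-level tokenizer can still handle them.
    """
    stats = {
        "transcribed": 0,
        "not_transcribed": 0,
        "transcribed_bytes": 0,
        "not_transcribed_bytes": 0,
    }
    out = []
    for tok in tokens:
        ipa = lexicon.get(tok)
        n_bytes = len(tok.encode("utf-8"))
        if ipa is None:
            stats["not_transcribed"] += 1
            stats["not_transcribed_bytes"] += n_bytes
            out.append(tok)
        else:
            stats["transcribed"] += 1
            stats["transcribed_bytes"] += n_bytes
            out.append(ipa)
    return out, stats


def build_parser():
    """Parser exposing the common (assumed optional) stats argument."""
    parser = argparse.ArgumentParser(description="Toy IPA transcriber")
    parser.add_argument("--stats_json", default=None,
                        help="Optional path for the coverage stats JSON")
    return parser


def write_stats(stats, path):
    """Write the accumulated counters as pretty-printed JSON."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(stats, f, ensure_ascii=False, indent=2)
```

A script would call `write_stats(stats, args.stats_json)` only when the argument is given, so existing invocations keep working unchanged.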
Pull request overview
This pull request adds a comprehensive suite of analysis, visualization, and processing scripts for the Flores-200 restructured dataset. The changes introduce utilities for language-script analysis, IPA phoneticization with byte coverage statistics, tokenization comparison, and interactive visualizations.
Key changes:
- Added byte coverage statistics tracking to all IPA transcription scripts (Chinese, Korean, Japanese, English, and generic espeak-based)
- Created new plotting utilities for visualizing dataset sizes grouped by script/region/family and comparing tokenization methods
- Introduced shell scripts to automate graph generation and IPA processing workflows
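The grouping step behind the size plots can be sketched as follows. This is only an illustration: the entry fields `lang`, `script`, and `size_bytes` and the helper `group_sizes` are hypothetical, and the actual plotting scripts may structure their data differently.

```python
from collections import defaultdict


def group_sizes(entries, key="script"):
    """Sum `size_bytes` per group and return groups sorted largest first."""
    totals = defaultdict(int)
    for entry in entries:
        totals[entry[key]] += entry["size_bytes"]
    # Sort by descending size, then by name for a stable order.
    return sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))


# Toy data with made-up sizes, for illustration only.
entries = [
    {"lang": "eng", "script": "Latn", "size_bytes": 120},
    {"lang": "fra", "script": "Latn", "size_bytes": 150},
    {"lang": "jpn", "script": "Jpan", "size_bytes": 200},
]
```

The sorted `(group, total)` pairs can then be passed to any bar-plot routine; grouping by region or family would only change the `key` argument, given a matching field on each entry.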
Reviewed changes
Copilot reviewed 25 out of 25 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| data/template/utils/zh_to_ipa.py | Enhanced with byte coverage stats tracking for transcribed vs non-transcribed content |
| data/template/utils/ko_en_to_ipa.py | Added stats tracking and helper functions for Korean IPA transcription |
| data/template/utils/ja2ipa.py | Integrated byte coverage statistics and simplified conditional logic |
| data/template/utils/en2ipa.py | Added stats tracking with thread-safe accumulation for English transcription |
| data/template/utils/espeak2ipa.py | New generic IPA transcription tool supporting any espeak-ng voice with multithreading |
| data/flores200-res/plot_langscript_sizes_grouped.py | Visualizes language-script sizes grouped by region/script/family |
| data/flores200-res/plot_multi_script_languages.py | Plots languages appearing in multiple scripts with fixed color mapping |
| data/flores200-res/plot_tokenization_vs_original.py | Compares tokenized vs original text sizes across methods |
| data/flores200-res/plot_ipa_vs_text.py | Analyzes IPA vs raw text sizes with optional tokenization comparison |
| data/flores200-res/tokenize_and_annotate_sizes.py | Tokenizes files and annotates JSON with tokenized sizes |
| data/flores200-res/spm_vocab_freq_dashboard.py | Interactive HTML dashboard for SentencePiece vocabulary analysis |
| data/flores200-res/filter_files_by_script.py | Extracts script/language fields from files.json |
| data/flores200-res/phoneticize.sh | Updated to process multiple languages with stats output |
| data/flores200-res/graphs.sh | Automates generation of various dataset visualizations |
| data/flores200-res/ipa_scripts.sh | Shell script for IPA vs text comparison workflows |
| data/flores200-res/tokenize.sh | Wrapper for tokenization with tiktoken |
| data/flores200-res/tokenization_vs_origina.sh | Generates tokenization ratio plots (filename has typo) |
| data/flores200-res/get_dataset.sh | Expanded language array with additional languages |
| data/flores200-res/*.json | Added stats files and filtered dataset entries |
| data/flores200-res/README.md | Documentation for scripts, license, and language code references |
| data/flores200-res/.gitignore | Excludes generated PNG files |
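The thread-safe accumulation noted for `en2ipa.py` and the multithreading in `espeak2ipa.py` could look roughly like the sketch below. It is an assumption-laden illustration: `StatsAccumulator`, `process_line`, and `run` are hypothetical names, and a set lookup stands in for the real transcription backend.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class StatsAccumulator:
    """Counters shared across worker threads, guarded by a lock."""

    def __init__(self):
        self._lock = threading.Lock()
        self.transcribed = 0
        self.not_transcribed = 0

    def add(self, transcribed, not_transcribed):
        # The lock makes both updates atomic with respect to other workers.
        with self._lock:
            self.transcribed += transcribed
            self.not_transcribed += not_transcribed


def process_line(line, known_words, acc):
    """Count per-line hits and misses locally, then merge once."""
    tokens = line.split()
    hits = sum(1 for t in tokens if t in known_words)
    acc.add(hits, len(tokens) - hits)


def run(lines, known_words, workers=4):
    acc = StatsAccumulator()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # list() forces completion and re-raises any worker exception.
        list(pool.map(lambda ln: process_line(ln, known_words, acc), lines))
    return acc
```

Counting locally per line and taking the lock once per merge keeps contention low compared with locking on every token.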
    import argparse
    import re
    import json
    from typing import List, Optional, Dict, Any, Tuple

Copilot AI, Jan 4, 2026

Import of 'Tuple' is not used.

Suggested change:

    - from typing import List, Optional, Dict, Any, Tuple
    + from typing import List, Optional, Dict, Any

    import argparse
    import json
    import os

Copilot AI, Jan 4, 2026

Import of 'os' is not used.

Suggested change:

    - import os

This pull request introduces several new scripts and updates for working with the Flores-200 restructured dataset, focusing on language-script analysis, phoneticization, and visualization. It adds new Python utilities for filtering and plotting dataset statistics, updates shell scripts for processing and analyzing data, and includes new documentation and stats files.
New analysis and plotting scripts:
- `plot_langscript_sizes_grouped.py`, a Python script for visualizing language-script dataset sizes grouped and colored by script, region, or family. It includes logic for mapping scripts to regions and families and generates grouped bar plots.
- `filter_files_by_script.py`, a Python utility to extract relevant fields from `files.json` for script/language analysis, outputting a simplified JSON for downstream analysis.

Shell script improvements and additions:
- `phoneticize.sh`, updated to process multiple languages using `espeak2ipa.py`, with improved logic for handling multiple files and stats output.
- `graphs.sh` and `ipa_scripts.sh`, shell scripts to automate generation of visualizations and IPA/text comparisons using the new Python plotting utilities.

Documentation and dataset updates:
- `README.md`, describing the scripts, dataset license, and language code references for the restructured Flores-200 dataset.
- Stats files (`eng_stats.json`, `ja_stats.json`, `ko_stats.json`) reporting transcription statistics for English, Japanese, and Korean data.

Miscellaneous:
- `.gitignore`, to exclude PNG files generated by plotting scripts.

These changes provide a more robust framework for analyzing, processing, and visualizing the Flores-200 restructured dataset, making it easier to work with language-script data and phonetic transcriptions.
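As a rough illustration of the field extraction described for `filter_files_by_script.py`, the sketch below keeps a fixed set of keys from each entry. It assumes `files.json` holds a list of objects; the key names `lang`, `script`, and `size_bytes` and both helper functions are hypothetical, since the real schema is not shown here.

```python
import json


def filter_fields(entries, keep=("lang", "script", "size_bytes")):
    """Keep only the selected keys from each entry; skip missing keys."""
    return [{k: e[k] for k in keep if k in e} for e in entries]


def filter_file(src_path, dst_path, keep=("lang", "script", "size_bytes")):
    """Read a JSON list, reduce each entry, and write the simplified list."""
    with open(src_path, encoding="utf-8") as f:
        entries = json.load(f)
    with open(dst_path, "w", encoding="utf-8") as f:
        json.dump(filter_fields(entries, keep), f, ensure_ascii=False, indent=2)
```

Keeping the pure `filter_fields` step separate from file I/O makes the reduction easy to test on in-memory data.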