Add analysis scripts to flores200 dataset #705

klei22 · 2026-01-04T03:11:22Z

This pull request introduces several new scripts and updates for working with the Flores-200 restructured dataset, focusing on language-script analysis, phoneticization, and visualization. It adds new Python utilities for filtering and plotting dataset statistics, updates shell scripts for processing and analyzing data, and includes new documentation and stats files.

New analysis and plotting scripts:

Added plot_langscript_sizes_grouped.py, a Python script for visualizing language-script dataset sizes grouped and colored by script, region, or family. It includes logic for mapping scripts to regions and families and generates grouped bar plots.
Added filter_files_by_script.py, a Python utility to extract relevant fields from files.json for script/language analysis, outputting a simplified JSON for downstream analysis.
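As a rough illustration of the field extraction described for filter_files_by_script.py, here is a minimal sketch. The field names (`filename`, `language`, `script`, `size_bytes`) and the helper `simplify_entries` are hypothetical; the actual files.json schema is not shown in this PR.

```python
import json

# Hypothetical field names; the real files.json schema may differ.
KEEP_FIELDS = ("filename", "language", "script", "size_bytes")


def simplify_entries(entries, keep=KEEP_FIELDS):
    """Keep only the listed keys from each entry, skipping keys that are absent."""
    return [{k: e[k] for k in keep if k in e} for e in entries]


if __name__ == "__main__":
    # Made-up sample entries standing in for the contents of files.json.
    sample = [
        {"filename": "text_eng_Latn.txt", "language": "eng", "script": "Latn",
         "size_bytes": 1200, "url": "dropped"},
        {"filename": "text_jpn_Jpan.txt", "language": "jpn", "script": "Jpan",
         "size_bytes": 2400},
    ]
    print(json.dumps(simplify_entries(sample), indent=2, ensure_ascii=False))
```

Entries that lack one of the kept keys are passed through with the keys they do have, rather than raising an error.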

Shell script improvements and additions:

Updated phoneticize.sh to process multiple languages using espeak2ipa.py, with improved logic for handling multiple files and stats output.
Added graphs.sh and ipa_scripts.sh shell scripts to automate generation of visualizations and IPA/text comparisons using the new Python plotting utilities.

Documentation and dataset updates:

Added a new README.md describing the scripts, dataset license, and language code references for the restructured Flores-200 dataset.
Added JSON stats files (eng_stats.json, ja_stats.json, ko_stats.json) reporting transcription statistics for English, Japanese, and Korean data.

Miscellaneous:

Updated .gitignore to exclude PNG files generated by plotting scripts.

These changes provide a more robust framework for analyzing, processing, and visualizing the Flores-200 restructured dataset, making it easier to work with language-script data and phonetic transcriptions.

Each script now takes a common stats_json argument for emitting the number of tokens not transcribed (those that will be held for byte tokenization). Also added an espeak2ipa.py script that can target any of the espeak languages; it defaults to Shan for now.
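To make the stats output concrete, here is a minimal sketch of how transcribed versus non-transcribed content could be counted in bytes and written to a JSON file. The function names, the `(text, was_transcribed)` segment format, and the JSON keys are all assumptions for illustration, not the format the scripts in this PR actually emit.

```python
import json


def byte_coverage(segments):
    """Count UTF-8 bytes for transcribed and non-transcribed segments.

    `segments` is an iterable of (text, was_transcribed) pairs; this input
    shape is an assumption made for the sketch.
    """
    transcribed = 0
    not_transcribed = 0
    for text, was_transcribed in segments:
        n = len(text.encode("utf-8"))
        if was_transcribed:
            transcribed += n
        else:
            not_transcribed += n
    total = transcribed + not_transcribed
    return {
        # Hypothetical key names.
        "transcribed_bytes": transcribed,
        "not_transcribed_bytes": not_transcribed,
        "pct_not_transcribed": (100.0 * not_transcribed / total) if total else 0.0,
    }


def write_stats(stats, path):
    """Write the stats dictionary to `path` as JSON."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(stats, f, indent=2)
```

Counting in UTF-8 bytes rather than characters matches the idea that untranscribed content falls back to byte tokenization, where a single CJK character costs several bytes.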

Copilot

Pull request overview

This pull request adds a comprehensive suite of analysis, visualization, and processing scripts for the Flores-200 restructured dataset. The changes introduce utilities for language-script analysis, IPA phoneticization with byte coverage statistics, tokenization comparison, and interactive visualizations.

Key changes:

Added byte coverage statistics tracking to all IPA transcription scripts (Chinese, Korean, Japanese, English, and generic espeak-based)
Created new plotting utilities for visualizing dataset sizes grouped by script/region/family and comparing tokenization methods
Introduced shell scripts to automate graph generation and IPA processing workflows
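The grouping step behind the size plots can be sketched as a simple aggregation. This is an illustration only: the entry keys `script` and `size_bytes` and the function `group_sizes_by_script` are hypothetical, and the plotting itself is left out.

```python
from collections import defaultdict


def group_sizes_by_script(entries):
    """Sum size_bytes per script; return (script, total) pairs, largest first.

    Ties are broken alphabetically by script name so the order is stable.
    """
    totals = defaultdict(int)
    for entry in entries:
        totals[entry["script"]] += entry["size_bytes"]
    return sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))


if __name__ == "__main__":
    # Made-up sizes for illustration only.
    sample = [
        {"script": "Latn", "size_bytes": 100},
        {"script": "Cyrl", "size_bytes": 250},
        {"script": "Latn", "size_bytes": 50},
    ]
    for script, total in group_sizes_by_script(sample):
        print(script, total)
```

The same aggregation would work for region or family by swapping the key used for grouping.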

Reviewed changes

Copilot reviewed 25 out of 25 changed files in this pull request and generated 2 comments.


File	Description
`data/template/utils/zh_to_ipa.py`	Enhanced with byte coverage stats tracking for transcribed vs non-transcribed content
`data/template/utils/ko_en_to_ipa.py`	Added stats tracking and helper functions for Korean IPA transcription
`data/template/utils/ja2ipa.py`	Integrated byte coverage statistics and simplified conditional logic
`data/template/utils/en2ipa.py`	Added stats tracking with thread-safe accumulation for English transcription
`data/template/utils/espeak2ipa.py`	New generic IPA transcription tool supporting any espeak-ng voice with multithreading
`data/flores200-res/plot_langscript_sizes_grouped.py`	Visualizes language-script sizes grouped by region/script/family
`data/flores200-res/plot_multi_script_languages.py`	Plots languages appearing in multiple scripts with fixed color mapping
`data/flores200-res/plot_tokenization_vs_original.py`	Compares tokenized vs original text sizes across methods
`data/flores200-res/plot_ipa_vs_text.py`	Analyzes IPA vs raw text sizes with optional tokenization comparison
`data/flores200-res/tokenize_and_annotate_sizes.py`	Tokenizes files and annotates JSON with tokenized sizes
`data/flores200-res/spm_vocab_freq_dashboard.py`	Interactive HTML dashboard for SentencePiece vocabulary analysis
`data/flores200-res/filter_files_by_script.py`	Extracts script/language fields from files.json
`data/flores200-res/phoneticize.sh`	Updated to process multiple languages with stats output
`data/flores200-res/graphs.sh`	Automates generation of various dataset visualizations
`data/flores200-res/ipa_scripts.sh`	Shell script for IPA vs text comparison workflows
`data/flores200-res/tokenize.sh`	Wrapper for tokenization with tiktoken
`data/flores200-res/tokenization_vs_origina.sh`	Generates tokenization ratio plots (filename has typo)
`data/flores200-res/get_dataset.sh`	Expanded language array with additional languages
`data/flores200-res/*.json`	Added stats files and filtered dataset entries
`data/flores200-res/README.md`	Documentation for scripts, license, and language code references
`data/flores200-res/.gitignore`	Excludes generated PNG files
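The "thread-safe accumulation" noted for en2ipa.py can be illustrated with a lock-protected counter shared across worker threads. This is a sketch under assumptions, not the code in this PR: `StatsAccumulator` and `process_lines` are hypothetical names, and the ASCII check is only a stand-in for a real transcriber.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class StatsAccumulator:
    """Byte counters that are safe to update from several threads."""

    def __init__(self):
        self._lock = threading.Lock()
        self.transcribed_bytes = 0
        self.not_transcribed_bytes = 0

    def add(self, transcribed, not_transcribed):
        # The lock makes both updates appear as one atomic step.
        with self._lock:
            self.transcribed_bytes += transcribed
            self.not_transcribed_bytes += not_transcribed


def process_lines(lines, workers=4):
    """Process lines in a thread pool and return the shared accumulator."""
    acc = StatsAccumulator()

    def work(line):
        # Stand-in for transcription: ASCII bytes count as "transcribed".
        data = line.encode("utf-8")
        ok = sum(1 for b in data if b < 128)
        acc.add(ok, len(data) - ok)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Consume the iterator so any exception in a worker is raised here.
        list(pool.map(work, lines))
    return acc
```

Keeping the lock inside the accumulator means callers cannot forget to take it, which is the main reason to wrap the counters in a small class.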


Copilot · 2026-01-04T03:13:38Z

data/template/utils/espeak2ipa.py

+import argparse
+import re
+import json
+from typing import List, Optional, Dict, Any, Tuple


Import of 'Tuple' is not used.

Suggested change

-from typing import List, Optional, Dict, Any, Tuple
+from typing import List, Optional, Dict, Any

Copilot · 2026-01-04T03:13:38Z

data/flores200-res/tokenize_and_annotate_sizes.py

+
+import argparse
+import json
+import os


Import of 'os' is not used.

Suggested change

-import os

klei22 added 10 commits December 23, 2025 18:49

Adding stats for en, zh, and ja

21dab62

Add README.md stats and script updates

8a5ec69

Add Yue

8e07a23

Add Yue to get dataset

8346e3a

Add graphs for grouping bytes of languages

fe33152

Add scripts for language analysis

8e9993a

Update ipa visualizations

7449141

Add .gitignore

686f2b2

Add updates to latest scripts

375e5b9

klei22 requested review from Copilot and gkielian January 4, 2026 03:11

Copilot started reviewing on behalf of klei22 January 4, 2026 03:11

Copilot AI reviewed Jan 4, 2026
