
Conversation

@klei22 (Collaborator) commented Jan 4, 2026

This pull request introduces several new scripts and updates for working with the Flores-200 restructured dataset, focusing on language-script analysis, phoneticization, and visualization. It adds new Python utilities for filtering and plotting dataset statistics, updates shell scripts for processing and analyzing data, and includes new documentation and stats files.

New analysis and plotting scripts:

  • Added plot_langscript_sizes_grouped.py, a Python script for visualizing language-script dataset sizes grouped and colored by script, region, or family. It includes logic for mapping scripts to regions and families and generates grouped bar plots.
  • Added filter_files_by_script.py, a Python utility to extract relevant fields from files.json for script/language analysis, outputting a simplified JSON for downstream analysis.
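The field-extraction step in filter_files_by_script.py can be sketched as follows; the actual files.json schema and field names (`path`, `language`, `script`) are assumptions for illustration, not taken from the PR:

```python
import json

# Hypothetical sketch of the kind of field extraction filter_files_by_script.py
# performs: keep only the script/language-relevant keys from each entry.
def filter_fields(entries, fields=("path", "language", "script")):
    """Return entries reduced to the listed fields (missing keys are skipped)."""
    return [{k: e[k] for k in fields if k in e} for e in entries]

entries = [
    {"path": "eng_Latn.txt", "language": "eng", "script": "Latn", "size": 1024},
    {"path": "shn_Mymr.txt", "language": "shn", "script": "Mymr", "size": 2048},
]
print(json.dumps(filter_fields(entries), indent=2))
```

The simplified JSON this emits is what the downstream plotting scripts would consume.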

Shell script improvements and additions:

  • Updated phoneticize.sh to process multiple languages using espeak2ipa.py, with improved logic for handling multiple files and stats output.
  • Added graphs.sh and ipa_scripts.sh shell scripts to automate generation of visualizations and IPA/text comparisons using the new Python plotting utilities. [1] [2]

Documentation and dataset updates:

  • Added a new README.md describing the scripts, dataset license, and language code references for the restructured Flores-200 dataset.
  • Added JSON stats files (eng_stats.json, ja_stats.json, ko_stats.json) reporting transcription statistics for English, Japanese, and Korean data. [1] [2] [3]

Miscellaneous:

  • Updated .gitignore to exclude PNG files generated by plotting scripts.

These changes provide a more robust framework for analyzing, processing, and visualizing the Flores-200 restructured dataset, making it easier to work with language-script data and phonetic transcriptions.

klei22 added 10 commits December 23, 2025 18:49
Each script now has a common stats_json argument, which emits the number of tokens not transcribed (those that will be held for byte tokenization).
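The stats_json idea described in this commit message can be sketched roughly as below; the counter names, the JSON schema, and the `transcribe` callback are assumptions, not the PR's actual implementation:

```python
import json

def coverage_stats(tokens, transcribe):
    """Count tokens a transcriber handles versus those held back for
    byte-level tokenization, tracking the UTF-8 bytes of the latter.

    `transcribe` returns an IPA string, or None when it cannot transcribe."""
    stats = {
        "transcribed": 0,
        "not_transcribed": 0,
        "bytes_held_for_byte_tokenization": 0,
    }
    for tok in tokens:
        if transcribe(tok) is not None:
            stats["transcribed"] += 1
        else:
            stats["not_transcribed"] += 1
            stats["bytes_held_for_byte_tokenization"] += len(tok.encode("utf-8"))
    return stats

# Toy transcriber that "handles" only ASCII tokens; the Shan token falls
# through to byte tokenization (three 3-byte UTF-8 code points).
stats = coverage_stats(["hello", "ᥔᥣᥛ"], lambda t: t if t.isascii() else None)
print(json.dumps(stats))
```

A real script would write this dict to the path given by `--stats_json`, producing files like the eng_stats.json / ja_stats.json / ko_stats.json added in this PR.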

Added an espeak2ipa.py script that can target any of the espeak languages; it defaults to targeting Shan for now.

Copilot AI left a comment


Pull request overview

This pull request adds a comprehensive suite of analysis, visualization, and processing scripts for the Flores-200 restructured dataset. The changes introduce utilities for language-script analysis, IPA phoneticization with byte coverage statistics, tokenization comparison, and interactive visualizations.

Key changes:

  • Added byte coverage statistics tracking to all IPA transcription scripts (Chinese, Korean, Japanese, English, and generic espeak-based)
  • Created new plotting utilities for visualizing dataset sizes grouped by script/region/family and comparing tokenization methods
  • Introduced shell scripts to automate graph generation and IPA processing workflows
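The grouping step behind the grouped bar plots can be sketched as a plain aggregation; the script-to-region mapping below is illustrative only, not the PR's actual table:

```python
from collections import defaultdict

# Illustrative mapping from ISO 15924 script codes to a grouping key (region);
# plot_langscript_sizes_grouped.py can also group by script or family.
SCRIPT_TO_REGION = {
    "Latn": "Europe/Global",
    "Mymr": "Southeast Asia",
    "Hang": "East Asia",
}

def group_sizes(entries, mapping, default="Other"):
    """Sum per-file dataset sizes under each script's mapped group."""
    totals = defaultdict(int)
    for e in entries:
        totals[mapping.get(e["script"], default)] += e["size"]
    return dict(totals)

sizes = group_sizes(
    [{"script": "Latn", "size": 10}, {"script": "Mymr", "size": 5},
     {"script": "Latn", "size": 7}, {"script": "Thaa", "size": 2}],
    SCRIPT_TO_REGION,
)
print(sizes)  # {'Europe/Global': 17, 'Southeast Asia': 5, 'Other': 2}
```

The grouped totals then feed a bar plot, one bar cluster per group, colored by the grouping key.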

Reviewed changes

Copilot reviewed 25 out of 25 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| `data/template/utils/zh_to_ipa.py` | Enhanced with byte coverage stats tracking for transcribed vs. non-transcribed content |
| `data/template/utils/ko_en_to_ipa.py` | Added stats tracking and helper functions for Korean IPA transcription |
| `data/template/utils/ja2ipa.py` | Integrated byte coverage statistics and simplified conditional logic |
| `data/template/utils/en2ipa.py` | Added stats tracking with thread-safe accumulation for English transcription |
| `data/template/utils/espeak2ipa.py` | New generic IPA transcription tool supporting any espeak-ng voice, with multithreading |
| `data/flores200-res/plot_langscript_sizes_grouped.py` | Visualizes language-script sizes grouped by region/script/family |
| `data/flores200-res/plot_multi_script_languages.py` | Plots languages appearing in multiple scripts with a fixed color mapping |
| `data/flores200-res/plot_tokenization_vs_original.py` | Compares tokenized vs. original text sizes across methods |
| `data/flores200-res/plot_ipa_vs_text.py` | Analyzes IPA vs. raw text sizes with optional tokenization comparison |
| `data/flores200-res/tokenize_and_annotate_sizes.py` | Tokenizes files and annotates JSON with tokenized sizes |
| `data/flores200-res/spm_vocab_freq_dashboard.py` | Interactive HTML dashboard for SentencePiece vocabulary analysis |
| `data/flores200-res/filter_files_by_script.py` | Extracts script/language fields from files.json |
| `data/flores200-res/phoneticize.sh` | Updated to process multiple languages with stats output |
| `data/flores200-res/graphs.sh` | Automates generation of various dataset visualizations |
| `data/flores200-res/ipa_scripts.sh` | Shell script for IPA vs. text comparison workflows |
| `data/flores200-res/tokenize.sh` | Wrapper for tokenization with tiktoken |
| `data/flores200-res/tokenization_vs_origina.sh` | Generates tokenization ratio plots (filename has a typo) |
| `data/flores200-res/get_dataset.sh` | Expanded language array with additional languages |
| `data/flores200-res/*.json` | Added stats files and filtered dataset entries |
| `data/flores200-res/README.md` | Documentation for scripts, license, and language code references |
| `data/flores200-res/.gitignore` | Excludes generated PNG files |


import argparse
import re
import json
from typing import List, Optional, Dict, Any, Tuple

Copilot AI Jan 4, 2026


Import of 'Tuple' is not used.

Suggested change:
- from typing import List, Optional, Dict, Any, Tuple
+ from typing import List, Optional, Dict, Any


import argparse
import json
import os

Copilot AI Jan 4, 2026


Import of 'os' is not used.

Suggested change:
- import os

