Add script for new translation dataset format #657
base: master
@@ -0,0 +1,145 @@
```python
"""Utilities for flattening translation columns in Parquet datasets."""
from __future__ import annotations

import argparse
import json
import os
from typing import Iterable, Sequence, Tuple

from .get_parquet_dataset import convert_to_json, download_file, find_parquet_links


def emit_translation_items(
    json_path: str,
    output_path: str,
    language_prefixes: Sequence[Tuple[str, str]],
) -> None:
    """Emit flattened translation rows from ``json_path`` into ``output_path``.

    Parameters
    ----------
    json_path:
        Path to the JSON file produced from a Parquet shard.
    output_path:
        File where the flattened text should be appended.
    language_prefixes:
        Ordered collection of (language, prefix) tuples. Each translation entry
        writes one line per language using the associated prefix when the
        translation text is present.
    """
    if not language_prefixes:
        return

    with open(json_path, "r", encoding="utf-8") as handle:
        records = json.load(handle)

    if not isinstance(records, list):
        return
```
Suggested change:
```diff
-        return
+        raise ValueError(
+            f"Expected a list at top level in JSON file '{json_path}', but got {type(records).__name__}. "
+            "Please check that the input file is correctly formatted."
+        )
```
Copilot AI (Oct 19, 2025):
download_dir and json_dir are hard-coded, which limits reuse and makes it harder to control outputs in different environments. Expose these as function parameters and CLI options (with sensible defaults), so callers can direct intermediate files to desired locations or temporary directories.
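One way to address this, sketched below with hypothetical option names and defaults (the PR's actual flag names are not shown in this excerpt), is to surface both directories through `argparse`, which the script already imports:

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical CLI wiring: expose the previously hard-coded
    # directories as options with sensible defaults.
    parser = argparse.ArgumentParser(
        description="Flatten translation columns from Parquet shards"
    )
    parser.add_argument(
        "--download-dir",
        default="downloads",
        help="directory for downloaded Parquet shards",
    )
    parser.add_argument(
        "--json-dir",
        default="json",
        help="directory for intermediate JSON files",
    )
    return parser


# Callers (or tests) can now redirect intermediate files anywhere:
args = build_parser().parse_args(["--download-dir", "/tmp/shards"])
```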
Copilot AI (Oct 19, 2025):
To ensure deterministic processing and output ordering, iterate over sorted(parquet_links). This avoids variability if find_parquet_links returns links in a non-stable order.
Suggested change:
```diff
-    for link in parquet_links:
+    for link in sorted(parquet_links):
```
Copilot AI (Oct 19, 2025):
json.load loads the entire shard into memory, which can be large and cause high memory usage. Consider streaming the JSON to process records incrementally (e.g., using ijson.items(handle, 'item') if the file is a JSON array, or switching convert_to_json to emit NDJSON and iterate line-by-line) to avoid holding the whole dataset in memory.
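The NDJSON variant of this suggestion needs no third-party dependency: if `convert_to_json` emitted one JSON object per line, the consumer could iterate records lazily. A minimal sketch (the `iter_ndjson` helper is illustrative, not part of the PR):

```python
import io
import json
from typing import Iterator


def iter_ndjson(handle: io.TextIOBase) -> Iterator[dict]:
    # Yield one record per line so peak memory stays at a single
    # record instead of the whole shard.
    for line in handle:
        line = line.strip()
        if line:
            yield json.loads(line)


# Two NDJSON records, streamed without loading the file at once.
buf = io.StringIO('{"translation": {"en": "a"}}\n{"translation": {"en": "b"}}\n')
records = list(iter_ndjson(buf))
```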