Add script for new translation dataset format #657
base: master
Conversation
Pull Request Overview
Adds a CLI utility to download Parquet-based translation datasets, convert them to JSON, and flatten translation fields into a prefixed text format for downstream use.
- Introduces emit_translation_items to flatten translation columns into prefixed lines.
- Adds download_translation_dataset to fetch Parquet shards, convert to JSON, and write output.
- Provides a CLI (main) to configure URL, output path, language/prefix pairs, and append mode.
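As a rough illustration of the flattening step described in the first bullet, here is a minimal sketch; the record layout (a "translation" mapping keyed by language code) and the function signature are assumptions for illustration, not the PR's actual implementation of emit_translation_items.

```python
# Hypothetical sketch of flattening translation columns into prefixed lines.
# Assumes each record carries a "translation" dict keyed by language code;
# the real emit_translation_items in this PR may differ.
def emit_translation_items(record, lang_prefixes):
    """Yield one prefixed text line per configured (language, prefix) pair."""
    translation = record.get("translation", {})
    for lang, prefix in lang_prefixes:
        text = translation.get(lang)
        if text:
            yield f"{prefix}{text}"
```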
```python
return

with open(json_path, "r", encoding="utf-8") as handle:
    records = json.load(handle)
```
Copilot AI · Oct 19, 2025
json.load reads the entire shard into memory, which can cause high memory usage for large files. Consider streaming the JSON to process records incrementally (e.g., using ijson.items(handle, 'item') if the file is a JSON array, or switching convert_to_json to emit NDJSON and iterating line by line) so the whole dataset never has to be held in memory.
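A minimal sketch of the streaming alternative mentioned above, assuming the shard is a top-level JSON array and the optional third-party ijson package is available; the helper name is illustrative.

```python
# Illustrative only: stream records from a JSON array shard instead of
# loading the whole file with json.load. Requires the ijson package.
import ijson

def iter_records(json_path):
    """Yield records one at a time to keep memory usage flat."""
    with open(json_path, "rb") as handle:
        for record in ijson.items(handle, "item"):
            yield record
```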
```python
download_dir = "./downloaded_parquets"
json_dir = "./json_output"
os.makedirs(download_dir, exist_ok=True)
os.makedirs(json_dir, exist_ok=True)
```
Copilot AI · Oct 19, 2025
download_dir and json_dir are hard-coded, which limits reuse and makes it harder to control outputs in different environments. Expose these as function parameters and CLI options (with sensible defaults), so callers can direct intermediate files to desired locations or temporary directories.
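One way to surface those directories, sketched with argparse below; the flag names and defaults are illustrative rather than the PR's actual CLI.

```python
# Illustrative sketch: expose the intermediate directories as CLI options
# with sensible defaults instead of hard-coding them.
import argparse
import os

parser = argparse.ArgumentParser()
parser.add_argument("--download-dir", default="./downloaded_parquets",
                    help="directory for fetched Parquet shards")
parser.add_argument("--json-dir", default="./json_output",
                    help="directory for intermediate JSON files")
args = parser.parse_args()

os.makedirs(args.download_dir, exist_ok=True)
os.makedirs(args.json_dir, exist_ok=True)
```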
```python
records = json.load(handle)

if not isinstance(records, list):
    return
```
Copilot AI · Oct 19, 2025
Silently returning on unexpected JSON structure makes failures hard to diagnose. Raise a ValueError (or log a clear message) that includes json_path and the observed type to help users identify malformed inputs.
Suggested change:

```diff
-    return
+    raise ValueError(
+        f"Expected a list at top level in JSON file '{json_path}', but got {type(records).__name__}. "
+        "Please check that the input file is correctly formatted."
+    )
```
```python
if not append:
    open(output_text_file, "w", encoding="utf-8").close()

for link in parquet_links:
```
Copilot AI · Oct 19, 2025
To ensure deterministic processing and output ordering, iterate over sorted(parquet_links). This avoids variability if find_parquet_links returns links in a non-stable order.
Suggested change:

```diff
-for link in parquet_links:
+for link in sorted(parquet_links):
```
No description provided.