Add script for new translation dataset format #657
base: master
Conversation
Pull Request Overview
Adds a CLI utility to download Parquet-based translation datasets, convert them to JSON, and flatten translation fields into a prefixed text format for downstream use.
- Introduces emit_translation_items to flatten translation columns into prefixed lines.
- Adds download_translation_dataset to fetch Parquet shards, convert to JSON, and write output.
- Provides a CLI (main) to configure URL, output path, language/prefix pairs, and append mode.
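As a rough illustration of the flattening step described in the first bullet, here is a minimal sketch; the record layout (a "translation" mapping keyed by language code) and the function signature are assumptions for illustration, not the PR's actual implementation of emit_translation_items.

```python
# Hypothetical sketch of flattening translation columns into prefixed lines.
# Assumes each record carries a "translation" dict keyed by language code;
# the real emit_translation_items in this PR may differ.
def emit_translation_items(record, lang_prefixes):
    """Yield one prefixed text line per configured (language, prefix) pair."""
    translation = record.get("translation", {})
    for lang, prefix in lang_prefixes:
        text = translation.get(lang)
        if text:
            yield f"{prefix}{text}"
```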
```python
return

with open(json_path, "r", encoding="utf-8") as handle:
    records = json.load(handle)
```
Copilot AI · Oct 19, 2025
json.load reads the entire shard into memory, which can cause high memory usage for large files. Consider streaming the JSON to process records incrementally (e.g., using ijson.items(handle, 'item') if the file is a JSON array, or switching convert_to_json to emit NDJSON and iterating line by line) so the whole dataset never has to be held in memory.
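A minimal sketch of the streaming alternative mentioned above, assuming the shard is a top-level JSON array and the optional third-party ijson package is available; the helper name is illustrative.

```python
# Illustrative only: stream records from a JSON array shard instead of
# loading the whole file with json.load. Requires the ijson package.
import ijson

def iter_records(json_path):
    """Yield records one at a time to keep memory usage flat."""
    with open(json_path, "rb") as handle:
        for record in ijson.items(handle, "item"):
            yield record
```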
```python
download_dir = "./downloaded_parquets"
json_dir = "./json_output"
os.makedirs(download_dir, exist_ok=True)
os.makedirs(json_dir, exist_ok=True)
```
Copilot AI · Oct 19, 2025
download_dir and json_dir are hard-coded, which limits reuse and makes it harder to control outputs in different environments. Expose these as function parameters and CLI options (with sensible defaults), so callers can direct intermediate files to desired locations or temporary directories.
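One way to surface those directories, sketched with argparse below; the flag names and defaults are illustrative rather than the PR's actual CLI.

```python
# Illustrative sketch: expose the intermediate directories as CLI options
# with sensible defaults instead of hard-coding them.
import argparse
import os

parser = argparse.ArgumentParser()
parser.add_argument("--download-dir", default="./downloaded_parquets",
                    help="directory for fetched Parquet shards")
parser.add_argument("--json-dir", default="./json_output",
                    help="directory for intermediate JSON files")
args = parser.parse_args()

os.makedirs(args.download_dir, exist_ok=True)
os.makedirs(args.json_dir, exist_ok=True)
```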
```python
records = json.load(handle)

if not isinstance(records, list):
    return
```
Copilot AI · Oct 19, 2025
Silently returning on unexpected JSON structure makes failures hard to diagnose. Raise a ValueError (or log a clear message) that includes json_path and the observed type to help users identify malformed inputs.
Suggested change:

```diff
-    return
+    raise ValueError(
+        f"Expected a list at top level in JSON file '{json_path}', but got {type(records).__name__}. "
+        "Please check that the input file is correctly formatted."
+    )
```
```python
if not append:
    open(output_text_file, "w", encoding="utf-8").close()

for link in parquet_links:
```
Copilot AI · Oct 19, 2025
To ensure deterministic processing and output ordering, iterate over sorted(parquet_links). This avoids variability if find_parquet_links returns links in a non-stable order.
Suggested change:

```diff
-for link in parquet_links:
+for link in sorted(parquet_links):
```
No description provided.