Skip to content

Conversation

@klei22
Copy link
Collaborator

@klei22 klei22 commented Oct 19, 2025

No description provided.

@klei22 klei22 requested review from Copilot and gkielian October 19, 2025 00:23
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

Adds a CLI utility to download Parquet-based translation datasets, convert them to JSON, and flatten translation fields into a prefixed text format for downstream use.

  • Introduces emit_translation_items to flatten translation columns into prefixed lines.
  • Adds download_translation_dataset to fetch Parquet shards, convert to JSON, and write output.
  • Provides a CLI (main) to configure URL, output path, language/prefix pairs, and append mode.

Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.

return

with open(json_path, "r", encoding="utf-8") as handle:
records = json.load(handle)
Copy link

Copilot AI Oct 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

json.load loads the entire shard into memory, which can be large and cause high memory usage. Consider streaming the JSON to process records incrementally (e.g., using ijson.items(handle, 'item') if the file is a JSON array, or switching convert_to_json to emit NDJSON and iterate line-by-line) to avoid holding the whole dataset in memory.

Copilot uses AI. Check for mistakes.
Comment on lines +72 to +75
download_dir = "./downloaded_parquets"
json_dir = "./json_output"
os.makedirs(download_dir, exist_ok=True)
os.makedirs(json_dir, exist_ok=True)
Copy link

Copilot AI Oct 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

download_dir and json_dir are hard-coded, which limits reuse and makes it harder to control outputs in different environments. Expose these as function parameters and CLI options (with sensible defaults), so callers can direct intermediate files to desired locations or temporary directories.

Copilot uses AI. Check for mistakes.
records = json.load(handle)

if not isinstance(records, list):
return
Copy link

Copilot AI Oct 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Silently returning on unexpected JSON structure makes failures hard to diagnose. Raise a ValueError (or log a clear message) that includes json_path and the observed type to help users identify malformed inputs.

Suggested change
return
raise ValueError(
f"Expected a list at top level in JSON file '{json_path}', but got {type(records).__name__}. "
"Please check that the input file is correctly formatted."
)

Copilot uses AI. Check for mistakes.
if not append:
open(output_text_file, "w", encoding="utf-8").close()

for link in parquet_links:
Copy link

Copilot AI Oct 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To ensure deterministic processing and output ordering, iterate over sorted(parquet_links). This avoids variability if find_parquet_links returns links in a non-stable order.

Suggested change
for link in parquet_links:
for link in sorted(parquet_links):

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant