@klei22 klei22 commented Oct 19, 2025

This pull request adds support for the IITB English–Hindi Parallel Corpus dataset, including documentation, scripts, and utilities for downloading and preparing the data. The main changes introduce a standardized workflow for fetching and flattening translation datasets from Hugging Face, leveraging reusable template utilities.

Documentation and Usage:

  • Added a comprehensive README.md for the IITB English–Hindi dataset, detailing its source, structure, usage examples, and citation information.

Dataset Preparation Workflow:

  • Introduced a shell script (get_dataset.sh) to automate downloading and flattening the dataset using a Python utility.
  • Linked a generic prepare.py script and utils directory from a shared template for consistent dataset preparation.

Reusable Utility Implementation:

  • Added get_translation_parquet_dataset.py, a robust utility for downloading, converting, and flattening translation Parquet datasets. This script handles fetching files, converting them to JSON, and emitting prefixed text records for each language.
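
The flattening step described above can be sketched roughly as follows. This is a minimal illustration, not the PR's implementation: the function name `flatten_translation_records` is hypothetical, and the record layout (`{"translation": {"en": ..., "hi": ...}}`) is an assumption based on the usual shape of Hugging Face translation datasets. The download and Parquet-to-JSON steps are omitted.

```python
from typing import Iterable, Iterator


def flatten_translation_records(
    records: Iterable[dict], prefixes: list[tuple[str, str]]
) -> Iterator[str]:
    """Yield one prefixed string per language for each translation record.

    Hypothetical helper: assumes each record looks like
    {"translation": {"en": "...", "hi": "..."}}.
    """
    for record in records:
        translation = record.get("translation", {})
        for lang, prefix in prefixes:
            text = translation.get(lang)
            if text:  # skip missing or empty segments
                yield f"{prefix}{text}"


# Illustrative records, not taken from the real corpus.
records = [
    {"translation": {"en": "Hello.", "hi": "नमस्ते।"}},
    {"translation": {"en": "", "hi": "धन्यवाद।"}},
]
flattened = list(
    flatten_translation_records(records, [("en", "EN: "), ("hi", "HI: ")])
)
print(flattened)
```

Passing the prefixes as an ordered list of pairs keeps the output order (English first, then Hindi) explicit rather than relying on the key order inside each record.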

@klei22 klei22 requested review from Copilot and gkielian October 19, 2025 05:22
Copilot AI left a comment

Pull Request Overview

This PR adds support for the IITB English–Hindi Parallel Corpus dataset by providing automated download and preparation scripts. The implementation follows a template-based approach for consistency across translation datasets.

  • Introduces a reusable Python utility for downloading and flattening translation datasets from Parquet format
  • Creates dataset-specific configuration using symlinks to shared template scripts
  • Adds comprehensive documentation for the IITB English–Hindi corpus including usage examples and citation information

Reviewed Changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| data/template/utils/get_translation_parquet_dataset.py | New utility for downloading Parquet datasets from Hugging Face and converting them to prefixed text format |
| data/iitb-english-hindi/utils | Symlink to shared template utilities directory |
| data/iitb-english-hindi/prepare.py | Symlink to shared template preparation script |
| data/iitb-english-hindi/get_dataset.sh | Shell script that configures and runs the download utility for the IITB dataset |
| data/iitb-english-hindi/README.md | Documentation for the IITB English–Hindi dataset including overview, examples, and citation |

python utils/get_translation_parquet_dataset.py \
  --url "$URL" \
  --prefix en $'\nEN: ' \
  --prefix hi $'HI: ' \
Copilot AI Oct 19, 2025

The English prefix includes a leading newline ($'\nEN: ') while the Hindi prefix does not ($'HI: '). As a result, every English segment is preceded by an extra newline that separates it from the previous content, while Hindi segments get no such separation. Either both prefixes should include the leading newline, or neither should.

Suggested change
- --prefix hi $'HI: ' \
+ --prefix hi $'\nHI: ' \
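
To make the effect of the two prefix choices concrete, here is a small sketch. It assumes the utility writes each item as prefix + text + newline, in the order the prefixes are given; the actual write logic is not shown in this thread, so treat `render` as a hypothetical stand-in and the sample sentences as illustrative.

```python
def render(pairs, prefixes):
    """Concatenate prefix + text + newline per language (assumed behavior)."""
    out = []
    for pair in pairs:
        for lang, prefix in prefixes:
            out.append(f"{prefix}{pair[lang]}\n")
    return "".join(out)


pairs = [
    {"en": "Hello.", "hi": "नमस्ते।"},
    {"en": "Thank you.", "hi": "धन्यवाद।"},
]

# Prefixes as currently written in get_dataset.sh.
asymmetric = render(pairs, [("en", "\nEN: "), ("hi", "HI: ")])
# Prefixes after the suggested change.
symmetric = render(pairs, [("en", "\nEN: "), ("hi", "\nHI: ")])

print(repr(asymmetric))
print(repr(symmetric))
```

Under that assumption, the current prefixes leave a blank line between sentence pairs while keeping each pair's EN and HI lines adjacent, whereas the suggested change puts a blank line before every segment. Which layout is intended is a question for the author.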

def main() -> None:
    parser = argparse.ArgumentParser(
        description=(
            "Download Europarl-style translation Parquet files and emit prefixed text."
Copilot AI Oct 19, 2025

The description refers to 'Europarl-style' files, which is misleading: this utility is generic and works with any translation dataset in Parquet format, not just Europarl. The description should be updated to reflect its general-purpose nature.

Suggested change
- "Download Europarl-style translation Parquet files and emit prefixed text."
+ "Download translation Parquet files from any supported dataset and emit prefixed text."

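
For context, a parser with a dataset-neutral description might look like the sketch below. Only the `--url` and `--prefix` flags are taken from the shell invocation in this PR; the way they are declared here (`nargs=2` with `action="append"`), the help strings, and the placeholder URL are assumptions, not the utility's actual code.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the --url / --prefix flags seen in get_dataset.sh."""
    parser = argparse.ArgumentParser(
        description="Download translation Parquet files and emit prefixed text."
    )
    parser.add_argument("--url", required=True, help="Location of the Parquet file(s).")
    parser.add_argument(
        "--prefix",
        nargs=2,  # each use takes a language code and its prefix text
        action="append",  # the flag may be repeated once per language
        metavar=("LANG", "PREFIX"),
        default=[],
        help="Language key and the text to prepend to each of its segments.",
    )
    return parser


# Placeholder URL for illustration only.
args = build_parser().parse_args(
    [
        "--url", "https://example.invalid/train.parquet",
        "--prefix", "en", "\nEN: ",
        "--prefix", "hi", "HI: ",
    ]
)
language_prefixes = [(lang, prefix) for lang, prefix in args.prefix]
print(language_prefixes)
```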
emit_translation_items(json_path, output_text_file, language_prefixes)



Copilot AI Oct 19, 2025

The extra blank line before the function definition violates PEP 8, which recommends surrounding top-level function definitions with exactly two blank lines.

Suggested change
