Add scripts compat with hindi eng dataset #661
base: master
Conversation
Pull Request Overview
This PR adds support for the IITB English–Hindi Parallel Corpus dataset by providing automated download and preparation scripts. The implementation follows a template-based approach for consistency across translation datasets.
- Introduces a reusable Python utility for downloading and flattening translation datasets from Parquet format
- Creates dataset-specific configuration using symlinks to shared template scripts
- Adds comprehensive documentation for the IITB English–Hindi corpus including usage examples and citation information
Reviewed Changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| data/template/utils/get_translation_parquet_dataset.py | New utility for downloading Parquet datasets from Hugging Face and converting them to prefixed text format |
| data/iitb-english-hindi/utils | Symlink to shared template utilities directory |
| data/iitb-english-hindi/prepare.py | Symlink to shared template preparation script |
| data/iitb-english-hindi/get_dataset.sh | Shell script that configures and runs the download utility for the IITB dataset |
| data/iitb-english-hindi/README.md | Documentation for the IITB English–Hindi dataset including overview, examples, and citation |
```shell
python utils/get_translation_parquet_dataset.py \
    --url "$URL" \
    --prefix en $'\nEN: ' \
    --prefix hi $'HI: ' \
```
Copilot AI (Oct 19, 2025)
The English prefix includes a leading newline (`$'\nEN: '`) while the Hindi prefix does not (`$'HI: '`). This inconsistency means the first English segment is separated from preceding content by an extra blank line, while Hindi segments have no such leading separation. Either both prefixes should include the leading newline, or neither should.
Suggested change:
```diff
- --prefix hi $'HI: ' \
+ --prefix hi $'\nHI: ' \
```
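To make the reviewer's point concrete, here is a small illustrative sketch of how the two prefix styles affect the emitted text. The function and names below are hypothetical, not the PR's actual implementation:

```python
# Hypothetical helper: concatenate one English/Hindi sentence pair,
# each line prepended with its configured prefix.
def emit_pair(en_text: str, hi_text: str, en_prefix: str, hi_prefix: str) -> str:
    return f"{en_prefix}{en_text}\n{hi_prefix}{hi_text}\n"

# Inconsistent prefixes (as submitted): only the EN record gets a blank-line separator.
inconsistent = emit_pair("Hello.", "Namaste.", "\nEN: ", "HI: ")
# -> "\nEN: Hello.\nHI: Namaste.\n"

# Consistent prefixes (as suggested): both records are separated the same way.
consistent = emit_pair("Hello.", "Namaste.", "\nEN: ", "\nHI: ")
# -> "\nEN: Hello.\n\nHI: Namaste.\n"
```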
```python
def main() -> None:
    parser = argparse.ArgumentParser(
        description=(
            "Download Europarl-style translation Parquet files and emit prefixed text."
```
Copilot AI (Oct 19, 2025)
The description references 'Europarl-style' which is misleading since this utility is generic and works with any translation dataset in Parquet format, not just Europarl. The description should be updated to reflect its general-purpose nature.
Suggested change:
```diff
-            "Download Europarl-style translation Parquet files and emit prefixed text."
+            "Download translation Parquet files from any supported dataset and emit prefixed text."
```
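For context, the repeated `--prefix LANG PREFIX` pairs seen in `get_dataset.sh` could be collected with argparse roughly as follows. This is an assumed sketch based on the CLI shown in the diff, not the PR's exact code:

```python
import argparse

# Assumed CLI: --url plus one or more --prefix LANG PREFIX pairs.
parser = argparse.ArgumentParser(
    description="Download translation Parquet files and emit prefixed text."
)
parser.add_argument("--url", required=True, help="Dataset download URL")
parser.add_argument(
    "--prefix",
    nargs=2,
    action="append",
    metavar=("LANG", "PREFIX"),
    default=[],
    help="Language code and the prefix to prepend to each record",
)

# Example invocation (URL is a placeholder):
args = parser.parse_args(
    ["--url", "https://example.org/data.parquet",
     "--prefix", "en", "\nEN: ",
     "--prefix", "hi", "\nHI: "]
)
language_prefixes = dict(args.prefix)  # {'en': '\nEN: ', 'hi': '\nHI: '}
```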
```python
    emit_translation_items(json_path, output_text_file, language_prefixes)
```
Copilot AI (Oct 19, 2025)
Extra blank line before function definition violates PEP 8 style guidelines, which recommends exactly two blank lines between top-level function definitions.
This pull request adds support for the IITB English–Hindi Parallel Corpus dataset, including documentation, scripts, and utilities for downloading and preparing the data. The main changes introduce a standardized workflow for fetching and flattening translation datasets from Hugging Face, leveraging reusable template utilities.
Documentation and Usage:
- Added a README.md for the IITB English–Hindi dataset, detailing its source, structure, usage examples, and citation information.

Dataset Preparation Workflow:
- Added a shell script (get_dataset.sh) to automate downloading and flattening the dataset using a Python utility.
- Symlinked the prepare.py script and utils directory from a shared template for consistent dataset preparation. [1] [2]

Reusable Utility Implementation:
- Introduced get_translation_parquet_dataset.py, a robust utility for downloading, converting, and flattening translation Parquet datasets. This script handles fetching files, converting them to JSON, and emitting prefixed text records for each language.
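The final "emit prefixed text records" step could look roughly like the sketch below. The record layout (a `"translation"` mapping per row, as in Hugging Face translation datasets) and the function signature are assumptions inferred from the PR summary, not the actual implementation:

```python
import io

def emit_translation_items(json_records, out, language_prefixes):
    """For each record, write every configured language's text with its prefix.

    json_records: iterable of dicts like {"translation": {"en": ..., "hi": ...}}
    out: a writable text stream
    language_prefixes: mapping of language code -> prefix string
    """
    for record in json_records:
        translation = record["translation"]
        for lang, prefix in language_prefixes.items():
            out.write(f"{prefix}{translation[lang]}\n")

# Usage example with an in-memory buffer and placeholder data:
records = [{"translation": {"en": "Hello.", "hi": "Namaste."}}]
buf = io.StringIO()
emit_translation_items(records, buf, {"en": "\nEN: ", "hi": "\nHI: "})
# buf now holds "\nEN: Hello.\n\nHI: Namaste.\n"
```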