Add scripts compat with hindi eng dataset #661
base: master
Conversation
Pull Request Overview
This PR adds support for the IITB English–Hindi Parallel Corpus dataset by providing automated download and preparation scripts. The implementation follows a template-based approach for consistency across translation datasets.
- Introduces a reusable Python utility for downloading and flattening translation datasets from Parquet format
- Creates dataset-specific configuration using symlinks to shared template scripts
- Adds comprehensive documentation for the IITB English–Hindi corpus including usage examples and citation information
Reviewed Changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| data/template/utils/get_translation_parquet_dataset.py | New utility for downloading Parquet datasets from Hugging Face and converting them to prefixed text format |
| data/iitb-english-hindi/utils | Symlink to shared template utilities directory |
| data/iitb-english-hindi/prepare.py | Symlink to shared template preparation script |
| data/iitb-english-hindi/get_dataset.sh | Shell script that configures and runs the download utility for the IITB dataset |
| data/iitb-english-hindi/README.md | Documentation for the IITB English–Hindi dataset including overview, examples, and citation |
```shell
python utils/get_translation_parquet_dataset.py \
    --url "$URL" \
    --prefix en $'\nEN: ' \
    --prefix hi $'HI: ' \
```
Copilot AI (Oct 19, 2025)
The English prefix includes a leading newline (`$'\nEN: '`) while the Hindi prefix does not (`$'HI: '`). This inconsistency means the first English segment is separated from preceding content by an extra blank line, while Hindi segments have no such leading separation. Either both prefixes should include the leading newline, or neither should.
Suggested change:
```diff
- --prefix hi $'HI: ' \
+ --prefix hi $'\nHI: ' \
```
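To make the reviewer's point concrete, here is a small illustrative sketch of how the two prefix styles affect the emitted text. The function and names below are hypothetical, not the PR's actual implementation:

```python
# Hypothetical helper: concatenate one English/Hindi sentence pair,
# each line prepended with its configured prefix.
def emit_pair(en_text: str, hi_text: str, en_prefix: str, hi_prefix: str) -> str:
    return f"{en_prefix}{en_text}\n{hi_prefix}{hi_text}\n"

# Inconsistent prefixes (as submitted): only the EN record gets a blank-line separator.
inconsistent = emit_pair("Hello.", "Namaste.", "\nEN: ", "HI: ")
# -> "\nEN: Hello.\nHI: Namaste.\n"

# Consistent prefixes (as suggested): both records are separated the same way.
consistent = emit_pair("Hello.", "Namaste.", "\nEN: ", "\nHI: ")
# -> "\nEN: Hello.\n\nHI: Namaste.\n"
```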
```python
def main() -> None:
    parser = argparse.ArgumentParser(
        description=(
            "Download Europarl-style translation Parquet files and emit prefixed text."
```
Copilot AI (Oct 19, 2025)
The description references 'Europarl-style' which is misleading since this utility is generic and works with any translation dataset in Parquet format, not just Europarl. The description should be updated to reflect its general-purpose nature.
Suggested change:
```diff
-            "Download Europarl-style translation Parquet files and emit prefixed text."
+            "Download translation Parquet files from any supported dataset and emit prefixed text."
```
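For context, the repeated `--prefix LANG PREFIX` pairs seen in `get_dataset.sh` could be collected with argparse roughly as follows. This is an assumed sketch based on the CLI shown in the diff, not the PR's exact code:

```python
import argparse

# Assumed CLI: --url plus one or more --prefix LANG PREFIX pairs.
parser = argparse.ArgumentParser(
    description="Download translation Parquet files and emit prefixed text."
)
parser.add_argument("--url", required=True, help="Dataset download URL")
parser.add_argument(
    "--prefix",
    nargs=2,
    action="append",
    metavar=("LANG", "PREFIX"),
    default=[],
    help="Language code and the prefix to prepend to each record",
)

# Example invocation (URL is a placeholder):
args = parser.parse_args(
    ["--url", "https://example.org/data.parquet",
     "--prefix", "en", "\nEN: ",
     "--prefix", "hi", "\nHI: "]
)
language_prefixes = dict(args.prefix)  # {'en': '\nEN: ', 'hi': '\nHI: '}
```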
```python
    emit_translation_items(json_path, output_text_file, language_prefixes)
```
Copilot AI (Oct 19, 2025)
Extra blank line before function definition violates PEP 8 style guidelines, which recommends exactly two blank lines between top-level function definitions.
This pull request adds support for the IITB English–Hindi Parallel Corpus dataset, including documentation, scripts, and utilities for downloading and preparing the data. The main changes introduce a standardized workflow for fetching and flattening translation datasets from Hugging Face, leveraging reusable template utilities.
Documentation and Usage:
- Added a README.md for the IITB English–Hindi dataset, detailing its source, structure, usage examples, and citation information.

Dataset Preparation Workflow:
- Added a shell script (get_dataset.sh) to automate downloading and flattening the dataset using a Python utility.
- Symlinked the prepare.py script and utils directory from a shared template for consistent dataset preparation. [1] [2]

Reusable Utility Implementation:
- Introduced get_translation_parquet_dataset.py, a robust utility for downloading, converting, and flattening translation Parquet datasets. This script handles fetching files, converting them to JSON, and emitting prefixed text records for each language.
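The final "emit prefixed text records" step could look roughly like the sketch below. The record layout (a `"translation"` mapping per row, as in Hugging Face translation datasets) and the function signature are assumptions inferred from the PR summary, not the actual implementation:

```python
import io

def emit_translation_items(json_records, out, language_prefixes):
    """For each record, write every configured language's text with its prefix.

    json_records: iterable of dicts like {"translation": {"en": ..., "hi": ...}}
    out: a writable text stream
    language_prefixes: mapping of language code -> prefix string
    """
    for record in json_records:
        translation = record["translation"]
        for lang, prefix in language_prefixes.items():
            out.write(f"{prefix}{translation[lang]}\n")

# Usage example with an in-memory buffer and placeholder data:
records = [{"translation": {"en": "Hello.", "hi": "Namaste."}}]
buf = io.StringIO()
emit_translation_items(records, buf, {"en": "\nEN: ", "hi": "\nHI: "})
# buf now holds "\nEN: Hello.\n\nHI: Namaste.\n"
```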