Add scripts compat with hindi eng dataset #661
@@ -0,0 +1,68 @@
# IITB English–Hindi Parallel Corpus (cfilt/iitb-english-hindi)

### Dataset Overview

The **IIT Bombay English–Hindi Parallel Corpus** is a large-scale bilingual
dataset created by the **Center for Indian Language Technology (CFILT)** at IIT
Bombay. It contains **1.66 million English–Hindi sentence pairs** collected
from multiple open sources and curated over several years for **machine
translation and linguistic research**.

| Field | Value |
| --- | --- |
| **Dataset name** | `cfilt/iitb-english-hindi` |
| **Languages** | English (`en`), Hindi (`hi`) |
| **Modality** | Text (parallel corpus) |
| **Format** | Parquet |
| **Size** | ~190 MB (≈ 1.66 M rows) |
| **Splits** | `train`, `validation`, `test` |
| **License** | [CC BY-NC 4.0](https://creativecommons.org/licenses/by-nc/4.0/) |
| **Hugging Face page** | 🔗 [https://huggingface.co/datasets/cfilt/iitb-english-hindi](https://huggingface.co/datasets/cfilt/iitb-english-hindi) |
| **Official site** | [http://www.cfilt.iitb.ac.in/iitb_parallel](http://www.cfilt.iitb.ac.in/iitb_parallel) |

---

### 🧠 Example Record

```json
{
  "en": "Give your application an accessibility workout",
  "hi": "अपने अनुप्रयोग को पहुंचनीयता व्यायाम का लाभ दें"
}
```
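
For a quick look at the data, here is a minimal sketch using the Hugging Face `datasets` library. It assumes the standard nested `translation` schema, which is the same layout the flattening utility below expects; treat it as illustrative rather than definitive.

```python
# Minimal sketch: load the corpus and print one sentence pair.
# Assumes each row is {"translation": {"en": ..., "hi": ...}}.
from datasets import load_dataset

dataset = load_dataset("cfilt/iitb-english-hindi")

pair = dataset["train"][0]["translation"]
print(pair["en"])
print(pair["hi"])
```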
---

🔗 [IITB-English-Hindi-PC GitHub](https://github.com/cfiltnlp/IITB-English-Hindi-PC)

---

### 🧩 Typical Uses

* English↔Hindi machine translation
* Bilingual lexicon extraction
* Cross-lingual representation learning
* Evaluation of translation quality metrics (BLEU, chrF, etc.; a sketch follows below)
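
For the metrics item above, a minimal sketch with the `sacrebleu` package; the hypothesis and reference strings here are placeholders, not corpus output.

```python
# Placeholder data: sacrebleu expects a list of hypothesis strings and a
# list of reference lists (one inner list per reference set).
import sacrebleu

hypotheses = ["Give your application an accessibility workout"]
references = [["Give your application an accessibility workout"]]

bleu = sacrebleu.corpus_bleu(hypotheses, references)
chrf = sacrebleu.corpus_chrf(hypotheses, references)
print(f"BLEU: {bleu.score:.2f}, chrF: {chrf.score:.2f}")
```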
---

### 🧾 Citation

If you use this dataset, please cite:

> **Anoop Kunchukuttan, Pratik Mehta, Pushpak Bhattacharyya**
> *The IIT Bombay English–Hindi Parallel Corpus*
> *Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)*, Miyazaki, Japan.

```bibtex
@inproceedings{kunchukuttan-etal-2018-iit,
  title     = {The IIT Bombay English-Hindi Parallel Corpus},
  author    = {Kunchukuttan, Anoop and Mehta, Pratik and Bhattacharyya, Pushpak},
  booktitle = {Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)},
  year      = {2018},
  address   = {Miyazaki, Japan},
  publisher = {European Language Resources Association (ELRA)},
  url       = {https://aclanthology.org/L18-1548}
}
```
@@ -0,0 +1,9 @@
```bash
#!/bin/bash
# Flatten the IITB English–Hindi corpus into prefixed text.
URL="https://huggingface.co/datasets/cfilt/iitb-english-hindi/tree/main/data"

python utils/get_translation_parquet_dataset.py \
    --url "$URL" \
    --prefix en $'\nEN: ' \
    --prefix hi $'HI: ' \
    --output input.txt
```
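
Given these prefixes and the flattening logic in `get_translation_parquet_dataset.py` below (segments joined by a newline, records separated by a blank line), `input.txt` should start roughly as follows. This is a hand-worked illustration using the README's example pair; note the extra blank lines contributed by the leading `\n` in the English prefix, which a review comment at the end of this page flags.

```text

EN: Give your application an accessibility workout
HI: अपने अनुप्रयोग को पहुंचनीयता व्यायाम का लाभ दें


EN: ...
HI: ...
```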
@@ -0,0 +1 @@
../template/prepare.py
@@ -0,0 +1 @@
../template/utils
@@ -0,0 +1,145 @@
| """Utilities for flattening translation columns in Parquet datasets.""" | ||||||
| from __future__ import annotations | ||||||
|
|
||||||
| import argparse | ||||||
| import json | ||||||
| import os | ||||||
| from typing import Iterable, Sequence, Tuple | ||||||
|
|
||||||
| from get_parquet_dataset import convert_to_json, download_file, find_parquet_links | ||||||
|
|
||||||
|
|
||||||
| def emit_translation_items( | ||||||
| json_path: str, | ||||||
| output_path: str, | ||||||
| language_prefixes: Sequence[Tuple[str, str]], | ||||||
| ) -> None: | ||||||
| """Emit flattened translation rows from ``json_path`` into ``output_path``. | ||||||
|
|
||||||
| Parameters | ||||||
| ---------- | ||||||
| json_path: | ||||||
| Path to the JSON file produced from a Parquet shard. | ||||||
| output_path: | ||||||
| File where the flattened text should be appended. | ||||||
| language_prefixes: | ||||||
| Ordered collection of (language, prefix) tuples. Each translation entry | ||||||
| writes one line per language using the associated prefix when the | ||||||
| translation text is present. | ||||||
| """ | ||||||
| if not language_prefixes: | ||||||
| return | ||||||
|
|
||||||
| with open(json_path, "r", encoding="utf-8") as handle: | ||||||
| records = json.load(handle) | ||||||
|
|
||||||
| if not isinstance(records, list): | ||||||
| return | ||||||
|
|
||||||
| os.makedirs(os.path.dirname(output_path) or ".", exist_ok=True) | ||||||
|
|
||||||
| with open(output_path, "a", encoding="utf-8") as out_handle: | ||||||
| for record in records: | ||||||
| translation = record.get("translation") | ||||||
| if not isinstance(translation, dict): | ||||||
| continue | ||||||
|
|
||||||
| segments = [] | ||||||
| for language, prefix in language_prefixes: | ||||||
| text = translation.get(language) | ||||||
| if not text: | ||||||
| continue | ||||||
| segments.append(f"{prefix}{text}") | ||||||
|
|
||||||
| if segments: | ||||||
| out_handle.write("\n".join(segments) + "\n\n") | ||||||
|
|
||||||
|
|
||||||
| def download_translation_dataset( | ||||||
| url: str, | ||||||
| output_text_file: str, | ||||||
| language_prefixes: Sequence[Tuple[str, str]], | ||||||
| append: bool = False, | ||||||
| ) -> None: | ||||||
| """Download, convert, and flatten translation datasets from ``url``. | ||||||
|
|
||||||
| The function downloads all Parquet files advertised at ``url`` (typically a | ||||||
| Hugging Face dataset folder), converts them to JSON if necessary, and emits | ||||||
| flattened text records to ``output_text_file`` using the provided language | ||||||
| prefixes. | ||||||
| """ | ||||||
| parquet_links = find_parquet_links(url) | ||||||
| download_dir = "./downloaded_parquets" | ||||||
| json_dir = "./json_output" | ||||||
| os.makedirs(download_dir, exist_ok=True) | ||||||
| os.makedirs(json_dir, exist_ok=True) | ||||||
|
|
||||||
| if not append: | ||||||
| open(output_text_file, "w", encoding="utf-8").close() | ||||||
|
|
||||||
| for link in parquet_links: | ||||||
| file_name = link.split("/")[-1].split("?")[0] | ||||||
| parquet_path = os.path.join(download_dir, file_name) | ||||||
| json_path = os.path.join(json_dir, file_name.replace(".parquet", ".json")) | ||||||
|
|
||||||
| if not os.path.exists(parquet_path): | ||||||
| download_file(link, parquet_path) | ||||||
|
|
||||||
| convert_to_json(parquet_path, json_path) | ||||||
| emit_translation_items(json_path, output_text_file, language_prefixes) | ||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
||||||
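
A usage sketch for the function above, mirroring the shell wrapper's arguments; it assumes this module and its `get_parquet_dataset` dependency are on the import path.

```python
# Sketch: flatten the IITB corpus into prefixed text, as the shell
# wrapper above does. Arguments mirror that script's flags.
from get_translation_parquet_dataset import download_translation_dataset

download_translation_dataset(
    url="https://huggingface.co/datasets/cfilt/iitb-english-hindi/tree/main/data",
    output_text_file="input.txt",
    language_prefixes=[("en", "\nEN: "), ("hi", "HI: ")],
)
```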
**Copilot AI** commented on Oct 19, 2025:
The description references 'Europarl-style' which is misleading since this utility is generic and works with any translation dataset in Parquet format, not just Europarl. The description should be updated to reflect its general-purpose nature.
| "Download Europarl-style translation Parquet files and emit prefixed text." | |
| "Download translation Parquet files from any supported dataset and emit prefixed text." |
---

The English prefix includes a leading newline (`$'\nEN: '`) while the Hindi prefix does not (`$'HI: '`). This inconsistency will cause the first English segment to be separated by an extra blank line from previous content, but Hindi segments won't have this leading separation. Either both prefixes should include the leading newline, or neither should.