68 changes: 68 additions & 0 deletions data/iitb-english-hindi/README.md
@@ -0,0 +1,68 @@
# IITB English–Hindi Parallel Corpus (cfilt/iitb-english-hindi)

### Dataset Overview

The **IIT Bombay English-Hindi Parallel Corpus** is a large-scale bilingual
dataset created by the **Center for Indian Language Technology (CFILT)** at IIT
Bombay. It contains **1.66 million English–Hindi sentence pairs** collected
from multiple open sources and curated over several years for **machine
translation and linguistic research**.

| Field | Value |
| --------------------- | ----------------------------------------------------------------------------------------------------------------------- |
| **Dataset name** | `cfilt/iitb-english-hindi` |
| **Languages** | English (`en`), Hindi (`hi`) |
| **Modality** | Text (parallel corpus) |
| **Format** | Parquet |
| **Size** | ~190 MB (≈ 1.66 M rows) |
| **Splits** | `train`, `validation`, `test` |
| **License** | [CC BY-NC 4.0](https://creativecommons.org/licenses/by-nc/4.0/) |
| **Hugging Face page** | 🔗 [https://huggingface.co/datasets/cfilt/iitb-english-hindi](https://huggingface.co/datasets/cfilt/iitb-english-hindi) |
| **Official site** | [http://www.cfilt.iitb.ac.in/iitb_parallel](http://www.cfilt.iitb.ac.in/iitb_parallel) |

---

### 🧠 Example Record

```json
{
"en": "Give your application an accessibility workout",
"hi": "अपने अनुप्रयोग को पहुंचनीयता व्यायाम का लाभ दें"
}
```
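
A minimal loading sketch, assuming the Hugging Face `datasets` library is installed; the per-language field names match the example record above, and the `translation` wrapper matches what the flattening utility in this PR expects:

```python
# Minimal loading sketch (assumes the `datasets` library is installed).
from datasets import load_dataset

ds = load_dataset("cfilt/iitb-english-hindi")
print(ds)  # DatasetDict with train / validation / test splits

# Each row is expected to carry a `translation` dict keyed by language code,
# which is the layout the flattening utility below relies on.
pair = ds["train"][0]["translation"]
print(pair["en"], "->", pair["hi"])
```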

---

🔗 **Source code:** [IITB-English-Hindi-PC GitHub](https://github.com/cfiltnlp/IITB-English-Hindi-PC)

---

### 🧩 Typical Uses

* English↔Hindi machine translation
* Bilingual lexicon extraction
* Cross-lingual representation learning
* Evaluation of translation quality metrics such as BLEU and chrF (see the scoring sketch below)
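
For the last item, a minimal scoring sketch, assuming the `sacrebleu` package is installed; the hypothesis and reference strings below are placeholders, not real system output:

```python
# Minimal scoring sketch (assumes the `sacrebleu` package is installed).
from sacrebleu.metrics import BLEU, CHRF

# Placeholder system outputs and references; in practice the references would
# come from the corpus's test split and the hypotheses from an MT system.
hypotheses = ["अपने अनुप्रयोग को पहुंचनीयता व्यायाम का लाभ दें"]
references = [["अपने अनुप्रयोग को पहुंचनीयता व्यायाम का लाभ दें"]]

print(BLEU().corpus_score(hypotheses, references))   # corpus-level BLEU
print(CHRF().corpus_score(hypotheses, references))   # corpus-level chrF
```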

---

### 🧾 Citation

If you use this dataset, please cite:

> **Anoop Kunchukuttan, Pratik Mehta, Pushpak Bhattacharyya**
> *The IIT Bombay English–Hindi Parallel Corpus*
> *Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)*, Miyazaki, Japan.

```bibtex
@inproceedings{kunchukuttan-etal-2018-iit,
title = {The IIT Bombay English-Hindi Parallel Corpus},
author = {Kunchukuttan, Anoop and Mehta, Pratik and Bhattacharyya, Pushpak},
booktitle = {Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)},
year = {2018},
address = {Miyazaki, Japan},
publisher = {European Language Resources Association (ELRA)},
url = {https://aclanthology.org/L18-1548}
}
```

9 changes: 9 additions & 0 deletions data/iitb-english-hindi/get_dataset.sh
@@ -0,0 +1,9 @@
#!/bin/bash
URL="https://huggingface.co/datasets/cfilt/iitb-english-hindi/tree/main/data"

python utils/get_translation_parquet_dataset.py \
--url "$URL" \
--prefix en $'\nEN: ' \
--prefix hi $'HI: ' \

> **Copilot AI** (Oct 19, 2025): The English prefix includes a leading newline (`$'\nEN: '`) while the Hindi prefix does not (`$'HI: '`). This inconsistency causes each English segment to be separated from the preceding content by an extra blank line, while Hindi segments get no such leading separation. Either both prefixes should include the leading newline, or neither should.
>
> Suggested change: `--prefix hi $'HI: ' \` → `--prefix hi $'\nHI: ' \`

--output input.txt
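
With these prefixes, each record emitted to `input.txt` should look roughly like the sketch below (one line per language, records separated by a blank line); note that the leading newline in the English prefix, flagged in the review comment above, adds an extra blank line before each `EN:` line:

```text
EN: Give your application an accessibility workout
HI: अपने अनुप्रयोग को पहुंचनीयता व्यायाम का लाभ दें
```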

1 change: 1 addition & 0 deletions data/iitb-english-hindi/prepare.py
1 change: 1 addition & 0 deletions data/iitb-english-hindi/utils
145 changes: 145 additions & 0 deletions data/template/utils/get_translation_parquet_dataset.py
@@ -0,0 +1,145 @@
"""Utilities for flattening translation columns in Parquet datasets."""
from __future__ import annotations

import argparse
import json
import os
from typing import Iterable, Sequence, Tuple

from get_parquet_dataset import convert_to_json, download_file, find_parquet_links


def emit_translation_items(
json_path: str,
output_path: str,
language_prefixes: Sequence[Tuple[str, str]],
) -> None:
"""Emit flattened translation rows from ``json_path`` into ``output_path``.

Parameters
----------
json_path:
Path to the JSON file produced from a Parquet shard.
output_path:
File where the flattened text should be appended.
language_prefixes:
Ordered collection of (language, prefix) tuples. Each translation entry
writes one line per language using the associated prefix when the
translation text is present.
"""
if not language_prefixes:
return

with open(json_path, "r", encoding="utf-8") as handle:
records = json.load(handle)

if not isinstance(records, list):
return

os.makedirs(os.path.dirname(output_path) or ".", exist_ok=True)

with open(output_path, "a", encoding="utf-8") as out_handle:
for record in records:
translation = record.get("translation")
if not isinstance(translation, dict):
continue

segments = []
for language, prefix in language_prefixes:
text = translation.get(language)
if not text:
continue
segments.append(f"{prefix}{text}")

if segments:
out_handle.write("\n".join(segments) + "\n\n")


def download_translation_dataset(
url: str,
output_text_file: str,
language_prefixes: Sequence[Tuple[str, str]],
append: bool = False,
) -> None:
"""Download, convert, and flatten translation datasets from ``url``.

The function downloads all Parquet files advertised at ``url`` (typically a
Hugging Face dataset folder), converts them to JSON if necessary, and emits
flattened text records to ``output_text_file`` using the provided language
prefixes.
"""
parquet_links = find_parquet_links(url)
download_dir = "./downloaded_parquets"
json_dir = "./json_output"
os.makedirs(download_dir, exist_ok=True)
os.makedirs(json_dir, exist_ok=True)

if not append:
open(output_text_file, "w", encoding="utf-8").close()

for link in parquet_links:
file_name = link.split("/")[-1].split("?")[0]
parquet_path = os.path.join(download_dir, file_name)
json_path = os.path.join(json_dir, file_name.replace(".parquet", ".json"))

if not os.path.exists(parquet_path):
download_file(link, parquet_path)

convert_to_json(parquet_path, json_path)
emit_translation_items(json_path, output_text_file, language_prefixes)



> **Copilot AI** (Oct 19, 2025): Extra blank line before the function definition violates PEP 8, which recommends exactly two blank lines between top-level function definitions. Suggested change: remove the extra blank line.

def parse_language_prefixes(prefix_args: Iterable[Tuple[str, str]]) -> Sequence[Tuple[str, str]]:
"""Validate and normalize CLI ``--prefix`` arguments."""
prefixes: list[Tuple[str, str]] = []
for language, prefix in prefix_args:
if not language:
raise ValueError("Language code for --prefix cannot be empty")
prefixes.append((language, prefix))
return prefixes


def main() -> None:
parser = argparse.ArgumentParser(
description=(
"Download Europarl-style translation Parquet files and emit prefixed text."

> **Copilot AI** (Oct 19, 2025): The description references "Europarl-style", which is misleading since this utility is generic and works with any translation dataset in Parquet format, not just Europarl. Suggested change: "Download translation Parquet files from any supported dataset and emit prefixed text."

)
)
parser.add_argument(
"--url",
required=True,
help="Dataset folder URL listing the Parquet shards (e.g. Hugging Face tree view).",
)
parser.add_argument(
"-o",
"--output",
default="input.txt",
help="Where to write the flattened text output.",
)
parser.add_argument(
"--prefix",
nargs=2,
action="append",
metavar=("LANG", "PREFIX"),
required=True,
help="Language/prefix pairs like --prefix bg 'BG: ' --prefix cs 'CS: '.",
)
parser.add_argument(
"--append",
action="store_true",
help="Append to the output file instead of overwriting it.",
)
args = parser.parse_args()

language_prefixes = parse_language_prefixes(args.prefix)
download_translation_dataset(
args.url,
args.output,
language_prefixes,
append=args.append,
)


if __name__ == "__main__":
main()