Add scripts compat with hindi eng dataset #661
@@ -0,0 +1,68 @@
# IITB English–Hindi Parallel Corpus (cfilt/iitb-english-hindi)

### Dataset Overview

The **IIT Bombay English–Hindi Parallel Corpus** is a large-scale bilingual
dataset created by the **Center for Indian Language Technology (CFILT)** at IIT
Bombay. It contains **1.66 million English–Hindi sentence pairs** collected
from multiple open sources and curated over several years for **machine
translation and linguistic research**.

| Field | Value |
| --- | --- |
| **Dataset name** | `cfilt/iitb-english-hindi` |
| **Languages** | English (`en`), Hindi (`hi`) |
| **Modality** | Text (parallel corpus) |
| **Format** | Parquet |
| **Size** | ~190 MB (≈ 1.66 M rows) |
| **Splits** | `train`, `validation`, `test` |
| **License** | [CC BY-NC 4.0](https://creativecommons.org/licenses/by-nc/4.0/) |
| **Hugging Face page** | 🔗 [https://huggingface.co/datasets/cfilt/iitb-english-hindi](https://huggingface.co/datasets/cfilt/iitb-english-hindi) |
| **Official site** | [http://www.cfilt.iitb.ac.in/iitb_parallel](http://www.cfilt.iitb.ac.in/iitb_parallel) |

---

### 🧠 Example Record

```json
{
  "en": "Give your application an accessibility workout",
  "hi": "अपने अनुप्रयोग को पहुंचनीयता व्यायाम का लाभ दें"
}
```
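
For a quick look at the data, here is a minimal sketch using the Hugging Face `datasets` library. It assumes the standard nested `translation` schema, which is the same layout the flattening utility below expects; treat it as illustrative rather than definitive.

```python
# Minimal sketch: load the corpus and print one sentence pair.
# Assumes each row is {"translation": {"en": ..., "hi": ...}}.
from datasets import load_dataset

dataset = load_dataset("cfilt/iitb-english-hindi")

pair = dataset["train"][0]["translation"]
print(pair["en"])
print(pair["hi"])
```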
---

🔗 [IITB-English-Hindi-PC GitHub](https://github.com/cfiltnlp/IITB-English-Hindi-PC)

---

### 🧩 Typical Uses

* English↔Hindi machine translation
* Bilingual lexicon extraction
* Cross-lingual representation learning
* Evaluation of translation quality metrics (BLEU, chrF, etc.; a sketch follows below)
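
For the metrics item above, a minimal sketch with the `sacrebleu` package; the hypothesis and reference strings here are placeholders, not corpus output.

```python
# Placeholder data: sacrebleu expects a list of hypothesis strings and a
# list of reference lists (one inner list per reference set).
import sacrebleu

hypotheses = ["Give your application an accessibility workout"]
references = [["Give your application an accessibility workout"]]

bleu = sacrebleu.corpus_bleu(hypotheses, references)
chrf = sacrebleu.corpus_chrf(hypotheses, references)
print(f"BLEU: {bleu.score:.2f}, chrF: {chrf.score:.2f}")
```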
---

### 🧾 Citation

If you use this dataset, please cite:

> **Anoop Kunchukuttan, Pratik Mehta, Pushpak Bhattacharyya**
> *The IIT Bombay English–Hindi Parallel Corpus*
> *Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)*, Miyazaki, Japan.

```bibtex
@inproceedings{kunchukuttan-etal-2018-iit,
  title     = {The IIT Bombay English-Hindi Parallel Corpus},
  author    = {Kunchukuttan, Anoop and Mehta, Pratik and Bhattacharyya, Pushpak},
  booktitle = {Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)},
  year      = {2018},
  address   = {Miyazaki, Japan},
  publisher = {European Language Resources Association (ELRA)},
  url       = {https://aclanthology.org/L18-1548}
}
```
@@ -0,0 +1,9 @@
```bash
#!/bin/bash
# Flatten the IITB English–Hindi corpus into prefixed text.
URL="https://huggingface.co/datasets/cfilt/iitb-english-hindi/tree/main/data"

python utils/get_translation_parquet_dataset.py \
    --url "$URL" \
    --prefix en $'\nEN: ' \
    --prefix hi $'HI: ' \
    --output input.txt
```
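
Given these prefixes and the flattening logic in `get_translation_parquet_dataset.py` below (segments joined by a newline, records separated by a blank line), `input.txt` should start roughly as follows. This is a hand-worked illustration using the README's example pair; note the extra blank lines contributed by the leading `\n` in the English prefix, which a review comment at the end of this page flags.

```text

EN: Give your application an accessibility workout
HI: अपने अनुप्रयोग को पहुंचनीयता व्यायाम का लाभ दें


EN: ...
HI: ...
```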
@@ -0,0 +1 @@
../template/prepare.py
@@ -0,0 +1 @@
../template/utils
@@ -0,0 +1,145 @@
| """Utilities for flattening translation columns in Parquet datasets.""" | ||||||
| from __future__ import annotations | ||||||
|
|
||||||
| import argparse | ||||||
| import json | ||||||
| import os | ||||||
| from typing import Iterable, Sequence, Tuple | ||||||
|
|
||||||
| from get_parquet_dataset import convert_to_json, download_file, find_parquet_links | ||||||
|
|
||||||
|
|
||||||
| def emit_translation_items( | ||||||
| json_path: str, | ||||||
| output_path: str, | ||||||
| language_prefixes: Sequence[Tuple[str, str]], | ||||||
| ) -> None: | ||||||
| """Emit flattened translation rows from ``json_path`` into ``output_path``. | ||||||
|
|
||||||
| Parameters | ||||||
| ---------- | ||||||
| json_path: | ||||||
| Path to the JSON file produced from a Parquet shard. | ||||||
| output_path: | ||||||
| File where the flattened text should be appended. | ||||||
| language_prefixes: | ||||||
| Ordered collection of (language, prefix) tuples. Each translation entry | ||||||
| writes one line per language using the associated prefix when the | ||||||
| translation text is present. | ||||||
| """ | ||||||
| if not language_prefixes: | ||||||
| return | ||||||
|
|
||||||
| with open(json_path, "r", encoding="utf-8") as handle: | ||||||
| records = json.load(handle) | ||||||
|
|
||||||
| if not isinstance(records, list): | ||||||
| return | ||||||
|
|
||||||
| os.makedirs(os.path.dirname(output_path) or ".", exist_ok=True) | ||||||
|
|
||||||
| with open(output_path, "a", encoding="utf-8") as out_handle: | ||||||
| for record in records: | ||||||
| translation = record.get("translation") | ||||||
| if not isinstance(translation, dict): | ||||||
| continue | ||||||
|
|
||||||
| segments = [] | ||||||
| for language, prefix in language_prefixes: | ||||||
| text = translation.get(language) | ||||||
| if not text: | ||||||
| continue | ||||||
| segments.append(f"{prefix}{text}") | ||||||
|
|
||||||
| if segments: | ||||||
| out_handle.write("\n".join(segments) + "\n\n") | ||||||
|
|
||||||
|
|
||||||
| def download_translation_dataset( | ||||||
| url: str, | ||||||
| output_text_file: str, | ||||||
| language_prefixes: Sequence[Tuple[str, str]], | ||||||
| append: bool = False, | ||||||
| ) -> None: | ||||||
| """Download, convert, and flatten translation datasets from ``url``. | ||||||
|
|
||||||
| The function downloads all Parquet files advertised at ``url`` (typically a | ||||||
| Hugging Face dataset folder), converts them to JSON if necessary, and emits | ||||||
| flattened text records to ``output_text_file`` using the provided language | ||||||
| prefixes. | ||||||
| """ | ||||||
| parquet_links = find_parquet_links(url) | ||||||
| download_dir = "./downloaded_parquets" | ||||||
| json_dir = "./json_output" | ||||||
| os.makedirs(download_dir, exist_ok=True) | ||||||
| os.makedirs(json_dir, exist_ok=True) | ||||||
|
|
||||||
| if not append: | ||||||
| open(output_text_file, "w", encoding="utf-8").close() | ||||||
|
|
||||||
| for link in parquet_links: | ||||||
| file_name = link.split("/")[-1].split("?")[0] | ||||||
| parquet_path = os.path.join(download_dir, file_name) | ||||||
| json_path = os.path.join(json_dir, file_name.replace(".parquet", ".json")) | ||||||
|
|
||||||
| if not os.path.exists(parquet_path): | ||||||
| download_file(link, parquet_path) | ||||||
|
|
||||||
| convert_to_json(parquet_path, json_path) | ||||||
| emit_translation_items(json_path, output_text_file, language_prefixes) | ||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
||||||
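
A usage sketch for the function above, mirroring the shell wrapper's arguments; it assumes this module and its `get_parquet_dataset` dependency are on the import path.

```python
# Sketch: flatten the IITB corpus into prefixed text, as the shell
# wrapper above does. Arguments mirror that script's flags.
from get_translation_parquet_dataset import download_translation_dataset

download_translation_dataset(
    url="https://huggingface.co/datasets/cfilt/iitb-english-hindi/tree/main/data",
    output_text_file="input.txt",
    language_prefixes=[("en", "\nEN: "), ("hi", "HI: ")],
)
```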
**Copilot AI** commented on Oct 19, 2025:
The description references 'Europarl-style' which is misleading since this utility is generic and works with any translation dataset in Parquet format, not just Europarl. The description should be updated to reflect its general-purpose nature.
| "Download Europarl-style translation Parquet files and emit prefixed text." | |
| "Download translation Parquet files from any supported dataset and emit prefixed text." |
---

The English prefix includes a leading newline (`$'\nEN: '`) while the Hindi prefix does not (`$'HI: '`). This inconsistency will cause the first English segment to be separated by an extra blank line from previous content, but Hindi segments won't have this leading separation. Either both prefixes should include the leading newline, or neither should.