This repository contains the codebase for the method described in our paper: "Learning-From-Mistakes Prompting for Indigenous Language Translation," presented at LoResMT 2024.
Our method, Learning-From-Mistakes Prompting (LFM), is a feedback-driven framework that enhances low-resource machine translation by iteratively refining translations based on error analysis and targeted adjustments. This significantly improves translation performance for indigenous language datasets.
For more details, please refer to the paper:
- LoResMT@ACL 2024: View on ACL Anthology
- arXiv: Read on arXiv
By making this code publicly available, we hope to facilitate further research and encourage collaboration in the field of low-resource language machine translation.
Machine translation for indigenous languages faces unique challenges such as limited datasets, complex linguistic structures, and diverse dialects. This repository provides the code to reproduce the results of our paper, which proposes three key methods:
- KNN-Prompting with Retrieved Prompting Context (RPC): Enhances context by leveraging similar examples.
- Chain-of-Thought (CoT) Prompting: Guides the model through a structured reasoning process before producing translations, enabling more accurate linguistic outputs.
- Learning-From-Mistakes (LFM) Prompting: Iteratively refines translations using feedback-driven error corrections.
Together, these methods bridge gaps in low-resource translation, making machine translation more accessible and effective for indigenous languages.
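As a rough illustration of the LFM idea, the sketch below drafts a translation, asks the model to critique the draft against few-shot examples, and then retranslates with that feedback attached. The `llm` callable and the prompt wording are hypothetical stand-ins, not the repository's actual prompts or API calls.

```python
def lfm_translate(sentence, examples, llm, rounds=2):
    """Iteratively refine a translation using model-generated feedback.

    `llm` is any callable mapping a prompt string to a completion string
    (e.g. a thin wrapper around an OpenAI chat request); `examples` are
    (source, target) pairs used as few-shot context.
    """
    context = "\n".join(f"{src} -> {tgt}" for src, tgt in examples)
    draft = llm(f"{context}\nTranslate: {sentence}")
    for _ in range(rounds):
        # Error analysis: ask the model to list mistakes in the draft.
        feedback = llm(
            f"Given these examples:\n{context}\n"
            f"List the mistakes in this translation of '{sentence}': {draft}"
        )
        # Targeted adjustment: retranslate with the feedback attached.
        draft = llm(
            f"{context}\nTranslate '{sentence}' again, fixing: {feedback}"
        )
    return draft
```

With a real API-backed `llm`, each round costs extra calls, so `rounds` trades cost against refinement depth.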
Our framework is designed to be adaptable to any low-resource language where a parallel corpus can be obtained. A parallel corpus includes aligned linguistic data between a high-resource language (e.g., English or Chinese) and a low-resource language, which can range from sentence-level pairs to word-level mappings, such as bilingual dictionaries.
While our experiments focus on datasets from indigenous languages sourced from Klokah Platform, including Southern Amis, Coastal Amis, and others, the method itself is language-agnostic. This ensures that it is broadly applicable to any low-resource language with minimal parallel resources.
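The retrieval step behind KNN-Prompting illustrates this language-agnosticism: ranking datastore pairs by surface similarity to the query needs no language-specific tooling. The character-bigram Jaccard metric below is an illustrative stand-in, not necessarily the similarity measure used in the paper.

```python
def char_bigrams(text):
    """Set of character bigrams; works for any script, no tokenizer needed."""
    return {text[i:i + 2] for i in range(len(text) - 1)}

def similarity(a, b):
    """Jaccard overlap between the bigram sets of two sentences."""
    sa, sb = char_bigrams(a), char_bigrams(b)
    return len(sa & sb) / len(sa | sb) if sa or sb else 0.0

def retrieve_examples(query, datastore, k=5):
    """Return the k (source, target) pairs whose source side is most
    similar to the query sentence."""
    ranked = sorted(datastore, key=lambda p: similarity(query, p[0]),
                    reverse=True)
    return ranked[:k]
```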
- Minimal Data Requirement: Works with sentence- or word-level alignments between a high-resource and a low-resource language, making it accessible for under-resourced settings.
- Versatile and Scalable: Adaptable to a wide range of low-resource languages, addressing challenges like sparse datasets and complex linguistic structures.
- Universal Applicability: Proven effective on indigenous languages but designed for broader use, empowering translation for any underrepresented language.
This adaptability ensures the framework's relevance across diverse translation tasks, fostering progress in machine translation for underserved linguistic communities worldwide.
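For concreteness, parallel data at either granularity can be normalized into a single retrieval datastore of (source, target) pairs. The data shapes below are illustrative, not the repository's actual schema.

```python
def build_datastore(sentence_pairs, word_dictionary):
    """Merge sentence-level pairs and a bilingual dictionary into one
    list of (source, target) alignments usable for retrieval."""
    store = list(sentence_pairs)
    store.extend(word_dictionary.items())
    return store

# Hypothetical Chinese–Amis data at the two supported granularities.
sentence_pairs = [("你好嗎", "nga'ay ho kiso")]  # sentence-level alignment
word_dictionary = {"謝謝": "aray"}               # word-level mapping
datastore = build_datastore(sentence_pairs, word_dictionary)
```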
To get started, clone this repository:
git clone https://github.com/Rui0828/Learning-From-Mistakes-Prompting.git
cd Learning-From-Mistakes-Prompting
- Python: >= 3.8
- Docker: >= 20.10 (for Docker-based setups)
Build the Docker Compose environment:
docker-compose build
Build the Docker image:
docker build -t lfm-prompting .
- Set up a virtual environment (optional but recommended):

  python -m venv env
  source env/bin/activate  # On Windows: .\env\Scripts\activate

- Install the required dependencies:

  pip install -r requirements.txt

- Export the `OPENAI_API_KEY` environment variable before running the code:

  export OPENAI_API_KEY="your_api_key"

- Alternatively, create a `.env` file in the project root with the following content:

  OPENAI_API_KEY=your_api_key
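Either configuration style can be resolved with a few lines of standard-library Python. This is a minimal sketch of the lookup order (the project may instead rely on a package such as python-dotenv):

```python
import os

def load_api_key(env_file=".env"):
    """Return OPENAI_API_KEY from the environment, falling back to a
    simple KEY=value .env file if one is present."""
    key = os.environ.get("OPENAI_API_KEY")
    if key:
        return key
    if os.path.exists(env_file):
        with open(env_file) as f:
            for line in f:
                line = line.strip()
                if line.startswith("OPENAI_API_KEY="):
                    # Strip optional surrounding quotes from the value.
                    return line.split("=", 1)[1].strip().strip('"')
    raise RuntimeError("OPENAI_API_KEY is not set")
```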
You can run the program in two modes:
- Single Sentence Translation: Translate a single Chinese sentence to the target language.
- Batch Translation and Evaluation: Use the `--batch` option to automatically split the data into a test set (default: 100 sentences) and a datastore, perform batch translations, and evaluate the results with BLEU scores. Results are saved in the `./results` directory.
- Single Translation:
docker-compose run lfm-prompting "{input chinese sentence}"
- Batch Translation and Evaluation:
docker-compose run lfm-prompting --batch
- Single Translation:
docker run --rm -v "$(pwd):/app" -w /app lfm-prompting "{input chinese sentence}"
- Batch Translation and Evaluation:
docker run --rm -v "$(pwd):/app" -w /app lfm-prompting --batch
- Single Translation:
python -m src.main "{input chinese sentence}"
- Batch Translation and Evaluation:
python -m src.main --batch
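To make the batch mode concrete, the sketch below mimics its two mechanical steps under simplified assumptions: a seeded shuffle-and-split into a test set plus a datastore, and a clipped unigram precision as a stand-in for the full BLEU computed by the actual evaluation script.

```python
import random
from collections import Counter

def bleu1(hypothesis, reference):
    """Clipped unigram precision on whitespace tokens -- an illustrative
    stand-in for the full BLEU reported by the evaluation script."""
    hyp, ref = hypothesis.split(), reference.split()
    if not hyp:
        return 0.0
    ref_counts = Counter(ref)
    # Clip each token's count by its count in the reference.
    matched = sum(min(n, ref_counts[tok]) for tok, n in Counter(hyp).items())
    return matched / len(hyp)

def split_corpus(pairs, test_size=100, seed=0):
    """Shuffle the parallel corpus with a fixed seed, carve off the test
    set, and keep the remainder as the retrieval datastore."""
    pairs = list(pairs)
    random.Random(seed).shuffle(pairs)
    return pairs[:test_size], pairs[test_size:]
```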
Below are the results for Southern Amis translation using different methods, reported as BLEU and chrF++ (mean ± standard deviation where available). Our LFM Prompting achieves the best overall performance.
| Methods | BLEU1 (± std) | BLEU2 (± std) | BLEU3 (± std) | chrF++ (± std) |
|---|---|---|---|---|
| Zero-shot | 1.0 | 0.0 | 0.0 | 3.9 |
| 20-shot | 18.0 | 4.9 | 1.9 | 16.3 |
| KNN-Prompting (k=5) | 30.1 | 14.4 | 6.9 | 28.1 |
| KNN-Prompting (k=10) | 33.3 | 16.4 | 8.0 | 34.2 |
| KNN-Prompting with RPC (k=5) | 38.2 ± 2.2 | 10.5 ± 1.8 | 4.3 ± 1.1 | 41.2 ± 1.1 |
| KNN-Prompting with RPC (k=10) | 37.8 ± 2.2 | 12.5 ± 3.2 | 5.2 ± 1.9 | 41.5 ± 1.6 |
| CoT Prompting | 44.4 ± 1.5 | 14.3 ± 0.6 | 5.9 ± 1.1 | 43.5 ± 0.3 |
| LFM Prompting | 44.4 ± 2.7 | 17.5 ± 1.8 | 8.2 ± 1.7 | 44.9 ± 1.9 |
The results show that LFM Prompting achieves the highest BLEU and chrF++ scores overall, clearly outperforming the zero-shot and plain few-shot baselines.
If you encounter issues, here are some common solutions:
- Missing API Key: Ensure your OpenAI API key is correctly set as an environment variable or in a `.env` file.
- Dependency Errors: Reinstall the Python dependencies with `pip install -r requirements.txt`. Consider using a virtual environment.
- Docker Errors: Ensure Docker is installed and running, and check that your Docker version meets the stated requirement.
If you use this code in your research, please cite our paper:
@inproceedings{liao2024learning,
title={Learning-From-Mistakes Prompting for Indigenous Language Translation},
author={Liao, You Cheng and Yu, Chen-Jui and Lin, Chi-Yi and Yun, He-Feng and Wang, Yen-Hsiang and Li, Hsiao-Min and Fan, Yao-Chung},
booktitle={Proceedings of the Seventh Workshop on Technologies for Machine Translation of Low-Resource Languages (LoResMT 2024)},
pages={146--158},
year={2024}
}
This work was conducted during my undergraduate studies at the Natural Language Processing Lab, National Chung Hsing University (NCHU). I would like to thank my advisor and colleagues for their invaluable guidance and support throughout this project.
This project is licensed under the Apache License 2.0. See the LICENSE file for details.
Note: This repository uses OpenAI API for certain functionalities. Users are responsible for ensuring compliance with OpenAI's usage policies.