# Faster Translate


A high-performance translation library powered by state-of-the-art models. Faster Translate offers optimized inference using CTranslate2 and vLLM backends, providing an easy-to-use interface for applications requiring efficient and accurate translations.

## 🚀 Features

- High-performance inference using CTranslate2 and vLLM backends
- Seamless integration with Hugging Face models
- Flexible API for single sentence, batch, and large-scale translation
- Dataset translation with direct Hugging Face integration
- Multi-backend support for both traditional (CTranslate2) and LLM-based (vLLM) models
- Text normalization for improved translation quality

## đŸ“Ļ Installation

### Basic Installation

```bash
pip install faster-translate
```

### With vLLM Support (Recommended)

```bash
pip install faster-translate[vllm]
```

### All Features

```bash
pip install faster-translate[all]
```
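
After installing, you can run a quick sanity check. This is a minimal sketch that relies only on the package name (`faster-translate`) and import path (`faster_translate`) used throughout this README:

```python
# Minimal post-install check using the standard library plus the
# package/module names shown in this README.
from importlib.metadata import version

print(version("faster-translate"))  # installed package version

from faster_translate import TranslatorModel  # main entry point used below
print(TranslatorModel)
```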

## 🔍 Usage

### Basic Translation

```python
from faster_translate import TranslatorModel

# Initialize with a pre-configured model
translator = TranslatorModel.from_pretrained("banglanmt_bn2en")

# Translate a single sentence
english_text = translator.translate_single("āĻĻā§‡āĻļā§‡ āĻŦāĻŋāĻĻā§‡āĻļāĻŋ āĻ‹āĻŖ āĻ¨āĻŋāĻ¯āĻŧā§‡ āĻāĻ–āĻ¨ āĻŦā§‡āĻļ āĻ†āĻ˛ā§‹āĻšāĻ¨āĻž āĻšāĻšā§āĻ›ā§‡āĨ¤")
print(english_text)

# Translate a batch of sentences
bengali_sentences = [
    "āĻĻā§‡āĻļā§‡ āĻŦāĻŋāĻĻā§‡āĻļāĻŋ āĻ‹āĻŖ āĻ¨āĻŋāĻ¯āĻŧā§‡ āĻāĻ–āĻ¨ āĻŦā§‡āĻļ āĻ†āĻ˛ā§‹āĻšāĻ¨āĻž āĻšāĻšā§āĻ›ā§‡āĨ¤",
    "āĻ°āĻžāĻ¤ āĻ¤āĻŋāĻ¨āĻŸāĻžāĻ° āĻĻāĻŋāĻ•ā§‡ āĻ•āĻžāĻāĻšāĻžāĻŽāĻžāĻ˛ āĻ¨āĻŋāĻ¯āĻŧā§‡ āĻ—ā§āĻ˛āĻŋāĻ¸ā§āĻ¤āĻžāĻ¨ āĻĨā§‡āĻ•ā§‡ āĻĒā§āĻ°āĻžāĻ¨ āĻĸāĻžāĻ•āĻžāĻ° āĻļā§āĻ¯āĻžāĻŽāĻŦāĻžāĻœāĻžāĻ°ā§‡āĻ° āĻ†āĻĄāĻŧāĻ¤ā§‡ āĻ¯āĻžāĻšā§āĻ›āĻŋāĻ˛ā§‡āĻ¨ āĻ˛āĻŋāĻŸāĻ¨ āĻŦā§āĻ¯āĻžāĻĒāĻžāĻ°ā§€āĨ¤"
]
translations = translator.translate_batch(bengali_sentences)
```

### Using Different Model Backends

```python
# Using a CTranslate2-based model
ct2_translator = TranslatorModel.from_pretrained("banglanmt_bn2en")

# Using a vLLM-based model
vllm_translator = TranslatorModel.from_pretrained("bangla_qwen_en2bn")
```

### Loading Models from Hugging Face

```python
# Load a specific model from Hugging Face
translator = TranslatorModel.from_pretrained(
    "sawradip/faster-translate-banglanmt-bn2en-t5",
    normalizer_func="buetnlpnormalizer"
)
```

### Translating Hugging Face Datasets

Translate an entire dataset with a single function call:

```python
translator = TranslatorModel.from_pretrained("banglanmt_en2bn")

# Translate the entire dataset
translator.translate_hf_dataset(
    "sawradip/bn-translation-mega-raw-noisy",
    batch_size=16
)

# Translate specific subsets
translator.translate_hf_dataset(
    "sawradip/bn-translation-mega-raw-noisy",
    subset_name=["google"],
    batch_size=16
)

# Translate a portion of the dataset
translator.translate_hf_dataset(
    "sawradip/bn-translation-mega-raw-noisy",
    subset_name="alt",
    batch_size=16,
    translation_size=0.5  # Translate 50% of the dataset
)
```
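
Conceptually, `translate_hf_dataset` is a convenience wrapper around loading a dataset and feeding its rows through `translate_batch`. The sketch below is only an illustration of that idea, not the library's actual implementation; the column name `text` is an assumption you would adjust for your dataset:

```python
# Illustrative manual equivalent of translate_hf_dataset (rough sketch only).
from datasets import load_dataset

dataset = load_dataset("sawradip/bn-translation-mega-raw-noisy", "alt", split="train")

batch_size = 16
texts = dataset["text"]  # assumed column name; depends on the dataset schema
translated = []
for i in range(0, len(texts), batch_size):
    translated.extend(translator.translate_batch(texts[i : i + batch_size]))
```

The real method additionally handles subsets, split selection, partial translation, output files, and pushing to the Hub, as shown in the following sections.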

### Publishing Translated Datasets

Push translated datasets directly to Hugging Face:

```python
translator.translate_hf_dataset(
    "sawradip/bn-translation-mega-raw-noisy",
    subset_name="alt",
    batch_size=16,
    push_to_hub=True,
    token="your_huggingface_token",
    save_repo_name="your-username/translated-dataset"
)
```

### Enhanced vLLM Hugging Face Dataset Support

When working with datasets that have formatting inconsistencies, the vLLM backend relaxes dataset verification (`verification_mode="no_checks"`) so they can still be loaded and translated:

```python
# Using the vLLM backend with relaxed dataset verification
vllm_translator = TranslatorModel.from_pretrained("bangla_qwen_en2bn")

# verification_mode is set to "no_checks" automatically, which tolerates
# datasets with formatting inconsistencies
vllm_translator.translate_hf_dataset(
    "difficult/dataset-with-formatting-issues",
    batch_size=16
)
```

### Advanced Dataset Translation Options

`translate_hf_dataset` supports numerous parameters for fine-grained control:

```python
# Full parameter example
translator.translate_hf_dataset(
    dataset_repo="example/dataset",           # Hugging Face dataset repository
    subset_name="subset",                     # Optional dataset subset
    split=["train", "validation"],            # Dataset splits to translate (default: ["train"])
    columns=["text", "instructions"],         # Columns to translate
    batch_size=32,                            # Number of texts per batch
    token="hf_token",                         # Hugging Face token for private datasets
    translation_size=0.7,                     # Translate 70% of the dataset
    start_idx=100,                            # Start from the 100th example
    end_idx=1000,                             # End at the 1000th example
    output_format="json",                     # Output format
    output_name="translations.json",          # Output file name
    push_to_hub=True,                         # Push translated dataset to the HF Hub
    save_repo_name="username/translated-data" # Repository name for upload
)
```
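
With `output_format="json"` and `output_name="translations.json"` as above, the results land in a local JSON file. Its exact schema is determined by the library, so the snippet below simply loads and previews whatever was written:

```python
# Preview the generated output file; assumes only that it is valid JSON,
# as implied by output_format="json" above.
import json

with open("translations.json", encoding="utf-8") as f:
    data = json.load(f)

print(type(data))
print(str(data)[:500])  # first few hundred characters of the payload
```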

## 🌐 Supported Models

| Model ID | Source Language | Target Language | Backend | Description |
|----------|----------------|----------------|---------|-------------|
| `banglanmt_bn2en` | Bengali | English | CTranslate2 | BanglaNMT model from BUET |
| `banglanmt_en2bn` | English | Bengali | CTranslate2 | BanglaNMT model from BUET |
| `bangla_mbartv1_en2bn` | English | Bengali | CTranslate2 | MBart-based translation model |
| `bangla_qwen_en2bn` | English | Bengali | vLLM | Qwen-based translation model |

## 🛠ī¸ Advanced Configuration

### Custom Sampling Parameters for vLLM Models

```python
from vllm import SamplingParams

# Create custom sampling parameters
sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.9,
    max_tokens=512
)

# Initialize translator with custom parameters
translator = TranslatorModel.from_pretrained(
    "bangla_qwen_en2bn", 
    sampling_params=sampling_params
)
```
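
Assuming the unified interface shown earlier also applies to vLLM-backed models, the translator can then be used in the usual way:

```python
# Translate with the vLLM-backed model and the custom sampling parameters
bengali_text = translator.translate_single(
    "Foreign loans are currently a major topic of discussion in the country."
)
print(bengali_text)
```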

## đŸ’Ē Contributors

See the full list of contributors on the GitHub repository.

## 📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

## 📚 Citation

If you use Faster Translate in your research, please cite:

```bibtex
@software{faster_translate,
  author = {Sawradip Saha and Contributors},
  title = {Faster Translate: High-Performance Machine Translation Library},
  url = {https://github.com/sawradip/faster-translate},
  year = {2024},
}
```
