
Milvus Segment Generator



Multi-language text segmentation library using the Gemma tokenizer from HuggingFace Transformers.


Table of contents

  • Project description
  • Who this project is for
  • Project dependencies
  • Instructions for use
  • Contributing guidelines
  • Additional documentation
  • How to get help
  • Terms of use


Project description

Milvus Segment Generator helps you tokenize and segment text into fixed-size chunks with character-level span information. It supports multiple languages (Tibetan, English, Chinese) with language-specific delimiter handling and token post-processing rules.

Features

  • Multi-language support: Tibetan, English, and Chinese
  • Gemma tokenizer: Uses HuggingFace Transformers' Gemma model tokenizer
  • Language-specific rules: Custom delimiters and token merging for each language
  • Character spans: Returns precise character offsets for each segment
  • JSON export: Save segmentation results to JSON format

Project dependencies

Before using Milvus Segment Generator, ensure you have:

  • Python 3.8 or higher
  • pip package manager
  • HuggingFace account (for downloading Gemma model tokenizer)

Instructions for use

Installation

  1. Clone the repository:

     git clone https://github.com/OpenPecha/milvus_segment_generator.git
     cd milvus_segment_generator

  2. Install dependencies:

     pip install -e .

     This will install:

     • transformers>=4.30.0 - HuggingFace Transformers library
     • torch>=2.0.0 - PyTorch (required by transformers)

  3. (Optional) For development, install dev dependencies:

     pip install -e ".[dev]"

Quick Start

from milvus_segment_generator import segment_text, segment_text_to_json

# Segment Tibetan text
tibetan_text = "བཅོམ་ལྡན་འདས། དེ་བཞིན་གཤེགས་པ།"
spans = segment_text(tibetan_text, lang="tibetan", segment_size=2000)
print(spans)
# [{"span": {"start": 0, "end": 15}}, {"span": {"start": 15, "end": 30}}]

# Save to JSON file
segment_text_to_json(
    tibetan_text,
    lang="bo",
    output_path="output/segments.json",
    segment_size=2000
)
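The character offsets in each span can be used to slice the original string back into its segments. A minimal sketch, using hypothetical offsets in place of real tokenizer output (real offsets come from segment_text):

```python
# Hypothetical spans for illustration; real values come from segment_text().
# Each span is a half-open [start, end) character range, and spans are
# contiguous, as in the example output above.
text = "First sentence. Second sentence."
spans = [
    {"span": {"start": 0, "end": 15}},
    {"span": {"start": 15, "end": 32}},
]

# Slice the original text with each span's offsets.
segments = [text[s["span"]["start"]:s["span"]["end"]] for s in spans]
```

Because the spans are contiguous, joining the segments reproduces the original text exactly.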

Supported Languages

  • Tibetan: tibetan, bo
  • English: english, en
  • Chinese: chinese, zh
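Both the full name and the short code select the same language. A small helper can normalize user input to the short codes listed above; this is an illustrative sketch, not the library's own API:

```python
# Alias table built from the language codes listed above; the library's
# internal normalization may differ.
LANG_ALIASES = {
    "tibetan": "bo", "bo": "bo",
    "english": "en", "en": "en",
    "chinese": "zh", "zh": "zh",
}

def normalize_lang(lang: str) -> str:
    """Map a language name or code to its short code, case-insensitively."""
    try:
        return LANG_ALIASES[lang.lower()]
    except KeyError:
        raise ValueError(f"Unsupported language: {lang!r}")
```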

API Reference

segment_text(text, lang, segment_size=1990)

Tokenize and segment text into chunks.

Parameters:

  • text (str): Input text to segment
  • lang (str): Language code
  • segment_size (int): Maximum tokens per segment (default: 1990)

Returns:

  • List of dictionaries with span containing start and end character offsets

segment_text_to_json(text, lang, output_path, segment_size=1990)

Segment text and save to JSON file.

Parameters:

  • text (str): Input text to segment
  • lang (str): Language code
  • output_path (str | Path): Output file path
  • segment_size (int): Maximum tokens per segment (default: 1990)

Returns:

  • Path object pointing to the created JSON file
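The written file can be read back with the standard json module. A sketch that round-trips a sample file and checks that spans are contiguous, as in the documented output (the sample data is hypothetical):

```python
import json
from pathlib import Path

# Hypothetical sample mirroring the documented span structure.
sample = [{"span": {"start": 0, "end": 5}}, {"span": {"start": 5, "end": 9}}]
path = Path("segments.json")
path.write_text(json.dumps(sample, ensure_ascii=False), encoding="utf-8")

# Load the file back and verify each segment starts where the last ended.
spans = json.loads(path.read_text(encoding="utf-8"))
contiguous = all(
    spans[i]["span"]["end"] == spans[i + 1]["span"]["start"]
    for i in range(len(spans) - 1)
)
```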

Troubleshooting

| Issue | Solution |
| --- | --- |
| ImportError: No module named 'transformers' | Install transformers: pip install transformers torch |
| HuggingFace authentication error | Log in to HuggingFace with huggingface-cli login, or set the HF_TOKEN environment variable |
| ValueError: No delimiter found within window | The text contains no delimiters within segment_size tokens. Add appropriate punctuation or increase segment_size. |
| Model download is slow | The first run downloads the Gemma tokenizer (~500 MB); subsequent runs use the cached copy. |
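The "No delimiter found within window" error can be caught early with a pre-check on the input. A rough sketch with assumed delimiter sets; the library's actual per-language delimiter rules are not documented here and may differ:

```python
# Assumed delimiter sets per language code; illustrative only.
DELIMITERS = {
    "bo": "།",      # Tibetan shad
    "en": ".!?",    # English sentence punctuation
    "zh": "。！？",  # Chinese full-width punctuation
}

def has_delimiter(text: str, lang: str) -> bool:
    """Return True if the text contains at least one assumed delimiter."""
    return any(ch in text for ch in DELIMITERS.get(lang, ""))
```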

Environment Variables

  • HF_TOKEN: HuggingFace API token for model access
  • TRANSFORMERS_CACHE: Directory for caching downloaded models (default: ~/.cache/huggingface)
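Both variables can also be set from Python before the tokenizer is first loaded. A brief sketch; the token value shown is a placeholder, not a real credential:

```python
import os

# Configure the model cache location before any transformers import.
os.environ.setdefault(
    "TRANSFORMERS_CACHE", os.path.expanduser("~/.cache/huggingface")
)
# HF_TOKEN must be a real token from huggingface.co; "hf_..." is a placeholder.
os.environ.setdefault("HF_TOKEN", "hf_...")

cache_dir = os.environ["TRANSFORMERS_CACHE"]
```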

Contributing guidelines

If you'd like to help out, check out our contributing guidelines.

Additional documentation

For more information:

How to get help

  • File an issue.
  • Email us at openpecha[at]gmail.com.
  • Join our Discord.

Terms of use

Milvus Segment Generator is licensed under the MIT License.
