TuneForge is a powerful dataset generation tool for training and fine-tuning Large Language Models (LLMs). It converts text content into structured training datasets with various formats optimized for different fine-tuning approaches.
- Multiple Dataset Types: Generate diverse dataset types from a single source
- Format Variety: Support for legacy, standard, modern, and Indian language-specific dataset formats
- Customizable Output: Control sample counts, concurrency, and output formats
- Hugging Face Integration: Direct upload to Hugging Face Hub datasets
- Cross-language Support: Generate multilingual and translation datasets
- TRL Compatibility: Transform data into formats compatible with the popular TRL (Transformer Reinforcement Learning) library
- Token Analysis: Analyze token usage in generated datasets to optimize training costs
```bash
# Clone the repository
git clone https://github.com/SukrutAI/TuneForge.git
cd TuneForge

# Install dependencies with Bun
bun install
```
```bash
# Generate a basic dataset with default settings
bun run index.ts --input ./data/example.pdf --output ./output

# Generate multiple dataset types
bun run index.ts --input ./data --output ./output --type qa rp instruction
```
TuneForge supports multiple dataset types, grouped into four categories:
Legacy (`--dataset-format legacy`):

- `qa`: Question-answer pairs with varying difficulty levels
- `rp`: Role-playing scenarios with system instructions and example conversations
- `classifier`: Text classification samples with categories and explanations
- `multilingual`: Text content in multiple languages
- `parallel`: Parallel text in source and target languages
- `instruction`: Instruction-following samples
- `summarization`: Document summarization pairs
Standard (`--dataset-format standard`):

- `parallel_corpora`: Standardized parallel translation corpora
- `monolingual_text`: Monolingual texts with cultural and origin metadata
- `instruction_tuning`: Standardized instruction-input-output triplets
- `benchmark_evaluation`: Evaluation benchmark datasets
- `domain_specific`: Domain-specialized content
- `web_crawled`: Web-crawled content with metadata
Modern (`--dataset-format modern`):

- `alpaca_instruct`: Alpaca-style instruction-input-output triplets
- `sharegpt_conversations`: Conversational turns in ShareGPT format
- `raw_corpus`: Raw text for continued pre-training
Indic (`--dataset-format indic`):

- `indic_summarization`: Article-summary pairs in Indian languages
- `indic_translation`: English to Indian language translation pairs
- `indic_qa`: Question answering in Indian languages
- `indic_crosslingual_qa`: Cross-lingual QA with English context and questions in Indian languages
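Schemas vary by dataset type. As a rough orientation, here is a minimal sketch of what individual records might look like, written as Python dicts. The field names for `qa` are assumptions rather than TuneForge's documented schema; `alpaca_instruct` follows the widely used Alpaca convention:

```python
# Hypothetical record shapes -- the `qa` field names are assumptions,
# not TuneForge's documented schema.

# A `qa` sample: one question-answer pair plus a difficulty tag.
qa_record = {
    "question": "What does TuneForge generate?",
    "answer": "Structured training datasets for fine-tuning LLMs.",
    "difficulty": "easy",
}

# An `alpaca_instruct` sample follows the standard Alpaca convention:
# instruction, optional input, and output.
alpaca_record = {
    "instruction": "Summarize the following passage.",
    "input": "TuneForge converts text content into training datasets...",
    "output": "TuneForge turns raw text into fine-tuning datasets.",
}
```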
```
Options:
  -V, --version               output the version number
  -i, --input <path>          Input file or directory path (default: "./data")
  -o, --output <path>         Output directory for generated datasets (default: "./output")
  -m, --model <name>          AI model to use for generation (default: "gemini-2.0-flash-lite-preview-02-05")
  -c, --concurrency <number>  Number of chunks to process concurrently (default: "3")
  -t, --type <types...>       Types of datasets to generate (default: ["qa", "rp", "classifier"])
  -s, --samples <number>      Number of samples to generate per chunk (default: "3")
  -f, --format <format>       Output format (json, jsonl, csv, parquet, arrow) (default: "jsonl")
  --trl-format <format>       TRL format (standard, conversational) (default: "standard")
  --trl-type <type>           TRL type (language_modeling, prompt_only, prompt_completion, preference, unpaired_preference, stepwise_supervision) (default: "prompt_completion")
  --dataset-format <format>   Dataset format to use (legacy, standard, modern, indic) (default: "legacy")
  --upload                    Upload datasets to Hugging Face (default: false)
  --repo-id <id>              Hugging Face repository ID for upload
  --private                   Make Hugging Face repository private (default: false)
  --hf-token <token>          Hugging Face token for upload
  --description <text>        Description for Hugging Face dataset
  --include-indic             Include Indian Indic languages in multilingual datasets (default: false)
  --languages <codes>         Comma-separated list of language ISO codes to include (e.g., en,hi,ta,bn)
  -h, --help                  display help for command
```
The Token Analyzer tool reports token usage across your generated datasets, helping you estimate training costs and validate dataset quality.
```
Usage: token-analyze [options] <path>

Options:
  -V, --version        output the version number
  -r, --recursive      Recursively analyze subdirectories (default: false)
  -o, --output <file>  Save analysis results to a JSON file
  -v, --verbose        Show detailed analysis for each field (default: false)
  -s, --summary-only   Show only the summary, not per-file details (default: false)
  -h, --help           display help for command
```
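For a rough, independent cross-check of the analyzer's numbers, token counts are easy to approximate in Python. The sketch below uses `tiktoken`'s `cl100k_base` encoding, which is an assumption (it is not necessarily the tokenizer the analyzer uses), so expect counts to differ between tokenizers:

```python
import json

import tiktoken  # pip install tiktoken

# Assumption: cl100k_base. Swap in the tokenizer that matches your
# target model for counts that reflect real training costs.
enc = tiktoken.get_encoding("cl100k_base")

total = 0
with open("./output/document_qa.jsonl", encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)
        # Sum tokens over every top-level string field in the record.
        total += sum(
            len(enc.encode(value))
            for value in record.values()
            if isinstance(value, str)
        )

print(f"Approximate total tokens: {total}")
```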
```bash
# Generate QA pairs from a PDF file
bun run index.ts --input ./data/document.pdf --output ./output --type qa

# Generate multiple dataset types
bun run index.ts --input ./data --output ./output --type qa rp instruction

# Generate 10 samples of Alpaca format with conversational TRL format
bun run index.ts --input ./data --output ./output --type alpaca_instruct --samples 10 --trl-format conversational

# Process a directory with 5 concurrent chunks and multiple dataset types
bun run index.ts --input ./data --output ./output --concurrency 5 --type qa rp instruction --samples 8

# Generate datasets in Indian languages
bun run index.ts --input ./data --output ./output --type indic_translation indic_summarization --include-indic

# Generate specific language pairs
bun run index.ts --input ./data --output ./output --type parallel --languages en,hi,bn,ta

# Generate and upload datasets to Hugging Face
bun run index.ts --input ./data --output ./output --type instruction --upload --repo-id yourusername/dataset-name --hf-token YOUR_HF_TOKEN

# Analyze token usage in a specific dataset file
bun run src/cli/tokenAnalysis.ts ./output/document_qa.jsonl

# Analyze all datasets in a directory recursively and export results to JSON
bun run src/cli/tokenAnalysis.ts ./output -r -o token-analysis.json

# Get a summary of token usage across all datasets
bun run src/cli/tokenAnalysis.ts ./output -r -s
```
TuneForge supports several output formats:
- `json`: JSON files with one array containing all records
- `jsonl`: JSON Lines with one record per line (default, most compatible)
- `csv`: Comma-separated values
- `parquet`: Apache Parquet columnar storage
- `arrow`: Apache Arrow columnar format
Generated files load directly with the Hugging Face `datasets` library:

```python
from datasets import load_dataset

# Load JSONL format
dataset = load_dataset("json", data_files="./output/document_qa.jsonl")

# Load CSV format
dataset = load_dataset("csv", data_files="./output/document_instruction.csv")
```
For fine-tuning, the `DatasetDict` returned by `load_dataset` exposes the records under a default `train` split, which feeds TRL's `SFTTrainer` directly:

```python
from trl import SFTTrainer
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments

model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")

training_args = TrainingArguments(
    output_dir="./fine-tuned-model",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    learning_rate=2e-5,
    num_train_epochs=3,
)

trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset["train"],  # default split created by load_dataset
    tokenizer=tokenizer,
)

trainer.train()
```
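After training completes, `trainer.save_model()` writes the final weights to the `output_dir` configured above, ready to reload with `AutoModelForCausalLM.from_pretrained`.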
Use the `--dataset-format` option to select the category of dataset formats:
- `legacy`: Original formats like QA pairs and role-playing scenarios
- `standard`: Standardized formats optimized for modern LLM fine-tuning
- `modern`: Popular formats like Alpaca and ShareGPT (best for instruction tuning)
- `indic`: Specialized formats for Indian languages, based on IndicGenBench
For Transformer Reinforcement Learning (TRL) compatibility, configure two settings:

- TRL Format (`--trl-format`):
  - `standard`: Traditional supervised fine-tuning format
  - `conversational`: Turn-based conversation format
- TRL Type (`--trl-type`):
  - `language_modeling`: Basic language modeling
  - `prompt_only`: Only instruction/prompt text
  - `prompt_completion`: Instruction with completion pairs
  - `preference`: Preference-based learning pairs
  - `unpaired_preference`: Unpaired preference data
  - `stepwise_supervision`: Step-by-step supervision
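For reference, TRL's documentation defines concrete record shapes for these combinations. The sketch below shows `prompt_completion` in both standard and conversational form, plus a standard `preference` record; these follow TRL's conventions, though the actual values in TuneForge's output depend on your source content:

```python
# `prompt_completion`, standard format: plain strings.
standard_prompt_completion = {
    "prompt": "The capital of France is",
    "completion": " Paris.",
}

# `prompt_completion`, conversational format: lists of chat messages.
conversational_prompt_completion = {
    "prompt": [{"role": "user", "content": "What is the capital of France?"}],
    "completion": [{"role": "assistant", "content": "Paris."}],
}

# `preference`, standard format: a prompt with chosen and rejected answers.
standard_preference = {
    "prompt": "The capital of France is",
    "chosen": " Paris.",
    "rejected": " Lyon.",
}
```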
Control language output with:

- `--include-indic`: Include Indian languages like Hindi, Bengali, Tamil, etc.
- `--languages`: Specify exact ISO codes (e.g., en,hi,ta,bn,te,kn,ml,mr,pa,gu,ur)
```
TuneForge/
├── src/
│   ├── cli/                    # CLI and processing modules
│   │   ├── index.ts            # Main CLI entry point
│   │   ├── processingEngine.ts # Core processing logic
│   │   └── tokenAnalysis.ts    # Token analysis tool
│   ├── config/                 # Configuration settings
│   ├── formatters/             # Format conversion utilities
│   ├── generators/             # Dataset generation modules
│   ├── parsers/                # Content parsing utilities
│   ├── services/               # External service integrations
│   ├── types/                  # TypeScript type definitions
│   └── utils/                  # Helper utilities
├── data/                       # Input data directory
└── output/                     # Generated dataset output
```
- Bun runtime v1.0.0+
- Access to an AI model API (e.g., Google's Gemini)
- Optional: Hugging Face account and API token for uploads
This project is provided under the MIT License.
This project was created using `bun init` in Bun v1.2.5+. Bun is a fast all-in-one JavaScript runtime.