This project implements a proof‑of‑concept evaluation‑driven fine‑tuning loop.
Such a loop can be particularly useful for domains where quality requirements are high and failure modes are diverse (e.g., legal drafting, safety moderation, tutoring).
## Features
- **Proper Tinker Integration**: Uses renderers for correct loss masking, async futures for performance, and recommended LR schedules
- **EvalOps Integration**: Optional automatic submission of evaluation results for centralized tracking
- **Pydantic Config Validation**: Type-safe configuration with clear error messages
- **Production-Grade Hyperparameters**: Model-specific LR formula, warmup/cosine scheduling, configurable LoRA rank
- **Async Batching**: Overlapping forward/backward and optimizer steps for faster training (see the sketch after this list)
- **Comprehensive Tests**: 37 unit and integration tests covering all components
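The async batching pattern can be sketched roughly as follows. This is an illustration only: the method names `forward_backward_async()` and `optim_step_async()` come from the implementation notes below, but the exact signatures, argument shapes, and awaiting semantics of the Tinker SDK are not shown in this README, so treat the loop as pseudocode-flavoured Python.

```python
async def train_epoch(training_client, batches, learning_rate):
    """Overlap forward/backward passes with optimizer steps (illustrative sketch).

    `training_client`, the batch format, and the exact Tinker method signatures
    are assumptions; only the method names come from the notes below.
    """
    prev_optim_step = None
    for batch in batches:
        # Submit the forward/backward pass for the current batch right away.
        fwd_bwd = training_client.forward_backward_async(batch)

        # While it runs server-side, let the previous optimizer step finish.
        if prev_optim_step is not None:
            await prev_optim_step

        # Wait for gradients, then queue this batch's optimizer step without
        # blocking the next submission.
        await fwd_bwd
        prev_optim_step = training_client.optim_step_async(
            {"learning_rate": learning_rate}  # hypothetical argument shape
        )

    if prev_optim_step is not None:
        await prev_optim_step
```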
## Usage overview
This project contains two main components:
| File | Description |
|------|-------------|
| `trainer_with_eval.py` | The main script that orchestrates training and evaluation. It connects to Tinker, creates a LoRA training client, runs fine‑tuning, performs evaluations via Inspect AI, and decides whether to launch further training rounds. |
| `eval_loop_config.json` | A sample configuration file specifying the base model, dataset paths, evaluation tasks, thresholds and hyperparameters. |
| `evalops_client.py` | Python SDK for submitting evaluation results to the EvalOps platform. |
| `config_schema.py` | Pydantic schema for configuration validation with hyperparameter tuning. |
| `data_loader.py` | JSONL data loader with proper Tinker renderers, loss masking, validation, and deduplication. |
| `data_selector.py` | Utilities for mining hard examples based on evaluation failures. |
| `hyperparam_utils.py` | Tinker's recommended LR formula and warmup/cosine scheduler. |
| `simple_eval.py` | Minimal working evaluator for demo (replace with Inspect AI for production). |
| `requirements.txt` | Dependencies required to run the script. |
| `tests/` | Unit and integration tests for all components. |
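The expected JSONL training format is not reproduced in this excerpt. As a hedged illustration, a chat-style record like the one below (field names assumed, not taken from `data_loader.py`) is the kind of input a loader that masks loss to assistant outputs would consume:

```python
import json

# Hypothetical chat-format record; the actual field names expected by
# data_loader.py may differ. Only the assistant turn would receive loss.
record = {
    "messages": [
        {"role": "user", "content": "Summarise the indemnity clause in two sentences."},
        {"role": "assistant", "content": "The clause caps liability at the contract value ..."},
    ]
}

# Append one record per line to a JSONL training file.
with open("train.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(record) + "\n")
```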
With EvalOps integration enabled, the client will automatically submit results after each evaluation round, making it easy to track progress over time and compare different fine‑tuning runs.
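For orientation, a submission might look roughly like the following. The class name, constructor arguments, and method are illustrative assumptions rather than the actual interface of `evalops_client.py`, and the metric values are placeholders:

```python
# Illustrative only: names and signatures are assumptions about evalops_client.py.
from evalops_client import EvalOpsClient  # hypothetical import

client = EvalOpsClient(api_key="EVALOPS_API_KEY")  # typically read from env/config
client.submit_results(
    run_id="finetune-round-3",                     # placeholder identifier
    metrics={"accuracy": 0.0, "pass_rate": 0.0},   # placeholder values, not real results
)
```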
## Configuration Options
Key configuration parameters in `eval_loop_config.json`:
| Parameter | Default | Description |
|-----------|---------|-------------|
| `base_model` | - | Model to fine-tune (e.g., "meta-llama/Llama-3.1-8B-Instruct") |
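A minimal sketch of how a Pydantic schema can validate such a config. Only `base_model` appears in the table above, so the remaining fields and defaults are assumptions for illustration, not the actual contents of `config_schema.py`:

```python
from pydantic import BaseModel, Field, ValidationError

class EvalLoopConfig(BaseModel):
    # Only `base_model` is documented above; the other fields are illustrative.
    base_model: str
    train_dataset: str = "data/train.jsonl"
    lora_rank: int = Field(default=32, ge=1)
    max_rounds: int = Field(default=3, ge=1)

try:
    cfg = EvalLoopConfig.model_validate(
        {"base_model": "meta-llama/Llama-3.1-8B-Instruct", "lora_rank": 0}
    )
except ValidationError as err:
    # Pydantic reports exactly which field failed and why (here, lora_rank must be >= 1).
    print(err)
```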
This is a production-ready prototype demonstrating best practices from the Tinker documentation. Future extensions could include:
- **Custom data selection** based on evaluation feedback. For example, automatically mine additional examples from your corpora that match prompts where the model performs poorly (see the sketch below).
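A rough sketch of the idea behind such mining (and behind `data_selector.py`-style utilities): filter the evaluation results for failed prompts and pull matching corpus examples into the next round. The record fields below are assumptions for illustration:

```python
def select_hard_examples(eval_results, corpus, limit=100):
    """Return corpus examples whose prompts the model failed on (illustrative).

    Assumes `eval_results` is a list of dicts with "prompt" and "passed" keys,
    and `corpus` is a list of dicts with a "prompt" key; field names are examples.
    """
    failed_prompts = {r["prompt"] for r in eval_results if not r["passed"]}
    return [ex for ex in corpus if ex["prompt"] in failed_prompts][:limit]
```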
The test suite includes:
- **Integration tests** for the training loop with mocked Tinker/EvalOps services
- **Coverage** for early stopping, LR decay, and error handling
## Implementation Notes
**Based on Tinker Documentation:**
- Uses `renderers.build_supervised_example()` for proper loss masking (trains only on assistant outputs)
- Implements async futures with `forward_backward_async()` and `optim_step_async()` for performance
- Uses `save_weights_for_sampler()` for evaluation (not `save_state()`, which includes optimizer state)
- Supports Tinker's recommended LR formula, `LR = 5e-5 × 10 × (2000/H_m)^P_m`, with model-specific exponents (see the sketch after this list)
- Includes a warmup + cosine decay scheduler for stable training
- Gracefully falls back when tinker-cookbook is unavailable (for testing/development)
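A sketch of the LR formula and schedule described above. Interpreting `H_m` as the model's hidden size and `P_m` as a model-specific exponent is an assumption based on how the formula is written; `hyperparam_utils.py` may differ in its exact parameterisation:

```python
import math

def recommended_lr(hidden_size: int, model_exponent: float) -> float:
    """LR = 5e-5 * 10 * (2000 / H_m) ** P_m, following the formula quoted above."""
    return 5e-5 * 10 * (2000 / hidden_size) ** model_exponent

def lr_at_step(step: int, total_steps: int, base_lr: float, warmup_steps: int = 100) -> float:
    """Linear warmup followed by cosine decay to zero (illustrative schedule)."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

# Example with illustrative values (hidden size 4096, exponent 0.78):
base_lr = recommended_lr(hidden_size=4096, model_exponent=0.78)
print([round(lr_at_step(s, total_steps=1000, base_lr=base_lr), 7) for s in (0, 100, 500, 999)])
```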
## Disclaimer
This code requires an active Tinker API key and appropriate computing quotas to execute training and evaluation. The implementation follows Tinker's documented best practices and is suitable for production use with real evaluation tasks. The simple evaluator is for demo purposes only—replace with Inspect AI integration for production deployments.