A two-stage distillation pipeline that turns voice-transcribed transaction
strings ("500 rs on beer 50 rs on candy") into structured JSON. Three
student models are trained on the same 93k teacher-labeled dataset and
published side by side at kartikey31/txn-parser
so downstream apps can pick the size/quality tradeoff that fits.
All variants live in subfolders of one repo:
huggingface.co/kartikey31/txn-parser
| Base model | Subfolder | Params | Status |
|---|---|---|---|
| unsloth/gemma-3-270m-it | gemma-3-270m/ | 270M | ✅ Published (5 quants) |
| HuggingFaceTB/SmolLM2-360M-Instruct | smollm2-360m/ | 360M | ✅ Published (5 quants) |
| Qwen/Qwen3-0.6B | qwen3-0.6b/ | 600M | ✅ Published (5 quants) |
Each subfolder contains a LoRA adapter (adapters/), 5 merged GGUF quants
(gguf/txn-parser-<short>-{F16,Q8_0,Q6_K,Q5_K_M,Q4_K_M}.gguf), and a
model-card README with the exact system prompt to use at inference.
The results below were generated via python scripts/eval_all_quants.py: same eval set, same
grammar, same decoding settings for every model/quant pair.
| Model | Quant | Size | JSON valid | Schema valid | Exact match | Amount exact | Mean ms | P95 ms |
|---|---|---|---|---|---|---|---|---|
| gemma-3-270m | F16 | 543 MB | 99.2% | 99.2% | 51.7% | 85.8% | 1250 | 2019 |
| gemma-3-270m | Q8_0 | 292 MB | 99.7% | 99.7% | 54.7% | 88.0% | 1697 | 3221 |
| gemma-3-270m | Q6_K | 283 MB | 99.7% | 99.7% | 53.7% | 88.0% | 1813 | 3418 |
| gemma-3-270m | Q5_K_M | 260 MB | 99.7% | 99.7% | 51.0% | 84.7% | 1788 | 3444 |
| gemma-3-270m | Q4_K_M | 253 MB | 93.3% | 93.3% | 48.3% | 80.7% | 2660 | 15185 |
| smollm2-360m | F16 | 726 MB | 100.0% | 100.0% | 56.3% | 88.3% | 997 | 1646 |
| smollm2-360m | Q8_0 | 386 MB | 100.0% | 100.0% | 56.3% | 88.3% | 995 | 1589 |
| smollm2-360m | Q6_K | 367 MB | 100.0% | 100.0% | 56.3% | 88.3% | 994 | 1615 |
| smollm2-360m | Q5_K_M | 290 MB | 100.0% | 100.0% | 52.7% | 89.0% | 996 | 1599 |
| smollm2-360m | Q4_K_M | 271 MB | 100.0% | 100.0% | 53.3% | 87.3% | 978 | 1595 |
| qwen3-0.6b | F16 | 1198 MB | 100.0% | 100.0% | 59.0% | 91.0% | 851 | 1373 |
| qwen3-0.6b | Q8_0 | 639 MB | 100.0% | 100.0% | 59.3% | 91.3% | 851 | 1460 |
| qwen3-0.6b | Q6_K | 495 MB | 100.0% | 100.0% | 59.0% | 92.0% | 876 | 1419 |
| qwen3-0.6b | Q5_K_M | 444 MB | 100.0% | 100.0% | 60.0% | 91.3% | 857 | 1415 |
| qwen3-0.6b | Q4_K_M | 397 MB | 100.0% | 100.0% | 60.0% | 90.7% | 885 | 1373 |
Headline: qwen3-0.6b is the new accuracy leader — best
exact match (60.0%) and best amount accuracy (92.0%) across the
board, while also being the fastest family (~850 ms mean, well below
both gemma and smollm at every matching quant). SmolLM2-360M ties at
100% schema valid and stays smaller on disk. Gemma-3-270m Q4_K_M is the
only build that degrades meaningfully (93% schema + 15 s P95 from
grammar backtracking) — use Q5_K_M or higher in the gemma family.
Pick by deployment target:
| If you need… | Use | Why |
|---|---|---|
| Best accuracy | qwen3-0.6b-Q4_K_M (397 MB) | 100% schema, 60% exact, ~885 ms, 3× smaller than F16 |
| Smallest ship size | smollm2-360m-Q4_K_M (271 MB) | 100% schema, 53% exact, ~1 s; 126 MB lighter than qwen |
| Fastest mean latency | qwen3-0.6b-Q8_0 (639 MB) | 851 ms with full Q8 quality if disk isn't a constraint |
gemma-3-270m: best schema_valid is Q8_0 (99.7%, 1697 ms mean).
| Quant | Schema | Exact | Amt | TxnCount | Dup% | Super% | Mean ms | P95 ms |
|---|---|---|---|---|---|---|---|---|
| F16 | 99.2% | 51.7% | 85.8% | 91.7% | 0.0% | 0.0% | 1250 | 2019 |
| Q8_0 | 99.7% | 54.7% | 88.0% | 93.0% | 0.0% | 0.3% | 1697 | 3221 |
| Q6_K | 99.7% | 53.7% | 88.0% | 93.3% | 0.0% | 0.3% | 1813 | 3418 |
| Q5_K_M | 99.7% | 51.0% | 84.7% | 93.7% | 0.0% | 0.7% | 1788 | 3444 |
| Q4_K_M | 93.3% | 48.3% | 80.7% | 85.7% | 0.0% | 0.3% | 2660 | 15185 |
smollm2-360m: best schema_valid is Q6_K (100.0%, 994 ms mean).
| Quant | Schema | Exact | Amt | TxnCount | Dup% | Super% | Mean ms | P95 ms |
|---|---|---|---|---|---|---|---|---|
| F16 | 100.0% | 56.3% | 88.3% | 91.7% | 0.3% | 0.0% | 997 | 1646 |
| Q8_0 | 100.0% | 56.3% | 88.3% | 91.7% | 0.3% | 0.0% | 995 | 1589 |
| Q6_K | 100.0% | 56.3% | 88.3% | 91.3% | 0.3% | 0.0% | 994 | 1615 |
| Q5_K_M | 100.0% | 52.7% | 89.0% | 92.0% | 0.3% | 0.0% | 996 | 1599 |
| Q4_K_M | 100.0% | 53.3% | 87.3% | 90.7% | 0.0% | 0.0% | 978 | 1595 |
qwen3-0.6b: best schema_valid is Q5_K_M (100.0%, 857 ms mean). It ties every other qwen quant on schema validity and edges out F16, Q8_0, and Q6_K by 0.7-1.0 points on exact match (Q4_K_M ties it at 60.0%).
| Quant | Schema | Exact | Amt | TxnCount | Dup% | Super% | Mean ms | P95 ms |
|---|---|---|---|---|---|---|---|---|
| F16 | 100.0% | 59.0% | 91.0% | 93.7% | 0.0% | 0.0% | 851 | 1373 |
| Q8_0 | 100.0% | 59.3% | 91.3% | 94.0% | 0.0% | 0.0% | 851 | 1460 |
| Q6_K | 100.0% | 59.0% | 92.0% | 93.7% | 0.0% | 0.0% | 876 | 1419 |
| Q5_K_M | 100.0% | 60.0% | 91.3% | 93.3% | 0.0% | 0.0% | 857 | 1415 |
| Q4_K_M | 100.0% | 60.0% | 90.7% | 93.7% | 0.0% | 0.0% | 885 | 1373 |
Quantization barely moves the needle for qwen: F16 → Q4_K_M loses only 0.3 points of amount accuracy (91.0% → 90.7%) while shrinking 3×. Safe to ship Q4_K_M.
Full JSON results in eval_results/REPORT.json; per-example predictions
in eval_results/<model>-<quant>.jsonl. Regenerate any time via
python scripts/eval_all_quants.py (re-uses cached evals — only re-runs
ones with missing or --force'd results).
For shipping inside an Android app via llama.cpp JNI bindings, the choice
between the three published models comes down to disk budget, RAM ceiling,
and the kind of phone you're targeting:
| Phone class | Recommended | Disk | RAM at runtime | Why |
|---|---|---|---|---|
| Flagship (≥6 GB RAM, Adreno 7xx / Mali G715) | qwen3-0.6b-Q4_K_M | 397 MB | ~600 MB | 60% exact, ~885 ms; GPU offload via Vulkan available |
| Mid-range (4-6 GB RAM, Adreno 6xx / Mali G610) | smollm2-360m-Q4_K_M | 271 MB | ~400 MB | 100% schema, ~1 s, small enough for any CPU/GPU |
| Budget / older (≤4 GB RAM) | gemma-3-270m-Q5_K_M | 260 MB | ~380 MB | smallest viable build that holds 99.7% schema valid |
Avoid gemma-3-270m-Q4_K_M on mobile: its 93.3% schema-valid rate means roughly
1 in 15 outputs is unparseable, and the 15 s P95 latency from grammar
backtracking is unacceptable on a phone.
These knobs apply to any of the three models. The grammar file is the non-negotiable one: without it, even a well-trained model only emits valid JSON close to 100% of the time, whereas with the grammar every completed output is parseable by construction.
| Param | Recommended | Why |
|---|---|---|
| n_ctx | 1024 | Our prompts top out at ~600 tokens with grammar; larger ctx wastes ~50 MB per 512 extra tokens |
| n_batch | 256 | Prompt-processing chunk; smaller = less RAM spike at TTFT |
| n_ubatch | 256 | Physical batch size; keep equal to n_batch |
| n_threads | 4 | Use the efficient cores on big.LITTLE. More threads = oversubscription = worse latency AND battery |
| n_threads_batch | 4 | Same threads for prompt processing |
| n_gpu_layers | -1 (Adreno 7xx+) / 0 (older) | Vulkan offload; check llama_supports_gpu_offload() at startup and fall back |
| flash_attn | true | If your llama.cpp build supports it; reduces attention compute-buffer memory |
| use_mmap | true | Map the GGUF; don't load it all into RAM. Default in llama.cpp |
| use_mlock | false | Don't pin pages on mobile; let the OS evict if needed |
| temperature | 0.0 | Deterministic: same input always produces same JSON |
| top_k | 1 | Redundant at temp=0, but free safety net |
| repeat_penalty | 1.0 | Disable; saves a few µs per token |
| grammar | load from scripts/grammar.py GBNF | REQUIRED: guarantees parseable JSON. Bake the GBNF into the APK as a raw resource |
| seed | 42 (or any fixed) | Reproducibility for debugging |
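For desktop prototyping before wiring up JNI, the load-time rows of the table map onto keyword arguments of llama-cpp-python's `Llama` constructor. The following is a minimal sketch under stated assumptions: `recommended_params` and `SAMPLING` are hypothetical names, and older llama-cpp-python releases may not accept `n_ubatch` or `flash_attn`. Sampling settings are kept separate because they are passed per request, not at load time.

```python
def recommended_params(gpu_offload: bool, n_threads: int = 4) -> dict:
    """Load-time settings from the table (hypothetical helper, not part of the repo)."""
    return {
        "n_ctx": 1024,
        "n_batch": 256,
        "n_ubatch": 256,               # keep equal to n_batch
        "n_threads": n_threads,
        "n_threads_batch": n_threads,
        "n_gpu_layers": -1 if gpu_offload else 0,
        "flash_attn": True,            # assumption: your build supports it
        "use_mmap": True,
        "use_mlock": False,
        "seed": 42,
    }


# Per-request sampling settings from the same table.
SAMPLING = {"temperature": 0.0, "top_k": 1, "repeat_penalty": 1.0, "max_tokens": 256}
```

With llama-cpp-python installed, the dictionary would be unpacked as `Llama(model_path=..., **recommended_params(gpu_offload=False))`.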
- Load the model once at app start, hold the `Llama` handle. Each cold load is ~300-600 ms (Q4) and reads the whole GGUF from flash; don't do it per request.
- Pin to efficient cores. On Snapdragon 8 Gen 2+, set CPU affinity to the Cortex-A510 / A520 cluster. P-cores are 2× faster but burn 4× the battery for ~5-10% latency improvement on a 270M-600M model, which is a bad trade.
- Don't pre-warm with a long prompt. First `decode()` is slow because the Vulkan kernels are compiled on first use; warm with one short throwaway request at app start (cheaper than warming during a real user request).
- Cap output tokens. Set `max_tokens=256`; our outputs are 50-150 tokens, and the cap prevents runaway generation if the grammar somehow fails.
- Foreground service or WorkManager. Inference takes ~1 s; UI thread is a no-go. Use a coroutine on `Dispatchers.Default` with the result dispatched back to main.
The model was trained with one specific system prompt and the base
model's chat template. Use it exactly — don't paraphrase. Pull from the
model's README on Hugging Face, or copy from
scripts/_lib.py (SYSTEM_PROMPT constant). Skipping
the chat template or changing the prompt text degrades quality sharply.
```kotlin
// llama.cpp Android binding (any wrapper that exposes the C++ API)
val ctx = LlamaContext.builder()
    .model("/data/data/<your.app>/files/txn-parser-smollm2-360m-Q4_K_M.gguf")
    .nCtx(1024).nBatch(256)
    .nThreads(4).nThreadsBatch(4)
    .nGpuLayers(if (vulkanSupported()) -1 else 0)
    .flashAttn(true).useMmap(true).useMlock(false)
    .build()

// GBNF grammar bundled with the app; `use` closes the reader after reading
val grammar = assets.open("transaction.gbnf").bufferedReader().use { it.readText() }

fun extract(userText: String): String = ctx.chatCompletion(
    messages = listOf(
        ChatMessage("system", SYSTEM_PROMPT),  // from _lib.py / HF model README
        ChatMessage("user", userText),
    ),
    temperature = 0f, topK = 1, maxTokens = 256,
    grammar = grammar,
)
```

(Exact API surface depends on which Android binding you use: Maid, mlc, or a hand-rolled JNI wrapper. The param values transfer 1:1.)
One-shot setup — installs everything (torch cu128, training deps, CUDA-built llama-cpp-python, node deps) and pulls the trained models from Hugging Face:
```bash
conda create -n llm-training python=3.11 -y && conda activate llm-training
bash setup.sh                 # full setup
# bash setup.sh --no-models   # skip HF download
# bash setup.sh --cpu-only    # skip CUDA build for llama-cpp-python
```

The script is idempotent: re-running it short-circuits already-satisfied steps. Manual setup instructions are below if you need finer control.
All trained student variants live in subfolders of
kartikey31/txn-parser on HF.
The models/ directory is not tracked in this repo — pull from HF before
running anything past Stage 2.
```bash
# Everything (all 3 base models × 5 quants, lands in models/student-<short>/)
python scripts/download_models.py

# Just one base model
python scripts/download_models.py --model gemma-3-270m

# Just one quant of every model (fastest sanity check)
python scripts/download_models.py --quant Q4_K_M

# Skip the LoRA adapters (saves ~30 MB per model)
python scripts/download_models.py --no-adapters
```

The script mirrors the HF subfolder layout into
models/student-<short>/{adapters,gguf,README.md} so the rest of the
pipeline (eval_all_quants.py, predict_one.py, the viewer) finds
everything in the expected location. Re-running is idempotent — already-
present files are skipped.
```bash
pip install -U "huggingface_hub[cli]" hf_transfer
export HF_HUB_ENABLE_HF_TRANSFER=1   # PowerShell: $env:HF_HUB_ENABLE_HF_TRANSFER = "1"

# All models (mirrors the full repo)
huggingface-cli download kartikey31/txn-parser --repo-type=model --local-dir models

# Just one quant of one model
huggingface-cli download kartikey31/txn-parser \
  smollm2-360m/gguf/txn-parser-smollm2-360m-Q4_K_M.gguf \
  --local-dir .
```

Two-stage distillation:
- Fine-tune a teacher (Gemma 4 E2B, ~5B params) on a small human-supervised dataset (`data/clean/train.jsonl`, ~3k examples).
- Use the fine-tuned teacher to label a much larger synthetic dataset (`data/distill/train.jsonl`, ~30k examples), then fine-tune the student (Gemma 3 270M) on those labels.
The student is the shippable model — small enough for on-device Android inference, accurate enough for the JSON-output task because the teacher did the heavy lifting of demonstrating the right structure across many phrasings.
```bash
# Using the recommended on-device model (smollm2-360m Q4_K_M, 271 MB)
python scripts/predict_one.py \
  --model models/student-smollm2-360m/gguf/txn-parser-smollm2-360m-Q4_K_M.gguf \
  "500 rs on beer 50 rs on candy"

# Or any other variant; see `Published student models` above
python scripts/predict_one.py \
  --model models/student-gemma-3-270m/gguf/txn-parser-gemma-3-270m-Q5_K_M.gguf \
  "do sau rupay ka chai"
```

| | Required for |
|---|---|
| Python 3.11 | All stages |
| Node.js 20+ | Stages 2, 7 |
| NVIDIA GPU, 12+ GB VRAM | Stages 3, 5, 6 (training + teacher inference) |
| CUDA 12.8 toolkit (required for Blackwell sm_120) | Stages 3, 5, 6 |
| C/C++ build tools (MSVC, CMake) | Stage 3 GGUF export, Stage 7 llama-cpp-python |
| DeepSeek API key | Stages 1, 5 |
We use conda for the Python environment (handles CUDA-aware torch cleanly on
Windows) but install pipeline packages with pip since Unsloth and
llama-cpp-python ship the freshest wheels there.
```powershell
# 1. Create + activate a dedicated env with Python 3.11
conda create -n llm-training python=3.11 -y
conda activate llm-training
python -m pip install --upgrade pip

# 2. Base deps (Stages 1, 2 backend, 4)
pip install -r requirements.txt

# 3. Training deps (Stages 3, 5, 6).
#    RTX 50-series / Blackwell is sm_120: REQUIRES torch >= 2.7 built against CUDA 12.8.
#    torch 2.5 / cu124 silently lacks sm_120 kernels; do not use it on Blackwell.
pip install torch==2.8.0 torchvision==0.23.0 torchaudio==2.8.0 --index-url https://download.pytorch.org/whl/cu128
pip install -r requirements-train.txt

# 4. Inference deps (Stages 4, 7): build llama-cpp-python with CUDA support.
#    Requires MSVC Build Tools (or gcc on Linux) + CUDA toolkit 12.8 on PATH.
$env:CMAKE_ARGS = "-DGGML_CUDA=on"   # PowerShell; on bash: export CMAKE_ARGS="-DGGML_CUDA=on"
pip install -r requirements-eval.txt --no-cache-dir

# 5. Node env (Stages 2 + 7)
cd viewer
npm install
cd ..
```

Each new shell needs conda activate llm-training before running any script.
Verify the GPU is visible:
```powershell
python -c "import torch; print(torch.cuda.is_available(), torch.cuda.get_device_name(0) if torch.cuda.is_available() else '')"
```

Why cu128, not cu124. Blackwell (sm_120, RTX 50-series) needs CUDA 12.8+ and torch ≥ 2.7, the first stable release with cu128 wheels. The older cu124 wheels predate sm_120 and will either crash with "no kernel image" or silently fall back to PTX JIT compilation (very slow).
Tests for the amount parser and validator use pytest:
```bash
pip install -r requirements-dev.txt
pytest tests/ --cov=amount_parser --cov=validator
```

| Variable | Used by | Notes |
|---|---|---|
| DEEPSEEK_API_KEY | Stages 1, 5 | Required for any DeepSeek API call |
| PYTHON_BIN | Stage 7 | Python executable for the spawned inference worker. Default: python on Windows, python3 elsewhere. Set if your conda python isn't on PATH. |
| LLAMA_N_GPU_LAYERS | Stage 7 | Layers to offload to GPU in llama-cpp-python. Default: all (set to 0 to force CPU). |
| INFER_TIMEOUT_MS | Stage 7 | Per-request timeout for inference. Default: 120000 (120 s). |
| PORT | Stages 2, 7 | Viewer/playground HTTP port. Default: 3000. |
Set per-shell:
```powershell
$env:DEEPSEEK_API_KEY = "sk-..."
```

| Stage | Script / Service | Inputs | Outputs |
|---|---|---|---|
| 1a | scripts/01_generate_dataset.py | data_gen_prompt.md | data/raw/batch_NN.jsonl |
| 1b | scripts/02_clean_dataset.py | data/raw/*.jsonl | data/clean/{train,eval}.jsonl |
| 2 | viewer/server.js | data/clean/, data/flags.json | Browser viewer at :3000 |
| 3 | scripts/03_train_teacher.py | data/clean/{train,eval}.jsonl | models/teacher/{adapters,gguf}/ |
| 4 | scripts/04_eval.py | A model + data/clean/eval.jsonl | eval_results/<name>.jsonl |
| 5 | scripts/05_generate_distillation_data.py | Trained teacher | data/distill/train.jsonl |
| 6 | scripts/06_train_student.py | data/distill/train.jsonl | models/student/{adapters,gguf}/ |
| 7 | viewer/server.js /playground | Teacher + student GGUFs | Browser playground at :3000/playground |
Every script accepts --help and writes a timestamped log under logs/. Re-running any script is safe — outputs that already exist are skipped (pass --force to overwrite).
Generate 25 batches via DeepSeek, then validate, deduplicate, and split.
```powershell
$env:DEEPSEEK_API_KEY = "sk-..."

# 1a: 25 batches into data/raw/ (idempotent; skips existing batches)
python scripts/01_generate_dataset.py

# 1b: validate + dedup + 90/10 split into data/clean/
python scripts/02_clean_dataset.py
```

Useful flags:

- `01_generate_dataset.py --batches 5`: run only the first 5 batches (smoke test).
- `01_generate_dataset.py --model deepseek-reasoner`: swap the DeepSeek model ID.
- `01_generate_dataset.py --force`: regenerate a batch even if its file exists.
- `02_clean_dataset.py --eval-frac 0.1 --seed 42`: adjust the split.
Inspect data/clean/*.jsonl side-by-side, search, and flag bad examples.
```bash
cd viewer
npm start
# → http://localhost:3000
```

The viewer hot-reads the JSONL files (mtime cache), so re-running Stage 1 doesn't need a server restart. Flags are appended to data/flags.json as an audit log.

Endpoints (if you want to script around it):

- `GET /api/examples?file=train|eval&page=N&pageSize=20&q=…`
- `GET /api/stats?file=train|eval`
- `POST /api/flag` with body `{file, index, reason}`
QLoRA via Unsloth on unsloth/gemma-4-E2B-it.
```bash
python scripts/03_train_teacher.py
```

Spec defaults are baked in: LoRA r=16/α=32, all linear targets, 3 epochs, per-device batch=4 × grad-accum=4, cosine 2e-4, warmup 0.03, bf16, eval every 50 steps, adamw_8bit.

Outputs:

- `models/teacher/adapters/`: LoRA adapter
- `models/teacher/gguf/`: merged Q3_K_M GGUF
- `models/teacher/checkpoints/`: trainer checkpoints (last 2 kept)

Useful flags:

- `--skip-gguf`: train and save adapter, skip the slow GGUF export.
- `--resume`: continue from the latest checkpoint.
- `--force`: retrain even if the adapter already exists.
- `--max-steps 20`: smoke-test the whole pipeline on a handful of steps.
- `--max-seq-length 2048`: if your inputs+outputs get long.
- `--batch-size N --grad-accum M`: 5060 Ti 16GB: `4 × 4`. A100 80GB: try `16 × 1` or `32 × 1`. Effective batch = `N × M`.
Reusable eval script — runs against any model path and writes per-example results.
```bash
# Teacher GGUF (after Stage 3)
python scripts/04_eval.py --model models/teacher/gguf

# Teacher fp16 via the LoRA adapter (better quality, used in Stage 5)
python scripts/04_eval.py --model models/teacher/adapters --name teacher-fp16

# Student (after Stage 6)
python scripts/04_eval.py --model models/student/gguf

# Smoke test on the first 50 examples
python scripts/04_eval.py --model models/teacher/gguf --limit 50

# A100 80GB: push the adapter eval batch high
python scripts/04_eval.py --model models/teacher/adapters --batch-size 32
```

Reports % JSON-valid, % schema-valid, % exact match, and a confusion matrix for category. Per-example results land in eval_results/<model_name>.jsonl.
Stage 4 also reports four validator-derived aggregates: amount_exact (predicted
amounts equal expected as multisets), txn_count_exact, duplicate_rate (fraction
of examples with duplicate_transactions_found), and superseded_amount_used_rate.
Per-example rows in eval_results/<name>.jsonl carry the same fields plus
validation_errors[] for failed rows.
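The multiset comparison behind amount_exact can be sketched in a few lines. This is an illustration, not the code in scripts/04_eval.py: it assumes each output is a dict with a `transactions` list whose items carry an `amount` field, and the function names are hypothetical.

```python
from collections import Counter


def amounts(output: dict) -> Counter:
    """Multiset of amounts in one output (field names are assumed)."""
    return Counter(t["amount"] for t in output.get("transactions", []))


def amount_exact(pred: dict, gold: dict) -> bool:
    # Counter equality ignores order but respects how often each amount occurs.
    return amounts(pred) == amounts(gold)


def txn_count_exact(pred: dict, gold: dict) -> bool:
    return len(pred.get("transactions", [])) == len(gold.get("transactions", []))


def aggregate(pairs: list) -> dict:
    """Fraction of (pred, gold) pairs that pass each check."""
    n = len(pairs)
    return {
        "amount_exact": sum(amount_exact(p, g) for p, g in pairs) / n,
        "txn_count_exact": sum(txn_count_exact(p, g) for p, g in pairs) / n,
    }
```

Comparing multisets rather than lists means a prediction that lists the same transactions in a different order still counts as exact, while one that repeats 500 in place of 50 does not.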
Useful flags:
- `--batch-size N`: examples per forward pass for the transformers/adapter backend. Default 16. A100 80GB: try 32-64. 5060 Ti 16GB: 8-16. The GGUF backend ignores this because `llama.cpp` doesn't natively batch chat completions.
- `--max-tokens N`: generation cap per example (default 512).
- `--n-gpu-layers N` / `--ngl`: GGUF only; `-1` = all (default), `0` = CPU.
- `--n-ctx N`: context window (default 2048).
- `--no-grammar`: GGUF only; disable GBNF grammar-constrained decoding for baseline comparisons (see below).
By default, every GGUF-based inference path (Stage 4 eval, Stage 7
playground, and scripts/predict_one.py) constrains output via a GBNF
grammar derived from the validator's enum constants
(_lib.CATEGORIES/TYPES/CURRENCIES). Pass --no-grammar to disable
for baseline comparisons. The grammar is rebuilt from _lib at every
process start, so adding a category never drifts. See scripts/grammar.py
and the design at docs/superpowers/specs/2026-05-17-grammar-constrained-decoding-design.md.
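To illustrate how an enum list can become a grammar rule, here is a minimal sketch. It is not the code in scripts/grammar.py: `enum_rule` is a hypothetical helper, the category values are placeholders rather than the real `_lib.CATEGORIES`, and the values are assumed to be plain ASCII.

```python
import json


def enum_rule(name: str, values: list) -> str:
    """Build one GBNF rule that accepts exactly the given JSON string values."""
    # The inner json.dumps produces the JSON literal (with its quotes);
    # the outer one escapes that literal for use as a GBNF string terminal.
    alternatives = " | ".join(json.dumps(json.dumps(v)) for v in values)
    return f"{name} ::= {alternatives}"


# Placeholder values, for illustration only.
print(enum_rule("category", ["food", "travel"]))
```

Because the rule is regenerated from the constant list on each run, a newly added value appears in the grammar without any hand-edited grammar file to keep in sync.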
Generate 30k synthetic inputs via DeepSeek, then label each one with the fine-tuned teacher (fp16, not the quantized version).
```powershell
$env:DEEPSEEK_API_KEY = "sk-..."

# Default: run all three phases end to end (inputs -> label -> copy eval)
python scripts/05_generate_distillation_data.py

# Or run a single phase
python scripts/05_generate_distillation_data.py --phase inputs --n-inputs 30000
python scripts/05_generate_distillation_data.py --phase label
python scripts/05_generate_distillation_data.py --phase eval

# Smoke test: label only 50 inputs end-to-end
python scripts/05_generate_distillation_data.py --phase label --limit 50

# A100 80GB: batch hard, ~10x faster than the old single-example loop
python scripts/05_generate_distillation_data.py --phase label --batch-size 32 --max-new-tokens 256
```

Outputs:

- `data/distill/inputs_raw.jsonl`: raw synthetic inputs from DeepSeek (Phase 1 checkpoint, resume-friendly)
- `data/distill/train.jsonl`: teacher-labeled, schema-validated examples (Phase 2)
- `data/distill/failed.jsonl`: teacher outputs that failed JSON/schema checks (for inspection)
- `data/distill/eval.jsonl`: copy of `data/clean/eval.jsonl` (Phase 3)
All three phases are idempotent. Phase 2 uses the teacher's LoRA adapter loaded in fp16 (not the quantized GGUF) — quality matters for distillation labels. Re-running picks up from existing on-disk state via input-string deduplication.
Useful flags (Phase 2):
- `--batch-size N`: inputs per forward pass (transformers backend). Default 16. A100 80GB: 32-64. 5060 Ti 16GB: 8-16. Single biggest perf knob here: 28k labels go from ~20 hours at `batch=1` to ~45 min at `batch=32` on A100.
- `--max-new-tokens N`: default 384. Drop to 256 if outputs fit; saves wall time.
- `--limit N`: only label the first N pending inputs (smoke test).
- `--retry-failed`: re-attempt inputs previously written to `failed.jsonl`.
- `--backend gguf`: use the teacher GGUF (Q3_K_M) via `llama-cpp-python` instead of fp16. Lossier but useful if VRAM is tight or fp16 isn't an option. Sequential, so batch_size is ignored.
- `--ngl N` / `--n-gpu-layers N`: GGUF backend only; `-1` = all layers on GPU.
- `--no-mmap`, `--mlock`, `--n-ctx`, `--n-batch`: passthrough to `llama-cpp-python`.
Validator gate (Phase 2). Phase 2 uses scripts/validator.py to gate teacher
labels: a label must pass the JSON schema AND the semantic checks (amount-in-input,
no-superseded-amount, currency hint, txn count, no unjustified duplicates). Rejected
rows land in data/distill/failed.jsonl with reason ∈ {validation_failed, json_parse_failed, teacher_error} and structured validation.errors[] carrying
machine-readable codes. The failed.jsonl shape changed in this release — delete
or archive the old file before re-running. Re-attempt only semantic-validation
failures with --retry-validation-failed.
Stage 5 has a new optional path for multi-provider input/label generation, gated behind --provider-config. Slice 1 only supports dry-run config validation; real multi-provider execution lands in Slice 2/3.
```bash
# Validate a provider config and see quota allocations
python scripts/05_generate_distillation_data.py \
  --provider-config configs/test_providers.json \
  --dry-run-quota
```

The full config schema is documented in docs/provider_config.md. Two example configs ship:

- `configs/test_providers.json`: fake-only, used by smoke tests.
- `configs/example_providers.json`: realistic shape with DeepSeek / Gemini / local-teacher providers.
A standalone validator probe lets you inspect any JSONL of (input, output) pairs against the semantic validator:
```bash
python scripts/probe_validator.py \
  --input data/distill/train.jsonl \
  --output reports/validator_probe.jsonl
```

The probe reports per-code failure counts (e.g., AMOUNT_NOT_IN_INPUT, SUSPICIOUS_DUPLICATE) and writes an inspectable per-row report.
Without --provider-config, Stage 5 behaves exactly as before.
Slice 2 makes --phase inputs --provider-config <path> --multi-provider actually run with real DeepSeek + Gemini API calls. Set the API keys, point at a config, and run:
```bash
export DEEPSEEK_API_KEY=sk-...
export GOOGLE_API_KEY=...        # or GEMINI_API_KEY
python scripts/05_generate_distillation_data.py \
  --phase inputs \
  --provider-config configs/smoke_real_providers.json \
  --multi-provider
```

The generation loop is threaded per-provider (each provider's threads field in the config), with global dedupe against the existing data/distill/inputs_raw.jsonl and resume safety (re-running picks up at the unique-input count and only generates the remaining target_inputs). New rows include provider metadata: _provider, _model, _batch_id.
Quota is a soft scheduling hint — workers stop when the global accepted-unique total reaches target_inputs, not when a per-provider quota fills. Provider distribution may drift from configured weights based on latency and duplicate rate.
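The stop condition described above (a global accepted-unique total, not per-provider quotas) can be sketched with one thread per provider sharing a lock-protected dict. This is an illustration only: `run_workers` and `fake_provider` are hypothetical, and real providers would make API calls instead of yielding canned strings.

```python
import itertools
import threading


def run_workers(providers: dict, target_inputs: int) -> dict:
    """Collect unique inputs from all providers until the global target is hit."""
    accepted = {}            # input text -> name of the provider that produced it
    lock = threading.Lock()

    def worker(name, stream):
        for text in stream:
            with lock:
                if len(accepted) >= target_inputs:
                    return   # global target reached; per-provider share is ignored
                if text not in accepted:      # global dedupe across providers
                    accepted[text] = name

    threads = [threading.Thread(target=worker, args=item) for item in providers.items()]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return accepted


def fake_provider():
    """Stand-in for a real provider: an endless stream of synthetic inputs."""
    return (f"{i} rs on item {i}" for i in itertools.count())
```

Because the only stop condition is the shared total, whichever provider responds faster or duplicates less ends up contributing more rows, which is the drift from configured weights noted above.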
For tests and CI, set DISTILL_DIR_OVERRIDE=<tmp_path> to redirect writes away from data/distill/. This is a dev/test-only knob and is not surfaced in --help.
Slice 3 makes --phase label --provider-config <path> --multi-provider actually run: real DeepSeek + Gemini label generation, a LocalTeacherProvider wrapping the fp16 Unsloth backend, validator-gated candidate scoring, and an optional repair-on-failure retry loop.
```bash
# Real API smoke (DeepSeek + Gemini)
export DEEPSEEK_API_KEY=sk-...
export GOOGLE_API_KEY=...        # or GEMINI_API_KEY
python scripts/05_generate_distillation_data.py \
  --phase label \
  --provider-config configs/smoke_real_providers.json \
  --multi-provider \
  --limit 10
```

For each input the orchestrator generates label_attempts_per_input candidates across configured providers, hard-rejects any with validator errors, scores survivors (clean validator pass is base 80; +10 count match, +5 no warnings, +5 priority bonus; max 100), and picks the highest-scoring. If all candidates fail and repair is enabled (validation.retry_invalid_with_stricter_prompt: true plus max_repair_attempts > 0), the orchestrator re-prompts the highest-priority provider with a stricter repair prompt containing the parsed amount candidates and validator failure summary.
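The scoring rule above can be sketched as follows. This is a minimal illustration, not the orchestrator's code: the `Candidate` fields, the way the priority bonus is decided, and the tie-breaking rule (the first highest-scoring candidate wins) are all assumptions.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Candidate:
    provider: str
    errors: List[str] = field(default_factory=list)    # validator errors: hard reject
    warnings: List[str] = field(default_factory=list)
    count_match: bool = False       # txn count matches what the validator expects
    priority_bonus: bool = False    # assumed: set for the highest-priority provider


def score(c: Candidate) -> Optional[int]:
    """Return None for hard-rejected candidates, otherwise a score in 80..100."""
    if c.errors:
        return None
    total = 80                      # base score for a clean validator pass
    if c.count_match:
        total += 10
    if not c.warnings:
        total += 5
    if c.priority_bonus:
        total += 5
    return min(total, 100)


def pick_best(candidates: List[Candidate]) -> Optional[Candidate]:
    scored = [(score(c), c) for c in candidates]
    survivors = [(s, c) for s, c in scored if s is not None]
    if not survivors:
        return None                 # the caller would fall through to repair
    return max(survivors, key=lambda pair: pair[0])[1]
```

A `None` result from `pick_best` corresponds to the all-candidates-failed case, where the repair loop takes over if it is enabled.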
Accepted rows in data/distill/train.jsonl carry _source: "multi_provider_label", _provider, _model, _validation_score, _attempts metadata. Failed inputs land in data/distill/failed.jsonl with reason ∈ {all_candidates_failed, repair_exhausted, provider_error} and an attempts[] array of per-provider diagnostics.
Resume-safe: re-running skips inputs already in train.jsonl and (unless --retry-failed) inputs already in failed.jsonl. Use --retry-validation-failed to re-attempt only rows whose attempts include a validator-class failure.
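The resume rule reduces to a set difference over input strings. A minimal sketch, assuming each JSONL row has an `input` field (the field name and the helper `pending_inputs` are hypothetical):

```python
def pending_inputs(inputs, train_rows, failed_rows, retry_failed=False):
    """Inputs that still need a label, in their original order."""
    done = {row["input"] for row in train_rows}
    if not retry_failed:
        # Without --retry-failed, earlier failures are also skipped.
        done |= {row["input"] for row in failed_rows}
    return [text for text in inputs if text not in done]
```

With `retry_failed=True`, rows from failed.jsonl become pending again, while anything already in train.jsonl is always skipped.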
--phase eval is legacy-only — omit --multi-provider for eval. --phase all exits via parser.error.
For local-teacher labeling (requires GPU + trained adapter at models/teacher/adapters), use configs/smoke_local_teacher_providers.json.
Every multi-provider Stage 5 run emits data/distill/metrics.json with per-
provider call counts, accepted/rejected candidates, latency (p50/p95), token
counts, and estimated USD cost. A compact summary is logged at the end of the
run:
```text
=== Stage 5 metrics (phase=label, 100 inputs, 2m41s) ===
Calls: 213 (198 ok, 15 failed)   Accepted: 87   Failed rows: 13
Estimated cost: $0.0342 (rates from configs/prices.json)
Provider         Calls   Ok   Fail%   Acc     Cost    p50    p95   Top failure
gemini_flash       108  105   2.78%    52  $0.0283    812   1421   validation_failed (7)
deepseek_v4_pro    105   93  11.43%    35  $0.0059   1104   2210   provider_error (12)
Repair: 12 attempted, 5 accepted, 7 exhausted
```
Costs are estimates based on configs/prices.json at run time. The script
does not fetch live pricing — update configs/prices.json to match current
provider rates if you care about USD accuracy.
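As a rough illustration of how such estimates can be derived, the sketch below multiplies token counts by per-million-token rates and computes nearest-rank percentiles. The rate keys (`input_per_1m`, `output_per_1m`) are an assumed shape for a prices file, not the documented schema of configs/prices.json, and the script's actual percentile method is not stated in this README.

```python
import math


def estimate_cost_usd(input_tokens: int, output_tokens: int, rate: dict) -> float:
    """Cost in USD given per-1M-token rates (assumed key names)."""
    return (input_tokens * rate["input_per_1m"]
            + output_tokens * rate["output_per_1m"]) / 1_000_000


def percentile(values: list, p: float) -> float:
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    if not values:
        raise ValueError("percentile of empty list")
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]
```

Since the rates are read from a local file, the resulting figure is only as current as that file.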
Metrics fire only for --multi-provider --phase {inputs,label}. Legacy
single-provider phases, --phase eval, --phase all, and --dry-run-quota
produce no metrics file.
```bash
python scripts/06_train_student.py

# Useful variants
python scripts/06_train_student.py --skip-gguf          # iterate without GGUF export
python scripts/06_train_student.py --skip-comparison    # train only, no eval after
python scripts/06_train_student.py --resume             # resume from latest checkpoint
python scripts/06_train_student.py --max-steps 20       # smoke test the whole flow

# A100 80GB: student is tiny, push the batch hard.
# Eval batch must stay small even when train is huge: Trainer materializes
# fp32 logits over Gemma 3's 256k vocab, and a big eval batch OOMs at step
# `eval_steps` regardless of how much memory the train pass uses.
python scripts/06_train_student.py --batch-size 128 --grad-accum 1 --eval-batch-size 8
```

Shares the training loop with Stage 3 (scripts/_training.py). Student-specific defaults:

- Base model: `unsloth/gemma-3-270m-it` (override with `--model`)
- Training data: `data/distill/train.jsonl` (teacher-labeled, ~30k examples)
- LoRA r=32 / α=64 (higher capacity than teacher since the base is much smaller)
- 2 epochs (more data, fewer epochs to avoid overfit)
- Batch 8 × grad-accum 2 (the smaller model fits a larger batch; bump `--batch-size` on a bigger GPU)
- GGUF quant: Q4_K_M; this is the file that ships to Android

Outputs:

- `models/student/adapters/`
- `models/student/gguf/`: Q4_K_M, ~270 MB
- `models/student/checkpoints/`
After training, the script auto-runs scripts/04_eval.py against both teacher
and student GGUFs (on data/clean/eval.jsonl) and prints a side-by-side table
of JSON-valid / schema-valid / exact-match / mean-latency with Δ columns.
Per-model eval details land in eval_results/<name>.jsonl.
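The eval-batch warning in the A100 example comes down to simple arithmetic: a full logits tensor has batch × sequence length × vocabulary entries. The sketch below works it out for fp32, assuming a 262,144-entry vocabulary (256k) and a hypothetical padded sequence length of 512 tokens; the real length depends on your data and `--max-seq-length`.

```python
GEMMA3_VOCAB = 262_144   # 256k entries


def logits_gib(batch: int, seq_len: int, vocab: int = GEMMA3_VOCAB,
               bytes_per_value: int = 4) -> float:
    """Size in GiB of one fp32 logits tensor of shape (batch, seq_len, vocab)."""
    return batch * seq_len * vocab * bytes_per_value / 2**30


print(logits_gib(8, 512))     # eval batch 8
print(logits_gib(128, 512))   # eval batch 128, same size as the train batch
```

Under these assumptions an eval batch of 8 needs about 4 GiB for logits alone, while reusing the train batch of 128 would need about 64 GiB, which is why the eval batch is set separately.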
Same Express server as Stage 2, with an extra /playground page and a pair of long-running Python inference workers (one per model). Each worker holds the GGUF in memory via llama-cpp-python, so requests are warm.
Important: activate the conda env in the same shell you run `npm start` from. The server spawns Python subprocesses with `PYTHON_BIN` (default `python`), and those need to find `llama-cpp-python` plus the `scripts/_lib.py` module.

```bash
conda activate llm-training   # MUST be active in this shell
cd viewer
npm start
# → http://localhost:3000/playground
```

On startup the server scans models/teacher/gguf/ and models/student/gguf/. If a directory is missing or has no .gguf, that worker is skipped and the corresponding option in the UI is disabled, so the playground works even if only one model is trained.
Pick teacher / student / both, paste an input, hit Run (or Ctrl/Cmd + Enter). In "both" mode the two outputs render side-by-side with per-line diff highlighting so divergences between teacher and student JSON jump out. Inference is deterministic (temp 0, top_p 1, default max_tokens=512). History of the last 20 inputs is kept in localStorage.
Endpoints (for scripting):
- `GET /api/models` returns `{ models: { teacher: {ready, gguf}, student: {…} } }`
- `POST /api/infer` with body `{ model: "teacher"|"student"|"both", input, max_tokens? }`
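For scripting against the playground, a request to `/api/infer` can be built with the standard library alone. This is a sketch under assumptions: it presumes the server accepts a JSON body with a `Content-Type: application/json` header and runs on the default port, and the helper names are hypothetical. The response shape is not documented here, so the parsed JSON is returned unchanged.

```python
import json
import urllib.request


def build_infer_request(text, model="both", max_tokens=None,
                        base_url="http://localhost:3000"):
    """Build (but do not send) a POST /api/infer request."""
    payload = {"model": model, "input": text}
    if max_tokens is not None:
        payload["max_tokens"] = max_tokens
    return urllib.request.Request(
        f"{base_url}/api/infer",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def infer(text, **kwargs):
    """Send the request; requires the viewer server to be running."""
    # Client timeout slightly above the default INFER_TIMEOUT_MS (120 s).
    with urllib.request.urlopen(build_infer_request(text, **kwargs), timeout=130) as resp:
        return json.load(resp)
```

For example, `infer("500 rs on beer 50 rs on candy", model="student")` would query only the student worker.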
Tuning env vars (see the table above): PYTHON_BIN, LLAMA_N_GPU_LAYERS, INFER_TIMEOUT_MS, PORT.
```text
.
├── setup.sh                     # one-shot installer + HF model download (Linux/WSL)
├── data_gen_prompt.md           # source prompt used in Stages 1 and 5
├── requirements.txt             # base deps (Stages 1, 2 backend)
├── requirements-train.txt       # Unsloth + training stack (Stages 3, 5, 6)
├── requirements-eval.txt        # llama-cpp-python (Stages 4, 7)
├── data/
│   ├── raw/                     # DeepSeek batches (Stage 1a)
│   ├── clean/                   # train.jsonl, eval.jsonl (Stage 1b)
│   ├── distill/                 # teacher-labeled data (Stage 5)
│   └── flags.json               # bad-example audit log (Stage 2)
├── scripts/
│   ├── _lib.py                  # shared constants, schema, helpers
│   ├── _training.py             # shared SFT + LoRA loop (Stages 3 & 6)
│   ├── 01_generate_dataset.py
│   ├── 02_clean_dataset.py
│   ├── 03_train_teacher.py
│   ├── 04_eval.py
│   ├── 05_generate_distillation_data.py
│   └── 06_train_student.py
├── viewer/
│   ├── server.js                # Express server (Stages 2 + 7)
│   ├── inference_worker.py      # long-running llama-cpp-python worker (Stage 7)
│   └── public/                  # index.html, playground.html, app.js, style.css
├── models/                      # NOT tracked; pulled from kartikey31/txn-parser on HF
│   ├── teacher/{adapters,gguf,checkpoints}/
│   └── student-<base>/{adapters,gguf,README.md}/   # one dir per published base model
├── eval_results/                # per-model evaluation outputs (Stage 4)
└── logs/                        # one timestamped log per script run
```
| Model | JSON-valid | Schema-valid | Exact match |
|---|---|---|---|
| Teacher | > 98% | > 97% | > 85% |
| Student | > 97% | > 95% | > 80% |
If a target is missed after training, propose specific fixes (more data, different LoRA rank, more epochs) rather than declaring success.