✨ Forked from mlabonne • 🧠 Supports Gemma 3 • 🚀 Colab Notebook
Official LLM AutoEval fork, updated for compatibility with Google's Gemma 3-based models.
This fork of LLM AutoEval is optimized for evaluating Google's Gemma 3 models (e.g. `gemma-3-27b-it`) using the exact settings described in the official technical report. It ensures reproducible benchmark results and compatibility with standard evaluation suites.
- Hyperparameter tuning for Gemma 3: `temperature=1.0`, `top_p=0.95`, `top_k=64`
- Benchmarks configured to match Google's evaluation (e.g. `gsm8k` 8-shot, CoT)
- Compatibility updates for the new tokenizer, prompts, and padding behavior
- Colab + RunPod-friendly setup
- Option to enable/disable 4-bit quantization
- Option to set limits on evaluation for quick results when experimenting (see the sketch after this list)
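
As a rough illustration of how the quantization and limit options map onto the underlying evaluation harness, here is a minimal sketch assuming lm-evaluation-harness's CLI (`load_in_4bit` and `--limit` are harness features; the model ID and values are examples, not the fork's exact wiring):

```bash
# Sketch: 4-bit quantization plus a sample limit via lm-evaluation-harness.
# load_in_4bit is forwarded to the Hugging Face model loader; --limit caps
# the number of examples per task for quick experiments.
lm_eval --model hf \
  --model_args pretrained=google/gemma-3-27b-it,load_in_4bit=True \
  --tasks arc_challenge \
  --limit 50
```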
- `MODEL_ID`: Enter the model ID from Hugging Face (e.g. `google/gemma-3-27b-it` or a compatible fork).
- `BENCHMARK`:
  - `nous`: AGIEval, GPT4ALL, TruthfulQA, Bigbench (Teknium / Nous-style benchmark sweep)
  - `openllm`: ARC, HellaSwag, MMLU, Winogrande, GSM8K, TruthfulQA (Open LLM Leaderboard set)
  - `lighteval`: Hugging Face's task-level evaluator (e.g. HELM, PIQA, MATH, GSM8K). Use `LIGHTEVAL_TASK` to specify tasks.
- `LIGHTEVAL_TASK`: Comma-separated list of task names (see recommended tasks)
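
A typical run then only needs these variables set before launching the script; the values here are illustrative:

```bash
# Illustrative configuration: MODEL_ID, BENCHMARK, and LIGHTEVAL_TASK are
# the variables documented above; runpod.sh is the entry point.
export MODEL_ID=google/gemma-3-27b-it
export BENCHMARK=lighteval
export LIGHTEVAL_TASK=gsm8k,piqa   # comma-separated task names
bash runpod.sh
```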
Each benchmark triggers specific task suites:
- ✅ `nous`: `agieval_*`, `hellaswag`, `openbookqa`, `winogrande`, `arc_easy`, `arc_challenge`, `boolq`, `piqa`, `truthfulqa_mc`, `bigbench_*`
- ✅ `openllm`: `arc_challenge` (25-shot), `hellaswag` (10-shot, char-len norm), `mmlu` (5-shot, char-len norm), `winogrande` (5-shot, accuracy), `gsm8k` (updated to 8-shot + CoT), `truthfulqa`
- ✅ `lighteval`: any HF-compatible task, with `--use_chat_template` and a custom `LIGHTEVAL_TASK`
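
For reference, a standalone lighteval invocation roughly takes the shape below. This is a hedged sketch: lighteval's CLI has changed across releases, so the flag names and the `suite|task|shots` task-string syntax should be checked against your installed version.

```bash
# Rough shape of a direct lighteval run; treat the flags and the
# task string as indicative, not exact.
lighteval accelerate \
  --model_args "pretrained=google/gemma-3-27b-it" \
  --tasks "lighteval|gsm8k|8|0" \
  --use_chat_template \
  --output_dir ./results
```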
- GPU: Pick high-VRAM GPUs (e.g. A100 80GB, RTX 6000 Ada) for Gemma 27B
- Number of GPUs: Multi-GPU supported with `accelerate` (see the sketch after this list)
- `REPO`: Set to your fork of this repo (the container executes `runpod.sh`)
- `DEBUG`: Keep the pod alive for manual inspection
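
Multi-GPU evaluation typically follows lm-evaluation-harness's `accelerate launch` pattern; a sketch with an example GPU count and model ID:

```bash
# Data-parallel evaluation across 2 GPUs (replace the process count and
# model ID to match your pod).
accelerate launch --num_processes 2 -m lm_eval \
  --model hf \
  --model_args pretrained=google/gemma-3-27b-it \
  --tasks winogrande
```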
All generations use:
- `temperature: 1.0`
- `top_p: 0.95`
- `top_k: 64`
- `do_sample: True` (when applicable)
- `generation_config` passed through `gen_kwargs`
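
These settings translate to lm-evaluation-harness's `--gen_kwargs` flag as comma-separated `key=value` pairs; a sketch, not the fork's exact call:

```bash
# Sampling parameters forwarded to the model's generate() call
# via --gen_kwargs.
lm_eval --model hf \
  --model_args pretrained=google/gemma-3-27b-it \
  --tasks gsm8k \
  --gen_kwargs temperature=1.0,top_p=0.95,top_k=64,do_sample=True
```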
GSM8K uses:
- 8-shot
- CoT prompting
- Evaluation via `exact_match`
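
Conceptually, `exact_match` on GSM8K compares the final number extracted from the model's chain-of-thought against the gold answer. A toy illustration (hypothetical strings; the harness's actual extraction regex differs):

```bash
# Toy exact_match check: pull the last number out of a CoT response and
# compare it to the gold answer.
pred="Step 1: ... Step 2: ... The answer is 42."
gold="42"
[ "$(echo "$pred" | grep -oE '[0-9]+' | tail -1)" = "$gold" ] && echo match
```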
- "700 Killed": Use larger GPU (e.g. A100)
- "triu_tril_cuda_template": See issue #22
- "File not found": Use
DEBUG=trueand check logs
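
For the last case, a debug run keeps the pod alive so the logs can be inspected (the model and benchmark values mirror the quick-start command below):

```bash
# DEBUG=true prevents the pod from shutting down after the run.
DEBUG=true MODEL_ID=your-model BENCHMARK=openllm bash runpod.sh
```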
You can compare results with:
This fork is based on mlabonne/llm-autoeval and includes contributions from:
- dmahan93 – AGIEval integration
- burtenshaw – LightEval support
- Hugging Face – `lighteval` + `transformers`
- EleutherAI – Core evaluation harness
- Teknium, NousResearch – Benchmarks
- Google – Gemma 3 LLMs + Eval configs
Evaluate your Gemma 3 fork now →
```bash
MODEL_ID=your-model BENCHMARK=openllm bash runpod.sh
```