🧐 LLM AutoEval: Gemma3 Edition

Forked from mlabonne • 🧠 Supports Gemma 3 • 🚀 Colab Notebook

Official LLM AutoEval fork, updated for compatibility with Google's Gemma 3-based models.

Open In Colab

🔍 Overview

This fork of LLM AutoEval is optimized for evaluating Google's Gemma 3 models (e.g. gemma-3-27b-it) using the exact settings described in the official technical report. It ensures reproducible benchmark results and compatibility with standard evaluation suites.

Key Enhancements

  • Hyperparameter tuning for Gemma 3: temperature=1.0, top_p=0.95, top_k=64
  • Benchmarks configured to match Google's evaluation (e.g. gsm8k 8-shot, CoT)
  • Compatibility updates for new tokenizer, prompts, and padding behavior
  • Colab + RunPod-friendly setup
  • Option to enable/disable 4-bit quantization
  • Option to cap the number of evaluation samples for quick results when experimenting (see the sketch after this list)
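
Both toggles are wired up in this fork's configuration (see runpod.sh and the Colab notebook); underneath they correspond to standard lm-evaluation-harness options. A minimal sketch of the raw CLI equivalents, using the example model ID from this README (the fork's own variable names may differ):

  # Hedged sketch: 4-bit loading plus a sample cap for a quick smoke test
  lm_eval --model hf \
    --model_args pretrained=google/gemma-3-27b-it,load_in_4bit=True \
    --tasks arc_challenge --num_fewshot 25 \
    --limit 100   # evaluate only the first 100 examples of the task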

⚡ Quick Start

Evaluation

  • MODEL_ID: Enter the model ID from Hugging Face (e.g. google/gemma-3-27b-it or a compatible fork).
  • BENCHMARK:
    • nous: AGIEval, GPT4ALL, TruthfulQA, Bigbench (Teknium / Nous-style benchmark sweep)
    • openllm: ARC, HellaSwag, MMLU, Winogrande, GSM8K, TruthfulQA (Open LLM Leaderboard set)
    • lighteval: Hugging Face's task-level evaluator (e.g. HELM, PIQA, MATH, GSM8K). Use LIGHTEVAL_TASK to specify tasks.
  • LIGHTEVAL_TASK: Comma-separated list of task names (see recommended tasks)
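
Putting these together, a typical run sets them as environment variables in front of runpod.sh. The model ID and lighteval task names below are placeholders; use the recommended-tasks list for the exact strings:

  # Open LLM Leaderboard-style sweep
  MODEL_ID=google/gemma-3-27b-it BENCHMARK=openllm bash runpod.sh

  # Selected lighteval tasks only
  MODEL_ID=google/gemma-3-27b-it BENCHMARK=lighteval LIGHTEVAL_TASK="gsm8k,math" bash runpod.sh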

Benchmarks Implemented in runpod.sh

Each benchmark triggers specific task suites:

  • nous:

    • agieval_*
    • hellaswag, openbookqa, winogrande, arc_easy, arc_challenge, boolq, piqa
    • truthfulqa_mc
    • bigbench_*
  • openllm:

    • arc_challenge (25-shot)
    • hellaswag (10-shot, Char-Len norm)
    • mmlu (5-shot, Char-Len norm)
    • winogrande (5-shot, Accuracy)
    • gsm8k (updated to 8-shot + CoT)
    • truthfulqa
  • lighteval: Any HF-compatible task with --use_chat_template and custom LIGHTEVAL_TASK
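
Each suite resolves to lm-evaluation-harness task names, invoked with its own shot count. As a hedged sketch, two of the openllm entries run by hand outside runpod.sh would look roughly like this (the exact arguments in the script may differ):

  # arc_challenge, 25-shot
  lm_eval --model hf \
    --model_args pretrained=google/gemma-3-27b-it \
    --tasks arc_challenge --num_fewshot 25 \
    --batch_size auto --output_path ./results

  # hellaswag, 10-shot
  lm_eval --model hf \
    --model_args pretrained=google/gemma-3-27b-it \
    --tasks hellaswag --num_fewshot 10 \
    --batch_size auto --output_path ./results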

☁️ Cloud GPU Setup

  • GPU: Pick a high-VRAM GPU (e.g. A100 80GB, RTX 6000 Ada) for Gemma 3 27B
  • Number of GPUs: Multi-GPU evaluation is supported via accelerate (see the sketch after this list)
  • REPO: Set to your fork of this repo (the container executes runpod.sh)
  • DEBUG: Keep the pod alive for manual inspection
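
On multi-GPU pods, lm-evaluation-harness can be launched through accelerate for data-parallel evaluation; a sketch assuming two GPUs and the example model ID:

  # One model replica per GPU; per-process results are gathered automatically
  accelerate launch --num_processes 2 -m lm_eval \
    --model hf \
    --model_args pretrained=google/gemma-3-27b-it \
    --tasks winogrande --num_fewshot 5 \
    --batch_size auto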

🧪 Hyperparameters (for Gemma3)

All generations use:

  • temperature: 1.0
  • top_p: 0.95
  • top_k: 64
  • do_sample: True (when applicable)
  • generation_config passed through gen_kwargs

GSM8K uses:

  • 8-shot
  • CoT prompting
  • Evaluation via exact_match
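
Taken together, this corresponds roughly to the following lm-evaluation-harness call; runpod.sh wires it up for you, so treat this as a sketch (gsm8k_cot and --gen_kwargs are harness names, not this fork's variables):

  # 8-shot chain-of-thought GSM8K with the Gemma 3 sampling settings,
  # scored by exact_match on the extracted final answer
  lm_eval --model hf \
    --model_args pretrained=google/gemma-3-27b-it \
    --tasks gsm8k_cot --num_fewshot 8 \
    --gen_kwargs temperature=1.0,top_p=0.95,top_k=64,do_sample=True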

🛠️ Troubleshooting

  • "700 Killed": Use larger GPU (e.g. A100)
  • "triu_tril_cuda_template": See issue #22
  • "File not found": Use DEBUG=true and check logs

📈 Benchmarks + Leaderboards

You can compare your results with public leaderboards such as the Hugging Face Open LLM Leaderboard (the openllm suite above mirrors its task set) or Nous-style benchmark tables (the nous suite).

🙏 Acknowledgements

This fork is based on mlabonne/llm-autoeval and builds on the work of that project and its contributors.

🧪 Colab Notebook

Open in Colab

Evaluate your Gemma 3 fork now →

MODEL_ID=your-model BENCHMARK=openllm bash runpod.sh

About

Automatically benchmark Google Gemma 3 models (e.g., gemma-3-27b-it) with correct n-shot, CoT, and decoding parameters in Google Colab with RunPod. This is a fork of mlabonne/llm-autoeval, adapted for compatibility with the Gemma 3 architecture and optimized to reproduce evaluation settings used in the official technical report.
