🧐 LLM AutoEval: Gemma3 Edition

Forked from mlabonne • 🧠 Supports Gemma 3 • 🚀 Colab Notebook

Official LLM AutoEval fork, updated for compatibility with Google's Gemma 3-based models.

Open In Colab

🔍 Overview

This fork of LLM AutoEval is optimized for evaluating Google's Gemma 3 models (e.g. gemma-3-27b-it) using the exact settings described in the official technical report. It ensures reproducible benchmark results and compatibility with standard evaluation suites.

Key Enhancements

  • Hyperparameter tuning for Gemma 3: temperature=1.0, top_p=0.95, top_k=64
  • Benchmarks configured to match Google's evaluation (e.g. gsm8k 8-shot, CoT)
  • Compatibility updates for new tokenizer, prompts, and padding behavior
  • Colab + RunPod-friendly setup
  • Option to enable/disable 4-bit quantization
  • Option to cap the number of evaluation samples for quick results when experimenting (see the sketch after this list)
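
Both toggles are wired up in this fork's configuration (see runpod.sh and the Colab notebook); underneath they correspond to standard lm-evaluation-harness options. A minimal sketch of the raw CLI equivalents, using the example model ID from this README (the fork's own variable names may differ):

  # Hedged sketch: 4-bit loading plus a sample cap for a quick smoke test
  lm_eval --model hf \
    --model_args pretrained=google/gemma-3-27b-it,load_in_4bit=True \
    --tasks arc_challenge --num_fewshot 25 \
    --limit 100   # evaluate only the first 100 examples of the task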

⚡ Quick Start

Evaluation

  • MODEL_ID: Enter the model ID from Hugging Face (e.g. google/gemma-3-27b-it or a compatible fork).
  • BENCHMARK:
    • nous: AGIEval, GPT4ALL, TruthfulQA, Bigbench (Teknium / Nous-style benchmark sweep)
    • openllm: ARC, HellaSwag, MMLU, Winogrande, GSM8K, TruthfulQA (Open LLM Leaderboard set)
    • lighteval: Hugging Face's task-level evaluator (e.g. HELM, PIQA, MATH, GSM8K). Use LIGHTEVAL_TASK to specify tasks.
  • LIGHTEVAL_TASK: Comma-separated list of task names (see recommended tasks)
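
Putting these together, a typical run sets them as environment variables in front of runpod.sh. The model ID and lighteval task names below are placeholders; use the recommended-tasks list for the exact strings:

  # Open LLM Leaderboard-style sweep
  MODEL_ID=google/gemma-3-27b-it BENCHMARK=openllm bash runpod.sh

  # Selected lighteval tasks only
  MODEL_ID=google/gemma-3-27b-it BENCHMARK=lighteval LIGHTEVAL_TASK="gsm8k,math" bash runpod.sh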

Benchmarks Implemented in runpod.sh

Each benchmark triggers specific task suites:

  • nous:

    • agieval_*
    • hellaswag, openbookqa, winogrande, arc_easy, arc_challenge, boolq, piqa
    • truthfulqa_mc
    • bigbench_*
  • openllm:

    • arc_challenge (25-shot)
    • hellaswag (10-shot, Char-Len norm)
    • mmlu (5-shot, Char-Len norm)
    • winogrande (5-shot, Accuracy)
    • gsm8k (updated to 8-shot + CoT)
    • truthfulqa
  • lighteval: Any HF-compatible task with --use_chat_template and custom LIGHTEVAL_TASK
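
Each suite resolves to lm-evaluation-harness task names, invoked with its own shot count. As a hedged sketch, two of the openllm entries run by hand outside runpod.sh would look roughly like this (the exact arguments in the script may differ):

  # arc_challenge, 25-shot
  lm_eval --model hf \
    --model_args pretrained=google/gemma-3-27b-it \
    --tasks arc_challenge --num_fewshot 25 \
    --batch_size auto --output_path ./results

  # hellaswag, 10-shot
  lm_eval --model hf \
    --model_args pretrained=google/gemma-3-27b-it \
    --tasks hellaswag --num_fewshot 10 \
    --batch_size auto --output_path ./results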

☁️ Cloud GPU Setup

  • GPU: Pick a high-VRAM GPU (e.g. A100 80GB, RTX 6000 Ada) for Gemma 3 27B
  • Number of GPUs: Multi-GPU evaluation is supported via accelerate (see the sketch after this list)
  • REPO: Set to your fork of this repo (the container executes runpod.sh)
  • DEBUG: Keep the pod alive for manual inspection
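
On multi-GPU pods, lm-evaluation-harness can be launched through accelerate for data-parallel evaluation; a sketch assuming two GPUs and the example model ID:

  # One model replica per GPU; per-process results are gathered automatically
  accelerate launch --num_processes 2 -m lm_eval \
    --model hf \
    --model_args pretrained=google/gemma-3-27b-it \
    --tasks winogrande --num_fewshot 5 \
    --batch_size auto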

🧪 Hyperparameters (for Gemma3)

All generations use:

  • temperature: 1.0
  • top_p: 0.95
  • top_k: 64
  • do_sample: True (when applicable)
  • generation_config passed through gen_kwargs

GSM8K uses:

  • 8-shot
  • CoT prompting
  • Evaluation via exact_match
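
Taken together, this corresponds roughly to the following lm-evaluation-harness call; runpod.sh wires it up for you, so treat this as a sketch (gsm8k_cot and --gen_kwargs are harness names, not this fork's variables):

  # 8-shot chain-of-thought GSM8K with the Gemma 3 sampling settings,
  # scored by exact_match on the extracted final answer
  lm_eval --model hf \
    --model_args pretrained=google/gemma-3-27b-it \
    --tasks gsm8k_cot --num_fewshot 8 \
    --gen_kwargs temperature=1.0,top_p=0.95,top_k=64,do_sample=True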

🛠️ Troubleshooting

  • "700 Killed": Use larger GPU (e.g. A100)
  • "triu_tril_cuda_template": See issue #22
  • "File not found": Use DEBUG=true and check logs

📈 Benchmarks + Leaderboards

You can compare your results with public leaderboards such as the Hugging Face Open LLM Leaderboard (the openllm suite above mirrors its task set) or Nous-style benchmark tables (the nous suite).

🙏 Acknowledgements

This fork is based on mlabonne/llm-autoeval and builds on the work of that project and its contributors.

🧪 Colab Notebook

Open in Colab

Evaluate your Gemma 3 fork now →

MODEL_ID=your-model BENCHMARK=openllm bash runpod.sh

About

Automatically benchmark Google Gemma 3 models (e.g., gemma-3-27b-it) with correct n-shot, CoT, and decoding parameters in Google Colab with RunPod. This is a fork of mlabonne/llm-autoeval, adapted for compatibility with the Gemma 3 architecture and optimized to reproduce evaluation settings used in the official technical report.
