
LLM-Judge evaluation

Use-cases:

  • evaluate one model against another on AlpacaEval (AE), Arena-Hard (AH), or m-Arena-Hard (m-AH)
  • easily swap the judge model
  • a common data format across AE/AH/m-AH

For both generation and LLM judging, any model available in LangChain should be usable in theory (I have only tested LlamaCpp and Together so far; I plan to also test vLLM and OpenAI).
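In practice, --model_provider and --model_kwargs (see the commands below) presumably select and configure a LangChain class. A minimal sketch of the assumed wiring, using the real langchain_community LlamaCpp class (the mapping itself is my assumption, not documented behavior):

# Hypothetical illustration of how a provider/kwargs pair maps to a LangChain model.
from langchain_community.llms import LlamaCpp

llm = LlamaCpp(model_path="jwiggerthale_Llama-3.2-3B-Q8_0-GGUF_llama-3.2-3b-q8_0.gguf")
print(llm.invoke("Summarize what an LLM judge does in one sentence."))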

Generate completions

To generate completions, run something like this:

python generate.py \
--model_provider LlamaCpp \
--model_kwargs model_path=jwiggerthale_Llama-3.2-3B-Q8_0-GGUF_llama-3.2-3b-q8_0.gguf max_retries=3 \
--output_path results/llama-3.2-3b-q8_0.csv.zip \
--n_instructions 10
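The output is a zipped CSV. A quick sanity check with pandas (the exact column names are an assumption; inspect the file to confirm):

import pandas as pd

# pandas infers zip compression from the .csv.zip extension
df = pd.read_csv("results/llama-3.2-3b-q8_0.csv.zip")
print(df.columns.tolist())  # e.g. an instruction column and an output column
print(df.head())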

Evaluate a model available in AlpacaEval against local predictions

To run LLM-judge evaluations, run something like this:

python llmjudgeeval/evaluate.py \
--dataset alpaca-eval \
--method_A gpt4_1106_preview \
--method_B alpaca-eval-gpt-3.5-turbo.csv.zip \
--judge_provider Together \
--judge_model meta-llama/Llama-3.3-70B-Instruct-Turbo \
--n_instructions 10

Note that method_A and method_B should each be either a method that exists in the chosen dataset or a local completions file such as alpaca-eval-gpt-3.5-turbo.csv.zip.
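To build such a local file yourself, something like the following should work; the instruction/output schema is an assumption based on the example file above, so check the repo's loader for the exact expected columns:

import pandas as pd

# Hypothetical schema: one row per instruction with the model's completion.
df = pd.DataFrame(
    {
        "instruction": ["What is the capital of France?"],
        "output": ["Paris."],
    }
)
df.to_csv("my-model-completions.csv.zip", index=False)  # zip inferred from extension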

To run the judge on vLLM, just change the judge provider; the other supported providers (LlamaCpp, ChatOpenAI, ...) are swapped in the same way:

python llmjudgeeval/evaluate.py \
--dataset alpaca-eval \
--method_A gpt4_1106_preview \
--method_B alpaca-eval-gpt-3.5-turbo.csv.zip \
--judge_provider VLLM \
--judge_model meta-llama/Meta-Llama-3-8B-Instruct \
--n_instructions 10

TODOs:

  • support m-arena-hard [high/large]
    • instruction loader [DONE]
    • generate instructions for two models
    • make comparison
  • support evaluation with input swap [medium/small] (see the sketch after this list)
  • test vLLM judge [medium/small]
  • handle errors [medium/small]
  • CLI launcher [medium/large]
  • document options [medium/large]
  • add details to example to generate and evaluate completions [medium/medium]
  • CI [high/large]
    • implement CI judge option
    • implement domain filter in CI (maybe pass a regexp by column?)
  • cost?
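
On the input-swap item: LLM judges are known to prefer answers by position, so a common mitigation is to judge each pair twice with the answer order swapped and average the two verdicts. A minimal sketch, assuming a hypothetical judge(instruction, answer_a, answer_b) callable that returns "A" or "B":

def swapped_preference(judge, instruction, answer_a, answer_b):
    """Judge both orders to reduce position bias; return A's win share."""
    first = judge(instruction, answer_a, answer_b)   # A shown first
    second = judge(instruction, answer_b, answer_a)  # A shown second
    wins_a = int(first == "A") + int(second == "B")
    return wins_a / 2  # 1.0, 0.5 (split verdict), or 0.0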

Done:

  • support alpaca-eval
  • support arena-hard
  • test Together judge
  • local env variable to set paths
  • tqdm callback with batch
  • support loading local completions
  • support dumping outputs [medium/small]
  • test LlamaCpp [medium/small]
  • test OpenAI judge
