The open-source LLMOps platform: prompt playground, prompt management, LLM evaluation, and LLM Observability all in one place.
Evaluate your LLM's response with Prometheus and GPT4 💯
🤠 Agent-as-a-Judge and DevAI dataset
[ICLR 2025] xFinder: Large Language Models as Automated Evaluators for Reliable Evaluation
CodeUltraFeedback: aligning large language models to coding preferences
This is the repo for the survey of Bias and Fairness in IR with LLMs.
Official implementation for "MJ-Bench: Is Your Multimodal Reward Model Really a Good Judge for Text-to-Image Generation?"
Code and data for "Timo: Towards Better Temporal Reasoning for Language Models" (COLM 2024)
Code and data for Koo et al.'s ACL 2024 paper "Benchmarking Cognitive Biases in Large Language Models as Evaluators"
A comprehensive study of the LLM-as-a-judge paradigm in a controlled setup that reveals new results about its strengths and weaknesses.
A set of examples demonstrating how to evaluate Generative AI-augmented systems using traditional information retrieval and LLM-as-a-judge validation techniques
LLM-as-judge evals as Semantic Kernel Plugins
Notebooks for evaluating LLM-based applications using the model (LLM)-as-a-judge pattern (see the sketch after this list)
Controversial Questions for Argumentation and Retrieval
Use Groq for evaluations
Explore techniques to use small models as jailbreaking judges
Harnessing Large Language Models for Curated Code Reviews
Antibodies for LLM hallucinations (combining LLM-as-a-judge, NLI, and reward models)
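The repositories above share the same core pattern: a grader model scores another model's output against a rubric or reference. The following is a minimal sketch of that pattern, assuming the openai Python client (>=1.0) with an API key in the environment; the judge model name, rubric wording, and the `judge` helper are illustrative placeholders and are not taken from any specific repository listed here.

```python
# Minimal LLM-as-a-judge sketch: score a model answer against a reference
# answer with a grader model. Assumes the openai Python package (>=1.0)
# and OPENAI_API_KEY set in the environment.
from openai import OpenAI

client = OpenAI()

# Illustrative rubric: 1-5 factual-correctness score, digit-only reply.
JUDGE_PROMPT = """You are an impartial judge. Rate the ANSWER against the
REFERENCE on a 1-5 scale for factual correctness. Reply with the digit only.

QUESTION: {question}
REFERENCE: {reference}
ANSWER: {answer}"""


def judge(question: str, reference: str, answer: str) -> int:
    """Ask the grader model for a 1-5 correctness score (placeholder helper)."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder judge model
        temperature=0,        # deterministic grading
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(
                question=question, reference=reference, answer=answer
            ),
        }],
    )
    # Assumes the judge follows the digit-only format from the prompt.
    return int(response.choices[0].message.content.strip())


if __name__ == "__main__":
    score = judge(
        question="What is the capital of France?",
        reference="Paris",
        answer="The capital of France is Paris.",
    )
    print(f"Judge score: {score}")
```

Most of the evaluation frameworks and benchmarks above elaborate on this loop: richer rubrics, bias controls, reference-free grading, or ensembles of judges.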