The open-source LLMOps platform: prompt playground, prompt management, LLM evaluation, and LLM Observability all in one place.
Evaluate your LLM's response with Prometheus and GPT4 💯
🤠 Agent-as-a-Judge and DevAI dataset
[ICLR 2025] xFinder: Large Language Models as Automated Evaluators for Reliable Evaluation
CodeUltraFeedback: aligning large language models to coding preferences
This is the repo for the survey of Bias and Fairness in IR with LLMs.
Official implementation for "MJ-Bench: Is Your Multimodal Reward Model Really a Good Judge for Text-to-Image Generation?"
Code and data for "Timo: Towards Better Temporal Reasoning for Language Models" (COLM 2024)
Code and data for Koo et al.'s ACL 2024 paper "Benchmarking Cognitive Biases in Large Language Models as Evaluators"
A comprehensive study of the LLM-as-a-judge paradigm in a controlled setup that reveals new results about its strengths and weaknesses.
A set of examples demonstrating how to evaluate Generative AI-augmented systems using traditional information retrieval and LLM-as-a-judge validation techniques
LLM-as-judge evals as Semantic Kernel Plugins
Notebooks for evaluating LLM-based applications using the model (LLM)-as-a-judge pattern (see the sketch after this list)
Controversial Questions for Argumentation and Retrieval
Use Groq for evaluations
Explore techniques to use small models as jailbreaking judges
Harnessing Large Language Models for Curated Code Reviews
Antibodies for LLM hallucinations (combining LLM-as-a-judge, NLI, and reward models)
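The repositories above share the same core pattern: a grader model scores another model's output against a rubric or reference. The following is a minimal sketch of that pattern, assuming the openai Python client (>=1.0) with an API key in the environment; the judge model name, rubric wording, and the `judge` helper are illustrative placeholders and are not taken from any specific repository listed here.

```python
# Minimal LLM-as-a-judge sketch: score a model answer against a reference
# answer with a grader model. Assumes the openai Python package (>=1.0)
# and OPENAI_API_KEY set in the environment.
from openai import OpenAI

client = OpenAI()

# Illustrative rubric: 1-5 factual-correctness score, digit-only reply.
JUDGE_PROMPT = """You are an impartial judge. Rate the ANSWER against the
REFERENCE on a 1-5 scale for factual correctness. Reply with the digit only.

QUESTION: {question}
REFERENCE: {reference}
ANSWER: {answer}"""


def judge(question: str, reference: str, answer: str) -> int:
    """Ask the grader model for a 1-5 correctness score (placeholder helper)."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder judge model
        temperature=0,        # deterministic grading
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(
                question=question, reference=reference, answer=answer
            ),
        }],
    )
    # Assumes the judge follows the digit-only format from the prompt.
    return int(response.choices[0].message.content.strip())


if __name__ == "__main__":
    score = judge(
        question="What is the capital of France?",
        reference="Paris",
        answer="The capital of France is Paris.",
    )
    print(f"Judge score: {score}")
```

Most of the evaluation frameworks and benchmarks above elaborate on this loop: richer rubrics, bias controls, reference-free grading, or ensembles of judges.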