feat: Add multi-model LLM benchmarking notebook #384
+2,441
−0
This PR introduces a Jupyter Notebook (.ipynb) that provides an end-to-end framework for evaluating and benchmarking multiple Large Language Models from various providers.
What this notebook contains:
Dynamic Client Initialization: The notebook automatically detects available API keys (OpenAI, Groq, DeepSeek) from environment variables and checks for local services (Ollama) to dynamically configure the list of competitor models.
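A minimal sketch of how this detection could look (the environment-variable names, model names, and base URLs below are illustrative assumptions, not taken from the diff):

```python
import os
import requests
from openai import OpenAI

competitors = {}  # model name -> configured client

# Hosted providers: register a client only if the corresponding key is set.
if os.getenv("OPENAI_API_KEY"):
    competitors["gpt-4o-mini"] = OpenAI()
if os.getenv("GROQ_API_KEY"):
    competitors["llama-3.3-70b-versatile"] = OpenAI(
        api_key=os.getenv("GROQ_API_KEY"),
        base_url="https://api.groq.com/openai/v1",
    )
if os.getenv("DEEPSEEK_API_KEY"):
    competitors["deepseek-chat"] = OpenAI(
        api_key=os.getenv("DEEPSEEK_API_KEY"),
        base_url="https://api.deepseek.com/v1",
    )

# Local Ollama: a simple health check against its default port.
try:
    if requests.get("http://localhost:11434", timeout=2).ok:
        competitors["llama3.2"] = OpenAI(
            api_key="ollama", base_url="http://localhost:11434/v1"
        )
except requests.RequestException:
    pass  # Ollama not running; skip local models

print("Configured competitors:", list(competitors))
```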
Weighted Question Generation: Implements a two-step process to generate the evaluation questions (sketched below):
1. Generates three distinct, nuanced questions.
2. Uses a "Consultant" LLM (gpt-4o-mini) to rank these questions and assign weights (50, 30, and 20 points).
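A rough sketch of these two steps, assuming gpt-4o-mini is also used for the generation step and that both prompts and the JSON shape are placeholders rather than the notebook's exact wording:

```python
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

# Step 1: generate three distinct, nuanced evaluation questions.
gen = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{
        "role": "user",
        "content": "Propose three distinct, nuanced questions for evaluating "
                   "LLMs. Return them as a numbered list, one per line.",
    }],
)
questions = [q.strip() for q in gen.choices[0].message.content.splitlines() if q.strip()]

# Step 2: the "Consultant" LLM ranks the questions and assigns 50/30/20 weights.
rank = client.chat.completions.create(
    model="gpt-4o-mini",
    response_format={"type": "json_object"},
    messages=[{
        "role": "user",
        "content": (
            "Rank these questions from hardest to easiest and assign weights "
            "of 50, 30 and 20 points respectively. Reply as JSON of the form "
            '{"ranked": [{"question": "...", "weight": 50}, ...]}.\n\n'
            + "\n".join(questions)
        ),
    }],
)
weighted_questions = json.loads(rank.choices[0].message.content)["ranked"]
```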
Parallel Query Execution: A loop iterates through all configured models and collects their responses to all three weighted questions.
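The collection loop might look roughly like this, reusing the hypothetical `competitors` and `weighted_questions` structures from the sketches above:

```python
# Collect every competitor's answer to each weighted question.
results = {}
for model_name, client in competitors.items():
    answers = []
    for item in weighted_questions:
        resp = client.chat.completions.create(
            model=model_name,
            messages=[{"role": "user", "content": item["question"]}],
        )
        answers.append({
            "question": item["question"],
            "weight": item["weight"],
            "answer": resp.choices[0].message.content,
        })
    results[model_name] = answers
```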
Dual-Weight "Judge" Scoring (sketch below):
All responses are formatted and sent to a single "Judge" LLM (gpt-4o).
The Judge provides two scores for each answer: a judge_score (60% weight) and a peer_average_score (40% weight).
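A hedged sketch of the judging call, assuming the `results` dict from the querying step and treating the prompt and output schema as placeholders:

```python
import json
from openai import OpenAI

judge = OpenAI()

# Format every model's answers into a single prompt for the Judge.
transcript = json.dumps(results, indent=2)

judging = judge.chat.completions.create(
    model="gpt-4o",
    response_format={"type": "json_object"},
    messages=[{
        "role": "user",
        "content": (
            "You are judging LLM answers. For every model and every question, "
            "return a judge_score and a peer_average_score, both out of 100, "
            'as JSON of the form {"<model>": [{"question": "...", '
            '"judge_score": 0, "peer_average_score": 0}, ...]}.\n\n' + transcript
        ),
    }],
)
scores = json.loads(judging.choices[0].message.content)
```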
Final Score Calculation: A script parses the Judge's nested JSON output, applies the answer-level (60/40) weights, and then applies the question-level (50/30/20) weights to calculate a final score for each model.
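For example, the weighting could be applied like this (assuming the `scores` shape above and that the Judge returns answers in the same question order as `results`):

```python
# Blend the two answer-level scores (60% judge, 40% peer average), then apply
# the question-level weights (50/30/20), which sum to 100, so the maximum is 100.
final_scores = {}
for model_name, per_question in scores.items():
    total = 0.0
    for entry, asked in zip(per_question, results[model_name]):
        blended = 0.6 * entry["judge_score"] + 0.4 * entry["peer_average_score"]
        total += blended * asked["weight"] / 100
    final_scores[model_name] = round(total, 2)

print(final_scores)
```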
Results Visualization: Uses matplotlib to generate a horizontal bar chart of the final results, highlighting the winning model in a different color.
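A minimal version of such a chart, using the hypothetical `final_scores` dict from the previous sketch:

```python
import matplotlib.pyplot as plt

# Horizontal bar chart of final scores, with the winner in a different colour.
names = list(final_scores)
values = [final_scores[n] for n in names]
winner = max(final_scores, key=final_scores.get)
colors = ["tab:orange" if n == winner else "tab:blue" for n in names]

fig, ax = plt.subplots(figsize=(8, 4))
ax.barh(names, values, color=colors)
ax.set_xlabel("Final weighted score")
ax.set_title("LLM benchmark results")
ax.invert_yaxis()  # first-listed model at the top
plt.tight_layout()
plt.show()
```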
Dependencies:
The notebook requires the following Python packages:
openai
requests (for the Ollama health check)
matplotlib (for the final graph)
numpy
jupyter