@xeroxpro commented Nov 6, 2025

This PR introduces a Jupyter Notebook (.ipynb) that provides an end-to-end framework for evaluating and benchmarking multiple Large Language Models from various providers.

What this notebook contains:
Dynamic Client Initialization: The notebook automatically detects available API keys (OpenAI, Groq, DeepSeek) from environment variables and checks for local services (Ollama) to dynamically configure the list of competitor models.
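
A minimal sketch of how this detection step could look. The environment variable names, the provider labels, and the Ollama URL (`http://localhost:11434`, Ollama's usual default) are assumptions for illustration, not taken from the notebook itself.

```python
import os
from typing import Callable, List, Mapping

# Assumed mapping from environment variable to provider label (hypothetical names).
PROVIDER_ENV_VARS = {
    "OPENAI_API_KEY": "openai",
    "GROQ_API_KEY": "groq",
    "DEEPSEEK_API_KEY": "deepseek",
}


def ollama_is_running(url: str = "http://localhost:11434", timeout: float = 1.0) -> bool:
    """Return True if a local Ollama server answers at the assumed default URL."""
    import requests  # imported lazily so the rest works without requests installed

    try:
        return requests.get(url, timeout=timeout).ok
    except requests.RequestException:
        return False


def detect_providers(
    env: Mapping[str, str] = os.environ,
    ollama_check: Callable[[], bool] = ollama_is_running,
) -> List[str]:
    """List providers whose API key is set and non-empty, plus Ollama if reachable."""
    providers = [name for var, name in PROVIDER_ENV_VARS.items() if env.get(var)]
    if ollama_check():
        providers.append("ollama")
    return providers
```

Passing `env` and `ollama_check` as parameters keeps the function testable without real keys or a running server, for example `detect_providers({"GROQ_API_KEY": "gsk-demo"}, lambda: False)`.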

Weighted Question Generation: Implements a two-step process to generate the evaluation questions:

Generates three distinct, nuanced questions.

Uses a "Consultant" LLM (gpt-4o-mini) to rank these questions and assign weights (50, 30, 20 points).
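
The weighting step can be sketched roughly as follows. This assumes the Consultant returns a ranking as a list of question indices, best first, and that the 50/30/20 points are handed out in rank order; both the function name and that response shape are hypothetical.

```python
from typing import Dict, List, Sequence, Union

# Points for rank 1, 2 and 3, as described in the PR.
RANK_WEIGHTS = (50, 30, 20)


def assign_weights(
    questions: Sequence[str], ranking: Sequence[int]
) -> List[Dict[str, Union[str, int]]]:
    """Pair each question with a weight according to its rank (best first)."""
    if len(questions) != len(RANK_WEIGHTS):
        raise ValueError("expected exactly three questions")
    if sorted(ranking) != list(range(len(questions))):
        raise ValueError("ranking must be a permutation of the question indices")
    return [
        {"question": questions[index], "weight": weight}
        for index, weight in zip(ranking, RANK_WEIGHTS)
    ]
```

Validating the ranking before use guards against a Consultant reply that repeats or omits a question.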

Parallel Query Execution: A loop iterates through all configured models, sends each one the three weighted questions, and collects the responses.
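
A sequential sketch of such a collection loop. The `ask` callable stands in for whatever client call the notebook makes and is an assumption here, as is recording a failed call as `None` instead of stopping the run.

```python
from typing import Callable, Dict, List, Optional, Sequence


def collect_responses(
    models: Sequence[str],
    questions: Sequence[str],
    ask: Callable[[str, str], str],
) -> Dict[str, List[Optional[str]]]:
    """Ask every model every question; a failed call is stored as None."""
    responses: Dict[str, List[Optional[str]]] = {}
    for model in models:
        answers: List[Optional[str]] = []
        for question in questions:
            try:
                answers.append(ask(model, question))
            except Exception:
                # One failing provider should not abort the whole benchmark.
                answers.append(None)
        responses[model] = answers
    return responses
```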

Dual-Weight "Judge" Scoring:

All responses are formatted and sent to a single "Judge" LLM (gpt-4o).

The Judge provides two scores for each answer: a judge_score (60% weight) and a peer_average_score (40% weight).

Final Score Calculation: A script parses the Judge's nested JSON output, applies the answer-level (60/40) weights, and then applies the question-level (50/30/20) weights to calculate a final score for each model.
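
To make the two layers of weighting concrete, here is a small sketch. It assumes the Judge returns JSON shaped as `{model: [{"judge_score": ..., "peer_average_score": ...}, ...]}` with one entry per question in weight order, and that scores are on a common numeric scale; the actual schema in the notebook may differ.

```python
import json
from typing import Dict

# Answer-level weights (60/40) and question-level weights (50/30/20) from the PR.
ANSWER_WEIGHTS = {"judge_score": 0.6, "peer_average_score": 0.4}
QUESTION_WEIGHTS = (50, 30, 20)


def final_scores(judge_json: str) -> Dict[str, float]:
    """Blend the two scores per answer, then weight each answer by its question."""
    data = json.loads(judge_json)
    totals: Dict[str, float] = {}
    for model, answers in data.items():
        total = 0.0
        for answer, question_weight in zip(answers, QUESTION_WEIGHTS):
            blended = sum(answer[key] * w for key, w in ANSWER_WEIGHTS.items())
            total += blended * question_weight / 100
        totals[model] = round(total, 2)
    return totals
```

For a model scoring (8, 6), (7, 9) and (10, 5) as (judge, peer) pairs, the blended answers are 7.2, 7.8 and 8.0, giving 7.2 × 0.5 + 7.8 × 0.3 + 8.0 × 0.2 = 7.54.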

Results Visualization: Uses matplotlib to generate a horizontal bar chart of the final results, highlighting the winning model in a different color.
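
A sketch of a chart along these lines, assuming the final scores are held in a `dict` of model name to score. The colors, labels, and function names are illustrative choices, not the notebook's.

```python
from typing import Dict, List


def bar_colors(
    scores: Dict[str, float],
    winner_color: str = "tab:orange",
    default_color: str = "tab:blue",
) -> List[str]:
    """Return one color per model, using winner_color for the top score."""
    best = max(scores.values())
    return [winner_color if s == best else default_color for s in scores.values()]


def plot_results(scores: Dict[str, float]):
    """Draw a horizontal bar chart of final scores with the winner highlighted."""
    import matplotlib.pyplot as plt  # imported lazily; only needed for plotting

    names = list(scores)
    fig, ax = plt.subplots()
    ax.barh(names, [scores[n] for n in names], color=bar_colors(scores))
    ax.set_xlabel("Final weighted score")
    ax.set_title("LLM benchmark results")
    fig.tight_layout()
    return fig
```

Note that a tie for first place highlights every tied model, since the comparison is against the maximum score.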

Dependencies:
The notebook requires the following Python packages:

openai

requests (for the Ollama health check)

matplotlib (for the final graph)

numpy

jupyter

Copilot AI review requested due to automatic review settings November 6, 2025 00:25

Copilot AI left a comment

Copilot wasn't able to review any files in this pull request.

@ed-donner
Owner

Hey - this is great - but would you be OK to clear the outputs first for me? Otherwise it's 2,000+ loc
