feat: Add multi-model LLM benchmarking notebook #384
+2,441
−0
This PR introduces a Jupyter Notebook (.ipynb) that provides an end-to-end framework for evaluating and benchmarking multiple Large Language Models from various providers.
What this notebook contains:
Dynamic Client Initialization: The notebook automatically detects available API keys (OpenAI, Groq, DeepSeek) from environment variables and checks for local services (Ollama) to dynamically configure the list of competitor models.
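A minimal sketch of how this detection could look (the environment-variable names, model names, and base URLs below are illustrative assumptions, not taken from the diff):

```python
import os
import requests
from openai import OpenAI

competitors = {}  # model name -> configured client

# Hosted providers: register a client only if the corresponding key is set.
if os.getenv("OPENAI_API_KEY"):
    competitors["gpt-4o-mini"] = OpenAI()
if os.getenv("GROQ_API_KEY"):
    competitors["llama-3.3-70b-versatile"] = OpenAI(
        api_key=os.getenv("GROQ_API_KEY"),
        base_url="https://api.groq.com/openai/v1",
    )
if os.getenv("DEEPSEEK_API_KEY"):
    competitors["deepseek-chat"] = OpenAI(
        api_key=os.getenv("DEEPSEEK_API_KEY"),
        base_url="https://api.deepseek.com/v1",
    )

# Local Ollama: a simple health check against its default port.
try:
    if requests.get("http://localhost:11434", timeout=2).ok:
        competitors["llama3.2"] = OpenAI(
            api_key="ollama", base_url="http://localhost:11434/v1"
        )
except requests.RequestException:
    pass  # Ollama not running; skip local models

print("Configured competitors:", list(competitors))
```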
Weighted Question Generation: Implements a two-step process to generate the evaluation questions (sketched below):
1. Generates three distinct, nuanced questions.
2. Uses a "Consultant" LLM (gpt-4o-mini) to rank these questions and assign weights (50, 30, and 20 points).
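A rough sketch of these two steps, assuming gpt-4o-mini is also used for the generation step and that both prompts and the JSON shape are placeholders rather than the notebook's exact wording:

```python
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

# Step 1: generate three distinct, nuanced evaluation questions.
gen = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{
        "role": "user",
        "content": "Propose three distinct, nuanced questions for evaluating "
                   "LLMs. Return them as a numbered list, one per line.",
    }],
)
questions = [q.strip() for q in gen.choices[0].message.content.splitlines() if q.strip()]

# Step 2: the "Consultant" LLM ranks the questions and assigns 50/30/20 weights.
rank = client.chat.completions.create(
    model="gpt-4o-mini",
    response_format={"type": "json_object"},
    messages=[{
        "role": "user",
        "content": (
            "Rank these questions from hardest to easiest and assign weights "
            "of 50, 30 and 20 points respectively. Reply as JSON of the form "
            '{"ranked": [{"question": "...", "weight": 50}, ...]}.\n\n'
            + "\n".join(questions)
        ),
    }],
)
weighted_questions = json.loads(rank.choices[0].message.content)["ranked"]
```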
Parallel Query Execution: A loop iterates through all configured models and collects their responses to all three weighted questions.
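The collection loop might look roughly like this, reusing the hypothetical `competitors` and `weighted_questions` structures from the sketches above:

```python
# Collect every competitor's answer to each weighted question.
results = {}
for model_name, client in competitors.items():
    answers = []
    for item in weighted_questions:
        resp = client.chat.completions.create(
            model=model_name,
            messages=[{"role": "user", "content": item["question"]}],
        )
        answers.append({
            "question": item["question"],
            "weight": item["weight"],
            "answer": resp.choices[0].message.content,
        })
    results[model_name] = answers
```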
Dual-Weight "Judge" Scoring (sketch below):
All responses are formatted and sent to a single "Judge" LLM (gpt-4o).
The Judge provides two scores for each answer: a judge_score (60% weight) and a peer_average_score (40% weight).
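A hedged sketch of the judging call, assuming the `results` dict from the querying step and treating the prompt and output schema as placeholders:

```python
import json
from openai import OpenAI

judge = OpenAI()

# Format every model's answers into a single prompt for the Judge.
transcript = json.dumps(results, indent=2)

judging = judge.chat.completions.create(
    model="gpt-4o",
    response_format={"type": "json_object"},
    messages=[{
        "role": "user",
        "content": (
            "You are judging LLM answers. For every model and every question, "
            "return a judge_score and a peer_average_score, both out of 100, "
            'as JSON of the form {"<model>": [{"question": "...", '
            '"judge_score": 0, "peer_average_score": 0}, ...]}.\n\n' + transcript
        ),
    }],
)
scores = json.loads(judging.choices[0].message.content)
```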
Final Score Calculation: A script parses the Judge's nested JSON output, applies the answer-level (60/40) weights, and then applies the question-level (50/30/20) weights to calculate a final score for each model.
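For example, the weighting could be applied like this (assuming the `scores` shape above and that the Judge returns answers in the same question order as `results`):

```python
# Blend the two answer-level scores (60% judge, 40% peer average), then apply
# the question-level weights (50/30/20), which sum to 100, so the maximum is 100.
final_scores = {}
for model_name, per_question in scores.items():
    total = 0.0
    for entry, asked in zip(per_question, results[model_name]):
        blended = 0.6 * entry["judge_score"] + 0.4 * entry["peer_average_score"]
        total += blended * asked["weight"] / 100
    final_scores[model_name] = round(total, 2)

print(final_scores)
```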
Results Visualization: Uses matplotlib to generate a horizontal bar chart of the final results, highlighting the winning model in a different color.
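A minimal version of such a chart, using the hypothetical `final_scores` dict from the previous sketch:

```python
import matplotlib.pyplot as plt

# Horizontal bar chart of final scores, with the winner in a different colour.
names = list(final_scores)
values = [final_scores[n] for n in names]
winner = max(final_scores, key=final_scores.get)
colors = ["tab:orange" if n == winner else "tab:blue" for n in names]

fig, ax = plt.subplots(figsize=(8, 4))
ax.barh(names, values, color=colors)
ax.set_xlabel("Final weighted score")
ax.set_title("LLM benchmark results")
ax.invert_yaxis()  # first-listed model at the top
plt.tight_layout()
plt.show()
```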
Dependencies:
The notebook requires the following Python packages:
openai
requests (for the Ollama health check)
matplotlib (for the final graph)
numpy
jupyter