This repo is part of the internship application for the "Evaluation of LLM-Based Agentic Systems for Software Development Tasks" project at JetBrains.

Evaluation of an LLM-driven Python bug-fixing agent on HumanEval.

Example results from a single evaluation session:

- Log file: `/home/mcseem/PycharmProjects/PyFixik/logs/humaneval_qwen3_1.7b_20250926_112412.json`
- Session: `humaneval_qwen3_1.7b_20250926_112412`
- created_utc: 2025-09-26T11:24:12Z
- model: `qwen3:1.7b`
- temperature: 0.8
- k_runs: 1
- Totals: tasks=164, runs=164
- pass@1: 25.00%
- Averages over runs:
  - time_elapsed_sec: 22.357
  - n_rounds: 6.738
  - tool_calls_total: 3.037
- tool_calls_by_tool (avg per run):
  - get_buggy_code: 1.207
  - run_tests_on_fixed: 0.780
  - write_fixed_solution: 0.659
  - get_tests: 0.232
  - run_tests_on_buggy: 0.140
  - read_fixed_solution: 0.018
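
Since `k_runs` is 1 here, pass@1 is simply the fraction of tasks whose proposed fix passes the tests (41 of 164). For `k_runs` > 1, the unbiased pass@k estimator from the original HumanEval evaluation could be applied per task; the sketch below is illustrative and not necessarily how `aggregate_logs.py` computes the metric.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: n attempts for a task, c of them passing."""
    if n - c < k:
        return 1.0  # every size-k sample contains at least one passing attempt
    return 1.0 - comb(n - c, k) / comb(n, k)

# With a single run per task this reduces to the plain success rate:
assert pass_at_k(n=1, c=1, k=1) == 1.0
assert pass_at_k(n=1, c=0, k=1) == 0.0
```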

The project has the following components:

| Script | Purpose | Arguments |
|---|---|---|
| `run_eval.py` | Run the benchmark end-to-end | `--model MODEL`: Ollama model name to use<br>`--temperature TEMP`: sampling temperature for the LLM<br>`--k-runs`: how many independent attempts per task to run<br>`--verbosity`: stdout verbosity |
| `aggregate_logs.py` | Aggregate a session log and print a summary with metrics | `--file, -f FILE`: path to the aggregated JSON log<br>`--latest`: pick the most recent log automatically |

- `eval.py` - Dataset loader and evaluator loop. Seeds `workspace/`, runs the agent, executes tests in the sandbox, and logs results.
- `agent.py` - Builds a LangGraph ReAct agent backed by langchain-ollama `ChatOllama` and wired with the tools defined below.
- `tools.py` - Tool functions the agent can call.
- `sandbox.py` - Minimal, Docker-based Python sandbox (no network, CPU/mem/PID limits) used to run code and tests safely.
- `logger.py` - Writes per-run JSONL and keeps an up-to-date aggregated JSON file with session metadata and all runs.
- `workspace/` - Scratch directory used by the tools:
  - `buggy.py` - seeded with the buggy function for the current task
  - `tests.py` - unit tests for the task
  - `fixed.py` - the agent's proposed fix
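
For orientation, here is a minimal sketch of how such an agent can be wired together with `create_react_agent` from LangGraph and `ChatOllama` from langchain-ollama. The tool bodies, prompt, and model settings are illustrative and not copied from `agent.py` or `tools.py`.

```python
from pathlib import Path

from langchain_core.tools import tool
from langchain_ollama import ChatOllama
from langgraph.prebuilt import create_react_agent

WORKSPACE = Path("workspace")

@tool
def get_buggy_code() -> str:
    """Return the contents of workspace/buggy.py."""
    return (WORKSPACE / "buggy.py").read_text()

@tool
def write_fixed_solution(code: str) -> str:
    """Write the proposed fix to workspace/fixed.py."""
    (WORKSPACE / "fixed.py").write_text(code)
    return "wrote workspace/fixed.py"

llm = ChatOllama(model="qwen3:1.7b", temperature=0.8)
agent = create_react_agent(llm, tools=[get_buggy_code, write_fixed_solution])

result = agent.invoke(
    {"messages": [("user", "Read the buggy code, fix the bug, and write the fixed solution.")]}
)
print(result["messages"][-1].content)
```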

Python (via `requirements.txt`):

- langchain-ollama
- langgraph~=0.6.7
- langchain-core~=0.3.76
- datasets~=4.1.1
- tqdm~=4.66.5

Non-Python:

- Docker (required for sandboxed execution)
- Ollama (local LLM runtime), with a suitable model pulled (default used in the examples: `qwen3:1.7b`)

Install system deps:

- Install Docker and ensure your user can run it (`docker run hello-world`).
- Install Ollama and pull a model, e.g. `ollama pull qwen3:1.7b`.

Create a Python environment and install packages:

- `python -m venv .venv && source .venv/bin/activate`
- `pip install -r requirements.txt`

Run the evaluation:

- `python run_eval.py --model qwen3:1.7b --temperature 0.8 --k-runs 1 --verbosity 0`

Aggregate results:

- `python aggregate_logs.py --latest`
- Or specify a file: `python aggregate_logs.py --file logs/humaneval_<...>.json`

For each task from the dataset:

- The `workspace/` subdirectory is seeded with the buggy implementation and tests from the dataset.
- A ReAct agent is built with access to the tools from `tools.py`.
- The agent attempts to solve the task, writing the solution to `workspace/fixed.py`.
- Each attempt ("run") is logged by `logger.py`.

This is repeated for `k_runs` attempts per task.
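
A minimal sketch of the dataset-seeding step, assuming HumanEval is loaded from the Hugging Face hub via `datasets`; how `eval.py` actually derives the buggy variant of each solution is specific to the repo and not shown here.

```python
from pathlib import Path

from datasets import load_dataset

workspace = Path("workspace")
workspace.mkdir(exist_ok=True)

dataset = load_dataset("openai_humaneval", split="test")  # 164 tasks
task = dataset[0]

# Seed the task's unit tests; buggy.py would hold a deliberately broken
# implementation of the function described in task["prompt"].
(workspace / "tests.py").write_text(task["test"])
```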

Code execution happens inside a Docker container (`python:3.11-alpine`) with:

- network disabled
- CPU/memory/PID limits
- read-only root filesystem
- `/tmp` as tmpfs
- the workspace mounted read-write at `/workspace`
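
As a rough illustration of what `sandbox.py` does, a `docker run` invocation with these restrictions could look like the sketch below; the exact flags, limits, and test command in the repo may differ.

```python
import subprocess
from pathlib import Path

def run_in_sandbox(command: str, workspace: Path, timeout: int = 60) -> subprocess.CompletedProcess:
    """Run a shell command inside a locked-down python:3.11-alpine container."""
    return subprocess.run(
        [
            "docker", "run", "--rm",
            "--network", "none",      # network disabled
            "--memory", "256m",       # memory limit
            "--cpus", "1",            # CPU limit
            "--pids-limit", "64",     # PID limit
            "--read-only",            # read-only root filesystem
            "--tmpfs", "/tmp",        # /tmp as tmpfs
            "-v", f"{workspace.resolve()}:/workspace:rw",  # workspace mounted read-write
            "-w", "/workspace",
            "python:3.11-alpine",
            "sh", "-c", command,
        ],
        capture_output=True,
        text=True,
        timeout=timeout,
    )

# Example: run_in_sandbox("python tests.py", Path("workspace"))
```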

The following tools expose the agent to the context it needs to solve the task:

- `get_buggy_code`: read `workspace/buggy.py`
- `get_tests`: read `workspace/tests.py`
- `write_fixed_solution`: write `workspace/fixed.py`
- `read_fixed_solution`: read `workspace/fixed.py`
- `run_tests_on_buggy`: execute tests against `buggy.py` (in a sandbox)
- `run_tests_on_fixed`: execute tests against `fixed.py` (in a sandbox)
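
To show how these tools connect to the sandbox, here is one illustrative implementation in the spirit of `tools.py`, reusing the `run_in_sandbox` helper sketched above; the actual tool bodies in the repo may differ.

```python
from pathlib import Path

from langchain_core.tools import tool

@tool
def run_tests_on_fixed() -> str:
    """Execute the task's tests against workspace/fixed.py inside the sandbox."""
    # Illustrative test command; the repo's real invocation may differ.
    result = run_in_sandbox("python tests.py", Path("workspace"))
    status = "PASSED" if result.returncode == 0 else "FAILED"
    return f"{status}\n--- stdout ---\n{result.stdout}\n--- stderr ---\n{result.stderr}"
```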

Regardless of the outcome of this submission, thank you for sharing this project. It was fun tinkering with LangGraph and Docker to get everything working, and then shaping and polishing the module.