This repo is part of the internship application for the "Evaluation of LLM-Based Agentic Systems for Software Development Tasks" project at JetBrains.

PyFixik

Evaluation of an LLM‑driven Python bug-fixing agent on HumanEval

Results

Log file: /home/mcseem/PycharmProjects/PyFixik/logs/humaneval_qwen3_1.7b_20250926_112412.json
Session: humaneval_qwen3_1.7b_20250926_112412
  created_utc: 2025-09-26T11:24:12Z
  model: qwen3:1.7b
  temperature: 0.8
  k_runs: 1

Totals: tasks=164 runs=164
pass@1: 25.00%
Averages over runs:
  time_elapsed_sec: 22.357
  n_rounds: 6.738
  tool_calls_total: 3.037
  tool_calls_by_tool (avg per run):
    get_buggy_code: 1.207
    run_tests_on_fixed: 0.780
    write_fixed_solution: 0.659
    get_tests: 0.232
    run_tests_on_buggy: 0.140
    read_fixed_solution: 0.018
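
Since k_runs=1 in this session, pass@1 is simply the fraction of tasks whose single run passed all tests. A minimal sketch of that computation (the "passed" field name is an assumption, not the actual log schema):

# Minimal sketch of the pass@1 computation for k_runs=1.
# The "passed" field name is an assumption, not the actual log schema.
def pass_at_1(runs: list[dict]) -> float:
    passed = sum(1 for r in runs if r.get("passed"))
    return passed / len(runs) if runs else 0.0

# 41 passing runs out of 164 -> 0.25, reported as 25.00%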

Project structure

The project has the following components:

CLI User Scripts

  • run_eval.py: run the benchmark end-to-end. Arguments:
    • --model MODEL: Ollama model name to use
    • --temperature TEMP: sampling temperature for the LLM
    • --k-runs: how many independent attempts per task to run
    • --verbosity: stdout verbosity
  • aggregate_logs.py: aggregate a session log and print a summary with metrics. Arguments:
    • --file, -f FILE: path to an aggregated JSON log
    • --latest: pick the most recent log automatically
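
The flags above map onto a standard argparse setup. A rough, illustrative sketch (defaults mirror the Quick Start examples below, not necessarily the script's actual defaults):

# Illustrative CLI sketch for run_eval.py; defaults mirror the Quick Start
# examples in this README and may differ from the real script.
import argparse

parser = argparse.ArgumentParser(description="Run the HumanEval bug-fixing benchmark")
parser.add_argument("--model", default="qwen3:1.7b", help="Ollama model name to use")
parser.add_argument("--temperature", type=float, default=0.8, help="Sampling temperature for the LLM")
parser.add_argument("--k-runs", type=int, default=1, help="Independent attempts per task")
parser.add_argument("--verbosity", type=int, default=0, help="Stdout verbosity")
args = parser.parse_args()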

Agent and Evaluation Code

  • eval.py - Dataset loader and evaluator loop. Seeds workspace/, runs the agent, executes tests in sandbox, and logs results.
  • agent.py - Builds a LangGraph ReAct agent backed by langchain‑ollama ChatOllama and wired with the tools defined below (a rough sketch follows this list).
  • tools.py - Tool functions the agent can call.
  • sandbox.py - Minimal, Docker‑based Python sandbox (no network, CPU/mem/PID limits) used to run code and tests safely.
  • logger.py - Writes per‑run jsonl and keeps an up‑to‑date aggregated json file with session metadata and all runs.
  • workspace/ - Scratch directory used by the tools:
    • buggy.py - seeded with the buggy function for the current task
    • tests.py - unit tests for the task
    • fixed.py - the agent’s proposed fix
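
A rough sketch of the agent construction described above, assuming the tool functions are importable from tools.py (the repo's exact wiring may differ):

# Illustrative sketch of agent.py: ChatOllama + LangGraph's prebuilt ReAct agent.
# The tool import layout and function signature are assumptions.
from langchain_ollama import ChatOllama
from langgraph.prebuilt import create_react_agent

from tools import (
    get_buggy_code, get_tests, write_fixed_solution,
    read_fixed_solution, run_tests_on_buggy, run_tests_on_fixed,
)

def build_agent(model: str = "qwen3:1.7b", temperature: float = 0.8):
    llm = ChatOllama(model=model, temperature=temperature)
    tools = [get_buggy_code, get_tests, write_fixed_solution,
             read_fixed_solution, run_tests_on_buggy, run_tests_on_fixed]
    # create_react_agent compiles a graph that alternates model calls and tool
    # executions until the model answers without requesting another tool.
    return create_react_agent(llm, tools)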

Dependencies

Python (via requirements.txt):

langchain-ollama
langgraph~=0.6.7
langchain-core~=0.3.76
datasets~=4.1.1
tqdm~=4.66.5

Non‑Python:

  • Docker (required for sandboxed execution)
  • Ollama (local LLM runtime), with a suitable model pulled (default used in examples: qwen3:1.7b)

Quick Start

Install system deps

  • Install Docker and ensure your user can run it (docker run hello-world).
  • Install Ollama and pull a model, e.g.: ollama pull qwen3:1.7b.

Create a Python environment and install packages

python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt

Run the evaluation

python run_eval.py --model qwen3:1.7b --temperature 0.8 --k-runs 1 --verbosity 0

Aggregate results

python aggregate_logs.py --latest
# Or specify a file:
python aggregate_logs.py --file logs/humaneval_<...>.json

How it works (high-level)

Evaluation Loop

For each task from the dataset:

  • The workspace/ subdirectory is seeded with the buggy implementation and tests from the dataset.
  • A ReAct agent is built, with access to the tools from tools.py.
  • The agent attempts to solve the task, writing the solution to workspace/fixed.py.
  • Each attempt (“run”) is logged by logger.py; this is repeated k_runs times per task.
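
In pseudo-Python, the loop looks roughly like this (the helper names are hypothetical placeholders, not the actual functions in eval.py):

# Condensed, illustrative version of the evaluation loop.
# seed_workspace, prompt_for, run_tests_in_sandbox and log_run are hypothetical
# placeholders for what eval.py and logger.py actually do.
for task in dataset:
    seed_workspace(task)                 # write buggy.py and tests.py into workspace/
    for run_idx in range(k_runs):
        agent = build_agent(model, temperature)
        result = agent.invoke({"messages": [("user", prompt_for(task))]})
        passed = run_tests_in_sandbox("fixed.py")   # sandboxed test execution
        log_run(task, run_idx, passed, result)      # appended to the session log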

Sandboxed Execution

Code execution happens inside a python:3.11-alpine Docker container with the following restrictions:

  • network disabled
  • CPU/memory/PID limits
  • read‑only root
  • /tmp as tmpfs
  • workspace mounted read‑write at /workspace
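
These restrictions map onto standard docker run flags. A sketch of how sandbox.py might invoke the container (the resource limit values are illustrative, not copied from the repo):

# Illustrative sandbox invocation; the limit values and exact command line are
# assumptions based on the constraints listed above.
import subprocess

def run_in_sandbox(workspace_dir: str, command: str, timeout: int = 60):
    docker_cmd = [
        "docker", "run", "--rm",
        "--network", "none",                      # network disabled
        "--read-only",                            # read-only root filesystem
        "--tmpfs", "/tmp",                        # writable /tmp as tmpfs
        "--memory", "512m",                       # example memory limit
        "--cpus", "1",                            # example CPU limit
        "--pids-limit", "64",                     # example PID limit
        "-v", f"{workspace_dir}:/workspace:rw",   # workspace mounted read-write
        "-w", "/workspace",
        "python:3.11-alpine",
        "sh", "-c", command,
    ]
    return subprocess.run(docker_cmd, capture_output=True, text=True, timeout=timeout)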

Tools

The following tools give the agent access to the context it needs to solve the task:

  • get_buggy_code: read workspace/buggy.py
  • get_tests: read workspace/tests.py
  • write_fixed_solution: write workspace/fixed.py
  • read_fixed_solution: read workspace/fixed.py
  • run_tests_on_buggy: execute tests against buggy.py (in a sandbox)
  • run_tests_on_fixed: execute tests against fixed.py (in a sandbox)
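
For example, a read-only tool like get_buggy_code could be defined with LangChain's @tool decorator, roughly as follows (a sketch; tools.py may implement it differently):

# Sketch of a tool definition; tools.py may differ in details.
from pathlib import Path
from langchain_core.tools import tool

WORKSPACE = Path("workspace")

@tool
def get_buggy_code() -> str:
    """Return the contents of workspace/buggy.py (the buggy implementation)."""
    return (WORKSPACE / "buggy.py").read_text()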

Post-Work Thoughts

Regardless of the outcome of this submission, thank you for sharing this project. It was fun tinkering with LangGraph and Docker to get everything working, and then polishing the module.
