This repo is part of the internship application for the "Evaluation of LLM-Based Agentic Systems for Software Development Tasks" project at JetBrains.

Evaluation of an LLM-driven Python bug-fixing agent on HumanEval.

Example results from a single evaluation session:

- Log file: `/home/mcseem/PycharmProjects/PyFixik/logs/humaneval_qwen3_1.7b_20250926_112412.json`
- Session: `humaneval_qwen3_1.7b_20250926_112412`
- created_utc: 2025-09-26T11:24:12Z
- model: `qwen3:1.7b`
- temperature: 0.8
- k_runs: 1
- Totals: tasks=164, runs=164
- pass@1: 25.00%
- Averages over runs:
  - time_elapsed_sec: 22.357
  - n_rounds: 6.738
  - tool_calls_total: 3.037
- tool_calls_by_tool (avg per run):
  - get_buggy_code: 1.207
  - run_tests_on_fixed: 0.780
  - write_fixed_solution: 0.659
  - get_tests: 0.232
  - run_tests_on_buggy: 0.140
  - read_fixed_solution: 0.018
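
Since `k_runs` is 1 here, pass@1 is simply the fraction of tasks whose proposed fix passes the tests (41 of 164). For `k_runs` > 1, the unbiased pass@k estimator from the original HumanEval evaluation could be applied per task; the sketch below is illustrative and not necessarily how `aggregate_logs.py` computes the metric.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: n attempts for a task, c of them passing."""
    if n - c < k:
        return 1.0  # every size-k sample contains at least one passing attempt
    return 1.0 - comb(n - c, k) / comb(n, k)

# With a single run per task this reduces to the plain success rate:
assert pass_at_k(n=1, c=1, k=1) == 1.0
assert pass_at_k(n=1, c=0, k=1) == 0.0
```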

The project has the following components:

| Script | Purpose | Arguments |
|---|---|---|
| `run_eval.py` | Run the benchmark end-to-end | `--model MODEL`: Ollama model name to use<br>`--temperature TEMP`: sampling temperature for the LLM<br>`--k-runs`: how many independent attempts per task to run<br>`--verbosity`: stdout verbosity |
| `aggregate_logs.py` | Aggregate a session log and print a summary with metrics | `--file, -f FILE`: path to the aggregated JSON log<br>`--latest`: pick the most recent log automatically |

- `eval.py` - Dataset loader and evaluator loop. Seeds `workspace/`, runs the agent, executes tests in the sandbox, and logs results.
- `agent.py` - Builds a LangGraph ReAct agent backed by langchain-ollama `ChatOllama` and wired with the tools defined below.
- `tools.py` - Tool functions the agent can call.
- `sandbox.py` - Minimal, Docker-based Python sandbox (no network, CPU/mem/PID limits) used to run code and tests safely.
- `logger.py` - Writes per-run JSONL and keeps an up-to-date aggregated JSON file with session metadata and all runs.
- `workspace/` - Scratch directory used by the tools:
  - `buggy.py` - seeded with the buggy function for the current task
  - `tests.py` - unit tests for the task
  - `fixed.py` - the agent's proposed fix
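
For orientation, here is a minimal sketch of how such an agent can be wired together with `create_react_agent` from LangGraph and `ChatOllama` from langchain-ollama. The tool bodies, prompt, and model settings are illustrative and not copied from `agent.py` or `tools.py`.

```python
from pathlib import Path

from langchain_core.tools import tool
from langchain_ollama import ChatOllama
from langgraph.prebuilt import create_react_agent

WORKSPACE = Path("workspace")

@tool
def get_buggy_code() -> str:
    """Return the contents of workspace/buggy.py."""
    return (WORKSPACE / "buggy.py").read_text()

@tool
def write_fixed_solution(code: str) -> str:
    """Write the proposed fix to workspace/fixed.py."""
    (WORKSPACE / "fixed.py").write_text(code)
    return "wrote workspace/fixed.py"

llm = ChatOllama(model="qwen3:1.7b", temperature=0.8)
agent = create_react_agent(llm, tools=[get_buggy_code, write_fixed_solution])

result = agent.invoke(
    {"messages": [("user", "Read the buggy code, fix the bug, and write the fixed solution.")]}
)
print(result["messages"][-1].content)
```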

Python (via `requirements.txt`):

- langchain-ollama
- langgraph~=0.6.7
- langchain-core~=0.3.76
- datasets~=4.1.1
- tqdm~=4.66.5

Non-Python:

- Docker (required for sandboxed execution)
- Ollama (local LLM runtime), with a suitable model pulled (default used in the examples: `qwen3:1.7b`)

Install system deps:

- Install Docker and ensure your user can run it (`docker run hello-world`).
- Install Ollama and pull a model, e.g. `ollama pull qwen3:1.7b`.

Create a Python environment and install packages:

- `python -m venv .venv && source .venv/bin/activate`
- `pip install -r requirements.txt`

Run the evaluation:

- `python run_eval.py --model qwen3:1.7b --temperature 0.8 --k-runs 1 --verbosity 0`

Aggregate results:

- `python aggregate_logs.py --latest`
- Or specify a file: `python aggregate_logs.py --file logs/humaneval_<...>.json`

For each task from the dataset:

- The `workspace/` subdirectory is seeded with the buggy implementation and tests from the dataset.
- A ReAct agent is built with access to the tools from `tools.py`.
- The agent attempts to solve the task, writing the solution to `workspace/fixed.py`.
- Each attempt ("run") is logged by `logger.py`.

This is repeated for `k_runs` attempts per task.
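
A minimal sketch of the dataset-seeding step, assuming HumanEval is loaded from the Hugging Face hub via `datasets`; how `eval.py` actually derives the buggy variant of each solution is specific to the repo and not shown here.

```python
from pathlib import Path

from datasets import load_dataset

workspace = Path("workspace")
workspace.mkdir(exist_ok=True)

dataset = load_dataset("openai_humaneval", split="test")  # 164 tasks
task = dataset[0]

# Seed the task's unit tests; buggy.py would hold a deliberately broken
# implementation of the function described in task["prompt"].
(workspace / "tests.py").write_text(task["test"])
```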

Code execution happens inside a Docker container (`python:3.11-alpine`) with:

- network disabled
- CPU/memory/PID limits
- read-only root filesystem
- `/tmp` as tmpfs
- the workspace mounted read-write at `/workspace`
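
As a rough illustration of what `sandbox.py` does, a `docker run` invocation with these restrictions could look like the sketch below; the exact flags, limits, and test command in the repo may differ.

```python
import subprocess
from pathlib import Path

def run_in_sandbox(command: str, workspace: Path, timeout: int = 60) -> subprocess.CompletedProcess:
    """Run a shell command inside a locked-down python:3.11-alpine container."""
    return subprocess.run(
        [
            "docker", "run", "--rm",
            "--network", "none",      # network disabled
            "--memory", "256m",       # memory limit
            "--cpus", "1",            # CPU limit
            "--pids-limit", "64",     # PID limit
            "--read-only",            # read-only root filesystem
            "--tmpfs", "/tmp",        # /tmp as tmpfs
            "-v", f"{workspace.resolve()}:/workspace:rw",  # workspace mounted read-write
            "-w", "/workspace",
            "python:3.11-alpine",
            "sh", "-c", command,
        ],
        capture_output=True,
        text=True,
        timeout=timeout,
    )

# Example: run_in_sandbox("python tests.py", Path("workspace"))
```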

The following tools expose the agent to the context it needs to solve the task:

- `get_buggy_code`: read `workspace/buggy.py`
- `get_tests`: read `workspace/tests.py`
- `write_fixed_solution`: write `workspace/fixed.py`
- `read_fixed_solution`: read `workspace/fixed.py`
- `run_tests_on_buggy`: execute tests against `buggy.py` (in a sandbox)
- `run_tests_on_fixed`: execute tests against `fixed.py` (in a sandbox)
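
To show how these tools connect to the sandbox, here is one illustrative implementation in the spirit of `tools.py`, reusing the `run_in_sandbox` helper sketched above; the actual tool bodies in the repo may differ.

```python
from pathlib import Path

from langchain_core.tools import tool

@tool
def run_tests_on_fixed() -> str:
    """Execute the task's tests against workspace/fixed.py inside the sandbox."""
    # Illustrative test command; the repo's real invocation may differ.
    result = run_in_sandbox("python tests.py", Path("workspace"))
    status = "PASSED" if result.returncode == 0 else "FAILED"
    return f"{status}\n--- stdout ---\n{result.stdout}\n--- stderr ---\n{result.stderr}"
```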

Regardless of the outcome of this submission, thank you for sharing this project. It was fun tinkering with LangGraph and Docker to get everything working, and then shaping and polishing the module.