4 changes: 2 additions & 2 deletions README.md
@@ -5,7 +5,7 @@
We introduce PostTrainBench, a benchmark that measures the ability of CLI agents to post-train pre-trained large language models (LLMs). In PostTrainBench, the agent's task is to improve the performance of a base LLM on a given benchmark. The agent is given access to an evaluation script and 10 hours on an H100 GPU. Performance is measured by the benchmark score of the post-trained LLM. This setup naturally evaluates an agent's ability to conduct AI R&D.

> [!IMPORTANT]
-> **Harbor support coming soon!** This repository currently targets our internal HPC cluster (HTCondor). We are adding [Harbor](https://github.com/harbor-framework/harbor) support to make it straightforward to run on rented hardware (e.g., cloud GPUs). See our [PR](https://github.com/aisa-group/PostTrainBench/pull/8).
+> **Harbor support is available in `src/harbor_adapter`.** The adapter generates [Harbor](https://github.com/harbor-framework/harbor) tasks for running PostTrainBench on rented hardware (e.g., cloud GPUs), while the existing scripts continue to target our internal HPC cluster (HTCondor).

## Leaderboard

@@ -94,7 +94,7 @@ The `.env` file contains API keys and configuration. See `example.env` for all a

Environment variables already set in your shell take precedence over `.env` values.

-Currently, we only support the HTCondor job scheduler. [Harbor](https://github.com/harbor-framework/harbor) support is planned.
+HTCondor remains the primary supported scheduler for the main job scripts. Harbor task generation is available in `src/harbor_adapter` for cloud GPU runs.

#### API-based agents

180 changes: 180 additions & 0 deletions src/harbor_adapter/README.md
@@ -0,0 +1,180 @@
# PostTrainBench Harbor Adapter

This adapter generates [Harbor](https://harborframework.com)-compatible tasks for running PostTrainBench evaluations on cloud GPUs.

## Supported Benchmarks

| Benchmark ID | Name | Type | Notes |
|-------------|------|------|-------|
| gsm8k | GSM8K (Grade School Math 8K) | inspect-ai | |
| humaneval | HumanEval | inspect-ai | |
| aime2025 | AIME 2025 | inspect-ai | |
| gpqamain | GPQA | inspect-ai | |
| bfcl | Berkeley Function Calling Leaderboard | inspect-ai | Includes `bfcl_evaluation_code.py` via task_context |
| arenahardwriting | Arena-Hard-v2.0 (Writing) | vLLM + OpenAI-compatible judge | Requires `OPENAI_API_KEY` for agent eval |
| healthbench | HealthBench | vLLM + OpenAI-compatible judge | Requires `OPENAI_API_KEY` for agent eval |

## Supported Models

| Key | HuggingFace Model ID |
|-----|---------------------|
| qwen3-1.7b | Qwen/Qwen3-1.7B-Base |
| qwen3-4b | Qwen/Qwen3-4B-Base |
| smollm3-3b | HuggingFaceTB/SmolLM3-3B-Base |
| gemma3-4b | google/gemma-3-4b-pt |

Total: **28 tasks** (7 benchmarks x 4 models).
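
The count follows from pairing every benchmark ID with every model key. The sketch below is illustrative only: the two lists are copied from the tables above rather than imported from the adapter, the helper `task_names` is hypothetical, and the `posttrainbench-<benchmark>-<model>` naming is assumed from the directory name used in the Quick Start example.

```python
from itertools import product

# Copied from the tables above; not the adapter's own BENCHMARKS/MODELS objects.
BENCHMARK_IDS = [
    "gsm8k", "humaneval", "aime2025", "gpqamain",
    "bfcl", "arenahardwriting", "healthbench",
]
MODEL_KEYS = ["qwen3-1.7b", "qwen3-4b", "smollm3-3b", "gemma3-4b"]


def task_names():
    """Return one assumed task directory name per (benchmark, model) pair."""
    return [
        f"posttrainbench-{benchmark}-{model}"
        for benchmark, model in product(BENCHMARK_IDS, MODEL_KEYS)
    ]


if __name__ == "__main__":
    names = task_names()
    print(len(names), "tasks, first:", names[0])
```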

## Installation

The adapter itself uses only the Python standard library. Install Harbor and
its backend dependencies separately, in the environment where you will run the
generated tasks.

## Quick Start

### 1. Generate tasks

```bash
# Generate a single task
uv run python3 src/harbor_adapter/run_adapter.py \
    --benchmark gsm8k \
    --model qwen3-1.7b \
    --output ./tasks

# Or generate all 28 task combinations
uv run python3 src/harbor_adapter/run_adapter.py --all --output ./tasks

# List available benchmarks and models
uv run python3 src/harbor_adapter/run_adapter.py --list
```

### 2. Set API keys

```bash
modal setup # Modal cloud setup
export ANTHROPIC_API_KEY=<your-key> # For Claude agent
export OPENAI_API_KEY=<your-key> # OpenAI-compatible key for Codex judge + selected evals
export OPENAI_BASE_URL=https://api.pinference.ai/api/v1
```

### 3. Run with Harbor

```bash
harbor run \
    --path ./tasks/posttrainbench-gsm8k-qwen3-1.7b \
    --agent claude-code \
    --model anthropic/claude-sonnet-4 \
    --env modal
```

## API Key Requirements

| Key | Used By | Required For |
|-----|---------|-------------|
| `ANTHROPIC_API_KEY` | Agent (Claude) | All benchmarks |
| `OPENAI_API_KEY` | Codex judge, evaluation judge | All benchmarks (judge), arenahardwriting/healthbench (agent eval) |
| `OPENAI_BASE_URL` | OpenAI-compatible endpoint | Defaults to Pinference in generated tasks |
| `CODEX_MODEL` | Codex judge model | Defaults to `openai/gpt-5.1-codex` |

- The verifier receives the value of `OPENAI_API_KEY` under two names, `OPENAI_API_KEY` and `CODEX_API_KEY`, because the Codex CLI reads `CODEX_API_KEY`.
- For arenahardwriting and healthbench, `OPENAI_API_KEY` and `OPENAI_BASE_URL` are also passed to the agent environment since their `evaluate.py` scripts call an OpenAI-compatible API for judging.
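
The forwarding rules in the two bullets above can be sketched as follows. This is a hedged illustration, not the adapter's code: `verifier_env`, `agent_env`, and `JUDGED_IN_AGENT_EVAL` are hypothetical names, and the sketch covers only the OpenAI-related variables (how the agent's own key such as `ANTHROPIC_API_KEY` is passed is left out).

```python
# Benchmarks whose evaluate.py calls an OpenAI-compatible judge (from the table above).
JUDGED_IN_AGENT_EVAL = {"arenahardwriting", "healthbench"}


def verifier_env(source):
    """Expose the OpenAI key under both names the verifier is described as using."""
    key = source["OPENAI_API_KEY"]
    return {"OPENAI_API_KEY": key, "CODEX_API_KEY": key}


def agent_env(benchmark, source):
    """Forward OpenAI settings to the agent only for judge-based benchmarks."""
    env = {}
    if benchmark in JUDGED_IN_AGENT_EVAL:
        for name in ("OPENAI_API_KEY", "OPENAI_BASE_URL"):
            if name in source:
                env[name] = source[name]
    return env


if __name__ == "__main__":
    import os

    shell = dict(os.environ)
    shell.setdefault("OPENAI_API_KEY", "placeholder-key")  # placeholder for the demo
    print(sorted(verifier_env(shell)))
    print(sorted(agent_env("healthbench", shell)))
```

Keeping the agent environment empty for the other benchmarks means the judge key is only visible to the agent where its evaluation script needs it.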

## Task Structure

Each generated task follows Harbor's standard format:

```
posttrainbench-gsm8k-qwen3-1.7b/
├── task.toml                        # Task configuration (GPU, timeout, env vars)
├── instruction.md                   # Instructions for the agent
├── environment/
│   ├── Dockerfile                   # Container definition (CUDA + vLLM + ML packages)
│   ├── .dockerignore                # Excludes Dockerfile from COPY
│   ├── evaluate.py                  # Benchmark evaluation script
│   ├── timer.sh                     # Countdown timer backed by /timer_start
│   ├── metadata.json                # Benchmark/model metadata for verifier
│   ├── templates/                   # Chat templates for different models
│   ├── evaluation_code/             # (arenahardwriting, healthbench only)
│   └── bfcl_evaluation_code.py      # (bfcl only, from task_context)
└── tests/
    ├── test.sh                      # Verifier: judge-v2 + 3-phase eval retry
    ├── src/
    │   ├── disallowed_usage_judge/  # Judge-v2 prompts and tools
    │   ├── trace_parsing/           # Agent/Codex trace parsers
    │   └── eval/tasks/<benchmark>/  # Judge metadata and optional test data
    └── ...                          # Untampered verifier copy of eval files
```

## Evaluation Retry Logic

The verifier (`test.sh`) uses a 3-phase evaluation retry strategy matching `run_task.sh`:

| Phase | Max Attempts | Token Limits |
|-------|-------------|-------------|
| 1 | 4 | Default |
| 2 | 3 | Reduced (see below) |
| 3 | 2 | Further reduced (see below) |

Token limits per benchmark:

| Benchmark | Phase 2 | Phase 3 |
|-----------|---------|---------|
| aime2025 | `--max-tokens 12000` | `--max-tokens 8000` |
| arenahardwriting | `--max-new-tokens 12288` | `--max-new-tokens 8192` |
| bfcl | `--max-tokens 12000` | `--max-tokens 8000` |
| gpqamain | `--max-tokens 12000` | `--max-tokens 8000` |
| gsm8k | `--max-tokens 3000` | `--max-tokens 2000` |
| healthbench | `--max-new-tokens 12288` | `--max-new-tokens 8192` |
| humaneval | `--max-tokens 3000` | `--max-tokens 2000` |

GPU processes are killed between attempts to free VRAM.
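
The two tables above combine into a fixed schedule of at most nine attempts per benchmark. The sketch below builds that schedule; it is an illustration under stated assumptions, not a port of `test.sh`. The name `retry_schedule` is hypothetical, and the stopping rule (presumably the first attempt that produces valid metrics ends the loop) is not modelled here.

```python
# Max attempts for phases 1, 2, 3 (from the phase table above).
PHASE_ATTEMPTS = [4, 3, 2]

# Flag name plus phase-2 and phase-3 limits (from the per-benchmark table above).
TOKEN_LIMITS = {
    "aime2025": ("--max-tokens", 12000, 8000),
    "arenahardwriting": ("--max-new-tokens", 12288, 8192),
    "bfcl": ("--max-tokens", 12000, 8000),
    "gpqamain": ("--max-tokens", 12000, 8000),
    "gsm8k": ("--max-tokens", 3000, 2000),
    "healthbench": ("--max-new-tokens", 12288, 8192),
    "humaneval": ("--max-tokens", 3000, 2000),
}


def retry_schedule(benchmark):
    """Return (phase, attempt, extra_args) tuples in the order they would be tried."""
    flag, phase2, phase3 = TOKEN_LIMITS[benchmark]
    # Phase 1 uses the evaluation script's default limit, so no extra flag.
    extra_args = [[], [flag, str(phase2)], [flag, str(phase3)]]
    schedule = []
    for phase, (attempts, args) in enumerate(zip(PHASE_ATTEMPTS, extra_args), start=1):
        for attempt in range(1, attempts + 1):
            schedule.append((phase, attempt, args))
    return schedule


if __name__ == "__main__":
    for phase, attempt, args in retry_schedule("gsm8k"):
        print(f"phase {phase} attempt {attempt}: extra args {args}")
```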

## Judge V2

The verifier runs the same judge-v2 flow as `src/disallowed_usage_judge/run_judge.sh`,
but directly inside the Harbor verifier container:

```bash
codex --search -a never exec --json \
    -c model_reasoning_summary=detailed \
    -c model_reasoning_effort=xhigh \
    --skip-git-repo-check --yolo \
    --model "${CODEX_MODEL:-openai/gpt-5.1-codex}" \
    "$JUDGE_PROMPT"
```

It runs two judges:

- The canonical contamination/base-model judge writes `/logs/verifier/judgement_gpt5_4.json`.
- The API usage judge writes `/logs/verifier/judgement_api.json` for audit/postmortem use.

The reward is zeroed only when `judgement_gpt5_4.json` flags contamination or
disallowed base-model usage.
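
The gating rule can be stated as a small function. This is a sketch only: `gated_reward` is a hypothetical name, and the two booleans stand in for whatever the verifier parses out of the judgement files, whose JSON schema is not documented here.

```python
def gated_reward(accuracy, canonical_flagged, api_flagged=False):
    """Zero the reward only on a canonical contamination/base-model flag.

    api_flagged is accepted but deliberately ignored: the API usage verdict
    is described as audit/postmortem output, not as an input to the reward.
    """
    return 0.0 if canonical_flagged else accuracy


if __name__ == "__main__":
    print(gated_reward(0.61, canonical_flagged=False, api_flagged=True))
    print(gated_reward(0.61, canonical_flagged=True))
```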

## Timer

The timer reads `/timer_start`, which the Harbor healthcheck writes immediately
before the agent launches. That keeps the countdown tied to the actual run
window rather than image build or container boot time.
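
The countdown arithmetic can be sketched as below. The real timer is `timer.sh`; this Python version is only an illustration and assumes that `/timer_start` holds a Unix timestamp in seconds, which the text above does not confirm. The function names are hypothetical.

```python
import time
from pathlib import Path


def remaining_seconds(start, now, num_hours=10.0):
    """Seconds left in the run window, never negative."""
    return max(0, int(start + num_hours * 3600 - now))


def format_hms(seconds):
    """Render a second count as HH:MM:SS."""
    hours, rest = divmod(seconds, 3600)
    minutes, secs = divmod(rest, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


if __name__ == "__main__":
    marker = Path("/timer_start")
    if marker.exists():
        # Assumption: the file contains epoch seconds as plain text.
        start = float(marker.read_text().strip())
        print(format_hms(remaining_seconds(start, time.time())))
    else:
        print("timer has not started")
```

Because the start time comes from the marker file rather than from container boot, time spent building the image does not count against the agent's budget.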

## Configuration

| Setting | Default | Notes |
|---------|---------|-------|
| GPU | 1x H100 | Configured in task.toml |
| Memory | 64 GB | |
| Storage | 100 GB | |
| Agent timeout | 10 hours | Adjustable via `--num-hours` |
| Verifier timeout | 3 hours | Accommodates 3-phase retry |
| Internet | Enabled | |

## Scoring

The verifier extracts the accuracy metric from `metrics.json` as the reward (0-1 scale). Results are stored in:
- `/logs/verifier/metrics.json` - Full evaluation metrics
- `/logs/verifier/reward.txt` - Accuracy score
- `/logs/verifier/judgement_gpt5_4.json` - Canonical contamination/base-model verdict
- `/logs/verifier/judgement_api.json` - API usage verdict
- `/logs/verifier/judge_output_gpt5_4.txt` - Parsed canonical judge transcript
- `/logs/verifier/judge_output_api.txt` - Parsed API judge transcript
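
A minimal sketch of the extraction step, assuming `metrics.json` stores the score under a top-level `accuracy` key (the key name is an assumption; check the benchmark's `evaluate.py` output). The helper `write_reward` is hypothetical and takes the log directory as a parameter so it can be tried outside the container.

```python
import json
from pathlib import Path


def write_reward(log_dir):
    """Read accuracy from metrics.json, validate it, and write reward.txt."""
    log_dir = Path(log_dir)
    metrics = json.loads((log_dir / "metrics.json").read_text())
    accuracy = float(metrics["accuracy"])  # assumed key name
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError(f"accuracy {accuracy} is outside the 0-1 reward scale")
    (log_dir / "reward.txt").write_text(f"{accuracy}\n")
    return accuracy


if __name__ == "__main__":
    import tempfile

    with tempfile.TemporaryDirectory() as tmp:
        Path(tmp, "metrics.json").write_text(json.dumps({"accuracy": 0.5}))
        print(write_reward(tmp))
```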
15 changes: 15 additions & 0 deletions src/harbor_adapter/__init__.py
@@ -0,0 +1,15 @@
"""PostTrainBench Harbor Adapter - Generate Harbor tasks for LLM post-training evaluation."""

from .adapter import (
    PostTrainBenchAdapter,
    BENCHMARKS,
    MODELS,
    list_available_tasks,
)

__all__ = [
    "PostTrainBenchAdapter",
    "BENCHMARKS",
    "MODELS",
    "list_available_tasks",
]