CommitUp is an LLM-based test suite evolution tool for Java projects.
Requirements:
- Python 3.11+
- Java 1.8
- Java 15+ is required for call graph generation (`java-callgraph-1.0-SNAPSHOT-jar-with-dependencies.jar`)
- Maven 3.2.5+
- Git
- uv
Optional (model backend):
- `qwen7b`: local vLLM (`http://localhost:8000/v1`)
- `qwen7b-ollama`: local Ollama + `qwen2.5-coder:latest`
Install:

```shell
uv python install 3.11
uv sync
mkdir -p exp_repos
bash exp_repos/clone_repos.sh
```

Place a local bge-reranker-v2-gemma model directory, for example:

```
<repo_root>/models/bge-reranker-v2-gemma
```
```shell
cp .env.template .env
```

Recommended `.env` (absolute paths):

```
ROOT_PATH="/absolute/path/to/CommitUp"
JAVA_REPO_PATH="/absolute/path/to/CommitUp/exp_repos"
RERANKER_PATH="/absolute/path/to/CommitUp/models/bge-reranker-v2-gemma"
DATASET_PATH="/absolute/path/to/CommitUp/data/benchmark.json"
DEEPSEEK_API_KEY="your_deepseek_api_key"
OPENAI_API_URL="https://your-openai-compatible-endpoint/v1"
OPENAI_API_KEY="your_openai_api_key"
POM_PATH=""
```

Project layout:

```
CommitUp/
│
├── README.md
├── README_EN.md
├── main.py
├── components.py
├── data/
│   ├── benchmark.json
│   └── load_data.py
├── core/
│   ├── agents/
│   ├── env/
│   ├── llms/
│   ├── prompts/
│   └── rag/
├── exp_repos/
│   └── clone_repos.sh
└── run_logs/
```
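The `.env` file is plain `KEY="value"` pairs. As a minimal sketch of how those variables could be read without extra dependencies — `load_env` and `apply_env` are hypothetical helper names, not CommitUp's actual loading code (which may use a library such as python-dotenv):

```python
import os


def load_env(path: str) -> dict[str, str]:
    """Parse simple KEY="value" lines from a .env-style file (illustrative only)."""
    env: dict[str, str] = {}
    with open(path, encoding="utf-8") as fh:
        for raw in fh:
            line = raw.strip()
            # Skip blank lines, comments, and anything without a KEY=VALUE shape.
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            env[key.strip()] = value.strip().strip('"')
    return env


def apply_env(path: str) -> None:
    """Export parsed values, without overwriting variables already set."""
    for key, value in load_env(path).items():
        os.environ.setdefault(key, value)
```

A loader like this keeps the recommended absolute paths available to any subprocess the run spawns.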
Supported models:
- `deepseek-chat`
- `gpt-4o`
- `qwen7b`
- `qwen7b-ollama`
Run one sample:

```shell
uv run -m main --model deepseek-chat --start-index 0 --end-index 1 --temperature 0
```

Run full benchmark:

```shell
uv run -m main --model deepseek-chat --start-index 0 --end-index 248 --temperature 0
```

Use OpenAI-compatible backend:

```shell
uv run -m main --model gpt-4o --start-index 0 --end-index 248 --temperature 0
```

Use local vLLM:

```shell
vllm serve Qwen/Qwen2.5-Coder-7B-Instruct --port 8000
uv run -m main --model qwen7b --start-index 0 --end-index 248 --temperature 0
```

Output files:

```
run_logs/benchmark__<uuid>/<case_index>__<instance_id>/result.json
```
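Given that layout, the per-case results can be gathered with a short script. This is a sketch: `collect_results` is a hypothetical helper name, the glob pattern follows the directory layout above, and the contents of each `result.json` are not assumed here:

```python
import json
from pathlib import Path


def collect_results(run_logs: str) -> list[dict]:
    """Gather every result.json under run_logs/benchmark__<uuid>/<case>__<id>/.

    The directory pattern comes from the output layout above; the JSON
    schema inside each file is passed through untouched.
    """
    results = []
    for path in sorted(Path(run_logs).glob("benchmark__*/*/result.json")):
        with path.open(encoding="utf-8") as fh:
            results.append({"path": str(path), "data": json.load(fh)})
    return results
```

This makes it easy to compare runs, since each invocation writes under a fresh `benchmark__<uuid>` directory.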
Notes:
- The run process executes `git reset --hard` and `git checkout -f <commit>` inside `exp_repos`.
- Do not keep uncommitted work in repositories under `exp_repos`.
- Use absolute paths for `JAVA_REPO_PATH`, `DATASET_PATH`, and `RERANKER_PATH`.
- Always run one sample first before the full benchmark.
- Full runs are long; chunked execution is recommended.
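A chunked run can be sketched as a loop over smaller `--start-index`/`--end-index` windows. The helper name `print_chunks`, the chunk size of 50, and the choice of `deepseek-chat` are illustrative; the script only prints the commands, so they can be reviewed first or piped to `bash`:

```shell
#!/usr/bin/env bash
# Emit one benchmark command per index window (sketch; chunk size is arbitrary).
print_chunks() {
  local total=$1 chunk=$2 start end
  for ((start = 0; start < total; start += chunk)); do
    # Clamp the last window to the total number of cases.
    end=$(( start + chunk < total ? start + chunk : total ))
    echo "uv run -m main --model deepseek-chat --start-index $start --end-index $end --temperature 0"
  done
}

# Cover the full 0-248 range in windows of 50.
print_chunks 248 50
```

Running the printed commands one at a time also limits how much progress is lost if a long run is interrupted.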