Agent Forge

AI-powered test generation agent that analyzes GitHub PR diffs, identifies untested code paths, generates targeted tests, runs them, and self-corrects through a reflexion loop.

Two modes: coverage (fill gaps in existing tests) and mutation (find bugs your tests miss).

Built with LangGraph (Python) using a Plan-and-Execute + Reflexion architecture, inspired by Meta's Automated Compliance Hardening (ACH) research.


Modes

Coverage Mode (default)

Generates tests targeting uncovered code paths. Measures line coverage and improves it.

agent-forge run owner/repo --pr 42
# or explicitly:
agent-forge run owner/repo --pr 42 --mode coverage

Mutation Mode

Injects realistic bugs into changed code, runs existing tests to find which bugs go undetected (surviving mutants), then generates targeted killing tests to catch them. Reports a mutation score, a better quality metric than line coverage.

agent-forge run owner/repo --pr 42 --mode mutation
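
The README does not spell out how the mutation score is computed. A minimal sketch, assuming the conventional definition (killed mutants divided by all non-equivalent mutants) and using hypothetical field names that mirror the report shown later:

```python
from dataclasses import dataclass


@dataclass
class MutationTally:
    """Counts from one mutation run (field names are illustrative, not the project's)."""

    killed_by_existing: int
    killed_by_new_tests: int
    surviving: int


def mutation_score(tally: MutationTally) -> float:
    """Return the fraction of scored mutants that some test killed."""
    killed = tally.killed_by_existing + tally.killed_by_new_tests
    # Assumption: equivalent mutants were filtered out earlier and are not scored.
    scored = killed + tally.surviving
    # Assumption: a run with no scored mutants reports 0.0 rather than raising.
    return killed / scored if scored else 0.0
```

Under this definition, six mutants killed out of six scored gives 1.0, matching the 100% in the example report.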

How It Works

Coverage Pipeline

PR Diff → Code Analysis → Coverage Check → Test Generation → Test Execution → Self-Correction → Report
  1. Fetches the PR diff from GitHub
  2. Parses changed files with tree-sitter to extract method signatures, annotations, and dependencies
  3. Detects the build tool (Gradle/Maven) and existing test coverage
  4. Generates targeted JUnit 5 tests using GPT-4o, based on uncovered code paths
  5. Compiles and runs tests using ./gradlew test
  6. Self-corrects: if tests fail, a critic analyzes errors and the generator fixes them (up to 3 iterations)
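
The self-correction in step 6 can be pictured as a bounded generate/run/retry loop. The sketch below is an illustration only: `reflexion_loop`, `RunResult`, and the two callables are hypothetical stand-ins for the generator, runner, and critic nodes, not the project's actual interfaces.

```python
from typing import Callable, NamedTuple


class RunResult(NamedTuple):
    passed: int
    failed: int
    errors: list[str]  # raw failure messages handed back as feedback


def reflexion_loop(
    generate: Callable[[list[str]], str],
    run: Callable[[str], RunResult],
    max_iterations: int = 3,
) -> tuple[str, RunResult, int]:
    """Regenerate tests until they all pass or the iteration budget is spent."""
    if max_iterations < 1:
        raise ValueError("max_iterations must be at least 1")
    feedback: list[str] = []
    for iteration in range(1, max_iterations + 1):
        tests = generate(feedback)  # first call sees no feedback
        result = run(tests)
        if result.failed == 0:
            break
        # Assumption: a critic would summarise errors; here they pass through as-is.
        feedback = result.errors
    return tests, result, iteration
```

The loop returns the last attempt even when the budget runs out, so a caller can still report partial results.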

Mutation Pipeline (extends coverage)

[Coverage phase completes]
    ↓
Mutation Generator → Equivalence Detector → Mutation Runner
    ↓
Killing Test Generator → 3-Stage Filter → Mutation Critic → Report
  1. Generates mutations: realistic bugs in changed code (off-by-one, wrong operator, missing null check, etc.) using gpt-4o-mini at temperature 0.7, a different model/temperature than the test generator, to prevent AI blind spots
  2. Filters equivalents: LLM-as-judge removes mutations that are logically identical to the original code
  3. Runs mutation tests: injects each mutation, runs all tests, records killed vs. survived
  4. Generates killing tests: for each surviving mutant, writes a test that passes on the original code but fails on the mutant
  5. 3-stage filter: each killing test must compile → pass original → fail mutant
  6. Reflexion loop: up to 2 iterations to fix rejected killing tests
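
The 3-stage filter in step 5 is a short-circuiting sequence of checks. The sketch below shows that ordering under stated assumptions: `Verdict`, `three_stage_filter`, and the three callables are hypothetical names, and each callable stands in for a real compile or test run.

```python
from enum import Enum
from typing import Callable


class Verdict(Enum):
    ACCEPTED = "accepted"
    COMPILE_ERROR = "compile_error"
    FAILS_ON_ORIGINAL = "fails_on_original"
    PASSES_ON_MUTANT = "passes_on_mutant"


def three_stage_filter(
    compiles: Callable[[], bool],
    passes_on_original: Callable[[], bool],
    passes_on_mutant: Callable[[], bool],
) -> Verdict:
    """Accept a killing test only if it compiles, passes on original, and fails on mutant."""
    # Stages are callables so that later, more expensive runs are skipped on early rejection.
    if not compiles():
        return Verdict.COMPILE_ERROR
    if not passes_on_original():
        return Verdict.FAILS_ON_ORIGINAL  # would be a false positive against correct code
    if passes_on_mutant():
        return Verdict.PASSES_ON_MUTANT  # does not actually detect the injected bug
    return Verdict.ACCEPTED
```

Returning a distinct verdict per stage gives the reflexion loop a concrete reason to feed back when a killing test is rejected.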

Prerequisites

  • Python 3.12+
  • GitHub CLI (gh), authenticated
  • Java 21, for running generated tests against Java projects
  • OpenAI API key, for LLM-powered test generation

Setup

1. Clone and install

git clone https://github.com/rkp4u/test_agent.git
cd test_agent
python3 -m venv .venv
source .venv/bin/activate
make dev

2. Configure environment

cp .env.example .env

Edit .env:

OPENAI_API_KEY=sk-proj-your-key-here
GITHUB_TOKEN=ghp_your-token-here   # optional, gh CLI preferred

3. Authenticate GitHub CLI

brew install gh    # macOS
gh auth login
gh auth status     # verify

Usage

Run coverage mode

agent-forge run owner/repo --pr 42

Run mutation mode

agent-forge run owner/repo --pr 42 --mode mutation -v

CLI options

| Option | Description | Default |
| --- | --- | --- |
| --mode | coverage or mutation | coverage |
| --pr, -p | PR number (required) | (none) |
| --max-iterations, -m | Max reflexion iterations | 3 |
| --verbose, -v | Show detailed output | off |
| --dry-run | Analyze without running tests | off |
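
To make the option semantics concrete, here is a sketch of an equivalent parser built with the standard-library `argparse`. This is an assumption for illustration: the README does not say which CLI framework `cli/app.py` uses, and `build_parser` is a hypothetical name. Only the flags and defaults come from the table above.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that mirrors the documented `run` options."""
    parser = argparse.ArgumentParser(prog="agent-forge")
    commands = parser.add_subparsers(dest="command", required=True)

    run = commands.add_parser("run", help="generate tests for a pull request")
    run.add_argument("repo", help="repository in owner/repo form")
    run.add_argument("--pr", "-p", type=int, required=True, help="PR number")
    run.add_argument("--mode", choices=["coverage", "mutation"], default="coverage")
    run.add_argument("--max-iterations", "-m", type=int, default=3)
    run.add_argument("--verbose", "-v", action="store_true")
    run.add_argument("--dry-run", action="store_true")
    return parser
```

Parsing `run owner/repo --pr 42 --mode mutation -v` with this parser yields `mode="mutation"`, `verbose=True`, and the default `max_iterations=3`.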

Example Output

Coverage mode

╭───────────────── Agent Forge - Test Generation ─────────────────╮
│ Repository: rkp4u/agent_demo                                    │
│ PR: #1 - Add TransactionMetrics for investigation performance   │
╰─────────────────────────────────────────────────────────────────╯

  ✓ [1/7] Planning
  ✓ [2/7] Fetching PR diff (1 files changed)
  ✓ [3/7] Analyzing code (tree-sitter)
    ├── 6 new/modified methods found
  ✓ [4/7] Checking existing coverage
  ✓ [5/7] Generating tests
    ├── 1 test files generated
    └── 7 test methods targeting uncovered paths
  ✓ [6/7] Running tests
    ├── 5 passed, 2 failed
    └── Entering reflexion loop...
  ✓ [5/7] Regenerating tests (iteration 2/3)
  ✓ [6/7] Running tests - 7 passed, 0 failed
  ✓ [7/7] Generating report

╭────────────────── Results ───────────────────╮
│ Tests generated:  7 (in 1 files)             │
│ Iterations:       2/3                        │
│ Results:          7 passed, 0 failed ✓       │
╰──────────────────────────────────────────────╯

Mutation mode

  ✓ [1-7] Coverage phase (7 tests, all passing)
  ✓ [8/12] Generating mutations (6 mutants)
  ✓ [9/12] Filtering equivalent mutants (6 remain, 0 removed)
  ✓ [10/12] Running tests against mutants
    ├── 6 killed, 0 survived, 0 build failures
    └── All mutants caught by existing tests!
  ✓ [11/12] Generating killing tests (0 needed)
  ✓ [12/12] Generating report

╭──────────────── Mutation Testing Results ────────────────╮
│   Mutation score:  100% (6/6 mutants killed)             │
│ Killed by existing: 6                                    │
│ Killed by new tests: 0                                   │
│    Still surviving: 0                                    │
│  Equivalent (filtered): 0                                │
╰──────────────────────────────────────────────────────────╯

Architecture

Full graph (mutation mode)

START → Planner → Diff Fetcher → Code Analyzer → Coverage Checker
    → Test Generator → Test Runner → Critic
                                         ↙ (fail, max 3)    ↘ (pass)
                               Test Generator           [MODE ROUTER]
                                                      ↙               ↘
                                              (coverage)           (mutation)
                                                  ↓                    ↓
                                              Reporter        Mutation Generator
                                                 ↓                    ↓
                                                END         Equivalence Detector
                                                                       ↓
                                                            Mutation Runner
                                                                       ↓
                                                       Killing Test Generator
                                                                       ↓
                                                        Killing Test Runner
                                                                       ↓
                                                          Mutation Critic
                                                        ↙ (retry, max 2)  ↘ (done)
                                             Killing Test Generator      Reporter → END
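
The two branch points in the upper half of the graph reduce to small routing functions. The sketch below is pure Python with hypothetical state keys and node names. In LangGraph, functions of this shape are typically registered with `StateGraph.add_conditional_edges`, but the project's actual wiring in `graph.py` is not shown in this README.

```python
from typing import TypedDict


class RouteState(TypedDict):
    """Subset of the agent state needed for routing (key names are assumptions)."""

    mode: str
    tests_failed: int
    iteration: int
    max_iterations: int


def route_after_critic(state: RouteState) -> str:
    """Loop back to the generator on failure while the retry budget lasts."""
    if state["tests_failed"] > 0 and state["iteration"] < state["max_iterations"]:
        return "test_generator"
    # Assumption: once tests pass or retries are exhausted, control moves on.
    return "mode_router"


def mode_router(state: RouteState) -> str:
    """Send coverage runs to the report and mutation runs into the mutation phase."""
    return "mutation_generator" if state["mode"] == "mutation" else "reporter"
```

Each function returns the name of the next node, which keeps the branching logic testable without building a graph.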

Key design decisions:

  • LangGraph for stateful workflow with conditional edges and two independent reflexion loops
  • Different models per phase: the mutation generator uses gpt-4o-mini at temp 0.7 and the test generator uses gpt-4o at temp 0.2, which prevents AI blind spots
  • String-based mutation injection: replaces exact code snippets rather than line numbers for robustness
  • 3-stage filter for killing tests: compile → pass original → fail mutant (eliminates false positives)
  • tree-sitter for multi-language AST parsing
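
String-based injection with guaranteed restore can be sketched as a context manager. This is an illustration under assumptions, not the code in `mutation_injector.py`: the name `injected_mutation` is hypothetical, and so is the rule that the snippet must match exactly once.

```python
from contextlib import contextmanager
from pathlib import Path
from typing import Iterator


@contextmanager
def injected_mutation(path: Path, original: str, mutated: str) -> Iterator[None]:
    """Swap one exact snippet in a file, then restore the file on exit."""
    source = path.read_bytes()  # bytes keep line endings untouched on restore
    needle = original.encode("utf-8")
    # Assumption: an ambiguous or missing snippet is rejected rather than guessed at.
    if source.count(needle) != 1:
        raise ValueError("original snippet must occur exactly once in the file")
    path.write_bytes(source.replace(needle, mutated.encode("utf-8"), 1))
    try:
        yield  # caller runs the test suite against the mutated file here
    finally:
        path.write_bytes(source)  # restore even if the test run raises
```

Matching on the snippet text rather than a line number means the injection still lands correctly if earlier lines in the file have shifted.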

Project Structure

src/agent_forge/
├── cli/
│   ├── app.py              # Commands: run, analyze, version. --mode flag
│   └── display.py          # Rich panels, mutation score tables, surviving mutants
├── config/
│   └── settings.py         # All settings including mutation model/temperature
├── engine/
│   ├── graph.py            # LangGraph StateGraph with mode_router conditional edge
│   ├── state.py            # AgentState: 20+ fields including mutation state
│   ├── nodes/
│   │   ├── planner.py
│   │   ├── diff_fetcher.py
│   │   ├── code_analyzer.py
│   │   ├── coverage_checker.py
│   │   ├── test_generator.py
│   │   ├── test_runner.py
│   │   ├── critic.py
│   │   ├── reporter.py
│   │   ├── mutation_generator.py     # Generates realistic bugs via LLM
│   │   ├── equivalence_detector.py   # LLM-as-judge filtering
│   │   ├── mutation_runner.py        # Injects mutations, runs tests
│   │   ├── killing_test_generator.py # Generates bug-catching tests
│   │   ├── killing_test_runner.py    # 3-stage filter pipeline
│   │   └── mutation_critic.py        # Rule-based reflexion feedback
│   └── prompts/
│       ├── test_generation.py
│       └── mutation.py               # 6 prompt builders for mutation pipeline
├── tools/
│   ├── github/             # GitHub client (gh CLI + PyGithub fallback)
│   ├── analysis/           # tree-sitter AST analyzer
│   └── runners/
│       ├── gradle.py
│       └── mutation_injector.py      # Context manager: inject → test → restore
└── models/
    ├── ...
    └── mutation.py                   # Mutant, MutationRunResult, KillingTestResult

Configuration

| Variable | Description | Default |
| --- | --- | --- |
| OPENAI_API_KEY | OpenAI API key | (required) |
| GITHUB_TOKEN | GitHub PAT (fallback if gh CLI unavailable) | (optional) |
| AGENT_FORGE_MODEL | Model for test generation | gpt-4o |
| AGENT_FORGE_TEMPERATURE | Temperature for test generation | 0.2 |
| AGENT_FORGE_MAX_REFLEXION_ITERATIONS | Coverage reflexion iterations | 3 |
| AGENT_FORGE_TEST_TIMEOUT_SECONDS | Test execution timeout (seconds) | 300 |
| AGENT_FORGE_MUTATION_MODEL | Model for mutation generation | gpt-4o-mini |
| AGENT_FORGE_MUTATION_TEMPERATURE | Temperature for mutation generation | 0.7 |
| AGENT_FORGE_EQUIVALENCE_MODEL | Model for equivalence detection | gpt-4o-mini |
| AGENT_FORGE_MAX_MUTANTS_PER_PR | Cap on mutations generated per PR | 12 |
| AGENT_FORGE_MAX_MUTATION_ITERATIONS | Killing test reflexion iterations | 2 |
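
As a rough picture of how the `AGENT_FORGE_*` variables could map onto typed settings, the sketch below uses only the standard library. The `ForgeSettings` class and `load_settings` function are hypothetical; `config/settings.py` may well use a different mechanism. The variable names and defaults are taken from the table above.

```python
import os
from dataclasses import dataclass, fields
from typing import Mapping


@dataclass(frozen=True)
class ForgeSettings:
    """Typed view of the AGENT_FORGE_* variables with their documented defaults."""

    model: str = "gpt-4o"
    temperature: float = 0.2
    max_reflexion_iterations: int = 3
    test_timeout_seconds: int = 300
    mutation_model: str = "gpt-4o-mini"
    mutation_temperature: float = 0.7
    equivalence_model: str = "gpt-4o-mini"
    max_mutants_per_pr: int = 12
    max_mutation_iterations: int = 2


def load_settings(env: Mapping[str, str] = os.environ) -> ForgeSettings:
    """Override defaults with any AGENT_FORGE_<FIELD> values present in env."""
    defaults = ForgeSettings()
    overrides = {}
    for field in fields(ForgeSettings):
        raw = env.get(f"AGENT_FORGE_{field.name.upper()}")
        if raw is not None:
            # Coerce the string using the type of the default (str, int, or float).
            caster = type(getattr(defaults, field.name))
            overrides[field.name] = caster(raw)
    return ForgeSettings(**overrides)
```

Passing a plain dict instead of `os.environ` makes the loader easy to exercise in tests, for example `load_settings({"AGENT_FORGE_MAX_MUTANTS_PER_PR": "5"})`.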

Model Strategy

| Node | Model | Temp | Rationale |
| --- | --- | --- | --- |
| Test generator | gpt-4o | 0.2 | Precision: tests must be syntactically correct |
| Mutation generator | gpt-4o-mini | 0.7 | Creative: different model prevents AI blind spots |
| Equivalence detector | gpt-4o-mini | 0.0 | Cheap binary classification |
| Killing test generator | gpt-4o | 0.2 | Precision: must compile and pass 3-stage filter |

Using the same model to write tests and to generate mutations creates a systematic blind spot: the model tends to generate only the bugs it already "knows" to avoid, so its own tests catch them too easily. Separating the models is key to mutation effectiveness.


Supported Languages

| Language | AST Parsing | Test Generation | Mutation Testing |
| --- | --- | --- | --- |
| Java (JUnit 5 + Gradle) | ✅ tree-sitter | ✅ GPT-4o | ✅ |
| Python (pytest) | Planned | Planned | Planned |
| TypeScript (Jest) | Planned | Planned | Planned |
| Kotlin (JUnit 5) | Planned | Planned | Planned |

Roadmap

  • JaCoCo coverage collection (real line-level coverage deltas)
  • Maven runner support
  • Python + TypeScript language handlers
  • --output json for CI integration
  • GitHub Actions integration: post mutation score as PR comment
  • Report persistence and agent-forge report <id> command
  • Penetration testing profile (semgrep-based static analysis)

Development

make test          # Unit tests
make lint          # Check with ruff
make format        # Auto-fix with ruff
make typecheck     # mypy

License

Apache 2.0
