Skip to content

Conversation

@theobarberbany
Copy link
Contributor

@theobarberbany theobarberbany commented Nov 27, 2025

This is experimental.

Part of the cluster infra f2f hackathon held last week.

Vibe coded test suite to test our vibe coded commands to review our vibe coded code.

image

Uses LLM as a judge (claude) to review output of the /api-review command.

Each suite can be configured to use a different model. The default judge is haiku 4.5, golden runs on sonnet, and integration on opus. I'm not sure these are the best choices (other than haiku for the judge), but we should be able to gather some data 🤞 .

Directory Structure

tests/eval/testdata/
├── golden/                     # Base truth tests - single isolated issues
│   ├── missing-optional-doc/
│   │   ├── patch.diff          # Triggers ONLY missing-optional-doc
│   │   └── expected.txt
│   ├── undocumented-enum/
│   │   ├── patch.diff          # Triggers ONLY undocumented-enum
│   │   └── expected.txt
│   ├── missing-featuregate/
│   │   ├── patch.diff          # Triggers ONLY missing-featuregate
│   │   └── expected.txt
│   └── valid-api-change/
│       ├── patch.diff          # Triggers NO issues
│       └── expected.txt
└── integration/                # Complex scenarios - multiple issues
    ├── new-field-all-issues/
    │   ├── patch.diff          # Triggers multiple issues together
    │   └── expected.txt
    └── partial-documentation/
        ├── patch.diff
        └── expected.txt

Test Case Format

patch.diff

Standard git diff format:

diff --git a/config/v1/types.go b/config/v1/types.go
--- a/config/v1/types.go
+++ b/config/v1/types.go
@@ -10,0 +11,5 @@
+// MyField does something
+// +optional
+// +kubebuilder:validation:Enum=Foo;Bar
+MyField string `json:"myField"`

expected.txt

One expected issue per line:

enum values Foo and Bar not documented in comment
optional field does not explain behavior when omitted

Empty file means the API change should pass review with no issues.

Note: Order of issues in expected.txt does not matter. Comparison uses semantic matching, not exact string matching.

Example output where we catch two instances of the command hallucinating new rules:

Look at Judge result and Judge reason for explanations of decisions made.

Details
Running Suite: API Review Eval Suite - /Users/tbarberb/projects/api/tests/eval
==============================================================================
Random Seed: 1764259960

Will run 4 of 4 specs
------------------------------
[BeforeSuite]
/Users/tbarberb/projects/api/tests/eval/eval_test.go:80
  STEP: verifying local AGENTS.md exists @ 11/27/25 17:12:40.662
  STEP: verifying local .claude/commands/api-review.md exists @ 11/27/25 17:12:40.662
  STEP: creating temp directory for clone @ 11/27/25 17:12:40.662
  STEP: shallow cloning openshift/api @ 11/27/25 17:12:40.662
  STEP: copying local AGENTS.md to temp clone @ 11/27/25 17:12:54.327
  STEP: copying local .claude directory to temp clone @ 11/27/25 17:12:54.328
[BeforeSuite] PASSED [13.667 seconds]
------------------------------
API Review Evaluation Golden Tests should correctly identify single issues missing-optional-doc
/Users/tbarberb/projects/api/tests/eval/eval_test.go:199
  STEP: verifying origin remote exists @ 11/27/25 17:12:54.329
  STEP: resetting repo to clean state @ 11/27/25 17:12:54.341
  STEP: copying local AGENTS.md to temp clone @ 11/27/25 17:12:54.832
  STEP: copying local .claude directory to temp clone @ 11/27/25 17:12:54.832
  STEP: reading and applying patch @ 11/27/25 17:12:54.833
  STEP: running API review via Claude (claude-sonnet-4-5@20250929) @ 11/27/25 17:12:54.847
  STEP: comparing results with Claude judge (claude-haiku-4-5@20251001) @ 11/27/25 17:14:19.61
  Judge result: pass=true, reason=The review correctly identified the core issue: the optional field does not explain behavior when omitted. The output semantic matches the expected issue by identifying that the comment lacks documentation of what happens when `customLogoURL` is not set (i.e., the default logo is used). The review provides appropriate context about the validation pattern allowing empty strings and suggests a fix that documents the omitted behavior.
• [88.746 seconds]
------------------------------
API Review Evaluation Golden Tests should correctly identify single issues undocumented-enum
/Users/tbarberb/projects/api/tests/eval/eval_test.go:199
  STEP: verifying origin remote exists @ 11/27/25 17:14:23.075
  STEP: resetting repo to clean state @ 11/27/25 17:14:23.097
  STEP: copying local AGENTS.md to temp clone @ 11/27/25 17:14:23.263
  STEP: copying local .claude directory to temp clone @ 11/27/25 17:14:23.263
  STEP: reading and applying patch @ 11/27/25 17:14:23.264
  STEP: running API review via Claude (claude-sonnet-4-5@20250929) @ 11/27/25 17:14:23.278

  STEP: comparing results with Claude judge (claude-haiku-4-5@20251001) @ 11/27/25 17:16:31.141
  Judge result: pass=true, reason=The review correctly identified that enum values Light and Dark are not documented in the comment. The output expanded on this core issue by also noting incomplete enum documentation, missing specific default behavior (Light vs Dark), and formatting issues, but all of these are semantic expansions of the same fundamental problem: the enum values are not properly documented as required by the API review standards.
• [131.341 seconds]
------------------------------
API Review Evaluation Golden Tests should correctly identify single issues valid-api-change
/Users/tbarberb/projects/api/tests/eval/eval_test.go:199
  STEP: verifying origin remote exists @ 11/27/25 17:16:34.417
  STEP: resetting repo to clean state @ 11/27/25 17:16:34.433
  STEP: copying local AGENTS.md to temp clone @ 11/27/25 17:16:34.589
  STEP: copying local .claude directory to temp clone @ 11/27/25 17:16:34.59
  STEP: reading and applying patch @ 11/27/25 17:16:34.59
  STEP: running API review via Claude (claude-sonnet-4-5@20250929) @ 11/27/25 17:16:34.603
  STEP: comparing results with Claude judge (claude-haiku-4-5@20251001) @ 11/27/25 17:17:26.436
  Judge result: pass=false, reason=The expected issues list is empty, but the API review output identifies a real issue: field documentation mismatch between 'bannerText' description and its enum validation. Since there are no expected issues to match against, and the review found an unrelated issue, this fails the comparison criteria.
  [FAILED] in [It] - /Users/tbarberb/projects/api/tests/eval/eval_test.go:304 @ 11/27/25 17:17:29.727
• [FAILED] [55.310 seconds]
API Review Evaluation Golden Tests should correctly identify single issues [It] valid-api-change
/Users/tbarberb/projects/api/tests/eval/eval_test.go:199

  [FAILED] API review did not match expected issues.
  Judge reason: The expected issues list is empty, but the API review output identifies a real issue: field documentation mismatch between 'bannerText' description and its enum validation. Since there are no expected issues to match against, and the review found an unrelated issue, this fails the comparison criteria.
  Review output:
  ✅ Linting checks passed.

  Now analyzing the changed file for documentation compliance:

  ## API Review Report

  ### 📄 File: `config/v1/types_console.go`

  **Status: ❌ FAILED**

  ---

  **Line 49: Field name does not match documentation**

  **Current (problematic) code:**
  ```go
  // bannerText is an optional field that specifies a custom banner message
  // to display at the top of the console. Valid values are "Info", "Warning",
  // and "Error" which control the banner styling. When omitted, no banner
  // is displayed.
  // +optional
  // +kubebuilder:validation:Enum=Info;Warning;Error
  BannerText string `json:"bannerText,omitempty"`
  ```

  **Suggested change:**
  ```diff
  - // bannerText is an optional field that specifies a custom banner message
  + // bannerText is an optional field that specifies the type of banner message
    // to display at the top of the console. Valid values are "Info", "Warning",
    // and "Error" which control the banner styling. When omitted, no banner
    // is displayed.
  ```

  **Explanation:** The field name is `bannerText` but the documentation says it "specifies a custom banner message". However, this field only accepts enum values "Info", "Warning", or "Error" - it doesn't accept arbitrary custom messages. The documentation should clarify that this field specifies the **type** of banner, not the message content itself. If this truly is meant to hold custom banner text, then the `Enum` validation is incorrect and should be removed.

  ---

  ## Summary

  ❌ **Review Failed**: 1 issue found

  The new `BannerText` field has a mismatch between its documentation and validation constraints. The comment suggests it holds "a custom banner message" but the enum validation only allows three specific values ("Info", "Warning", "Error"). Please clarify:

  1. If this field should accept arbitrary custom text, remove the `Enum` validation
  2. If this field should only accept enum values, update the documentation to reflect that it specifies the banner **type**, not custom message content

  Expected issues:

  Expected
      <bool>: false
  to be true
  In [It] at: /Users/tbarberb/projects/api/tests/eval/eval_test.go:304 @ 11/27/25 17:17:29.727
------------------------------
API Review Evaluation Integration Tests should correctly identify multiple issues new-field-multiple-issues
/Users/tbarberb/projects/api/tests/eval/eval_test.go:216
  STEP: verifying origin remote exists @ 11/27/25 17:17:29.742
  STEP: resetting repo to clean state @ 11/27/25 17:17:29.757
  STEP: copying local AGENTS.md to temp clone @ 11/27/25 17:17:29.915
  STEP: copying local .claude directory to temp clone @ 11/27/25 17:17:29.915
  STEP: reading and applying patch @ 11/27/25 17:17:29.916
  STEP: running API review via Claude (claude-opus-4-5@20251101) @ 11/27/25 17:17:29.931
  STEP: comparing results with Claude judge (claude-haiku-4-5@20251001) @ 11/27/25 17:18:31.13
  Judge result: pass=false, reason=The review identified an extra issue not in the expected list: 'missing URL validation pattern for customLogoURL' was reported as a separate finding beyond the expected 'missing length validation for customLogoURL'. While the review correctly identified all 4 expected issues (optional field behavior, URL validation, FeatureGate, and length validation), it reported validation as a separate concern than length validation, which appears to be treating them as distinct problems rather than recognizing they may be the same underlying validation gap.
  [FAILED] in [It] - /Users/tbarberb/projects/api/tests/eval/eval_test.go:304 @ 11/27/25 17:18:34.472
• [FAILED] [64.730 seconds]
API Review Evaluation Integration Tests should correctly identify multiple issues [It] new-field-multiple-issues
/Users/tbarberb/projects/api/tests/eval/eval_test.go:216

  [FAILED] API review did not match expected issues.
  Judge reason: The review identified an extra issue not in the expected list: 'missing URL validation pattern for customLogoURL' was reported as a separate finding beyond the expected 'missing length validation for customLogoURL'. While the review correctly identified all 4 expected issues (optional field behavior, URL validation, FeatureGate, and length validation), it reported validation as a separate concern than length validation, which appears to be treating them as distinct problems rather than recognizing they may be the same underlying validation gap.
  Review output:
  ## API Review Results

  **File reviewed:** `config/v1/types_console.go`

  **Linting:** ✅ Passed (0 issues)

  ---

  ### ❌ Review FAILED - Documentation Issues Found

  **+44: Missing optional field behavior documentation**
  **Current (problematic) code:**
  ```go
        // customLogoURL specifies a URL for a custom logo image.
        // +optional
        CustomLogoURL string `json:"customLogoURL,omitempty"`
  ```

  **Suggested change:**
  ```diff
  -     // customLogoURL specifies a URL for a custom logo image.
  +     // customLogoURL specifies a URL for a custom logo image.
  +     // When omitted, the default OpenShift logo will be displayed.
  ```

  **Explanation:** Optional fields MUST explain the behavior when the field is not present.

  ---

  **+44: Missing URL validation**
  **Current (problematic) code:**
  ```go
        // customLogoURL specifies a URL for a custom logo image.
        // +optional
        CustomLogoURL string `json:"customLogoURL,omitempty"`
  ```

  **Suggested change:**
  ```diff
        // customLogoURL specifies a URL for a custom logo image.
  +     // The URL must be a valid HTTPS URL.
  +     // When omitted, the default OpenShift logo will be displayed.
        // +optional
  +     // +kubebuilder:validation:Pattern=`^$|^https://.*$`
        CustomLogoURL string `json:"customLogoURL,omitempty"`
  ```

  **Explanation:** URL fields should have validation to ensure proper format. Compare with `LogoutRedirect` field at line 83 which validates HTTPS URLs with a pattern.

  ---

  **+44: New stable API field should be feature-gated**
  **Current (problematic) code:**
  ```go
        CustomLogoURL string `json:"customLogoURL,omitempty"`
  ```

  **Suggested change:**
  ```diff
  +     // +openshift:enable:FeatureGate=ConsoleCustomLogo
        CustomLogoURL string `json:"customLogoURL,omitempty"`
  ```

  **Explanation:** Per CLAUDE.md guidelines, new fields on stable APIs should be introduced behind a feature gate.

  ---

  ### Summary

  | Check | Status |
  |-------|--------|
  | Linting | ✅ Pass |
  | Field Documentation | ✅ Pass |
  | Optional Behavior Documented | ❌ Fail |
  | Validation Markers Present | ❌ Fail |
  | Feature Gate for New Field | ❌ Fail |

  Expected issues:
  optional field customLogoURL does not explain behavior when omitted
  missing URL validation pattern for customLogoURL
  missing FeatureGate for new field on stable API
  missing length validation for customLogoURL
  Expected
      <bool>: false
  to be true
  In [It] at: /Users/tbarberb/projects/api/tests/eval/eval_test.go:304 @ 11/27/25 17:18:34.472
------------------------------
[AfterSuite]
/Users/tbarberb/projects/api/tests/eval/eval_test.go:114
  STEP: cleaning up temp directory @ 11/27/25 17:18:34.49
[AfterSuite] PASSED [0.853 seconds]
------------------------------

Summarizing 2 Failures:
  [FAIL] API Review Evaluation Golden Tests should correctly identify single issues [It] valid-api-change
  /Users/tbarberb/projects/api/tests/eval/eval_test.go:304
  [FAIL] API Review Evaluation Integration Tests should correctly identify multiple issues [It] new-field-multiple-issues
  /Users/tbarberb/projects/api/tests/eval/eval_test.go:304

Ran 4 of 4 Specs in 354.680 seconds
FAIL! -- 2 Passed | 2 Failed | 0 Pending | 0 Skipped
--- FAIL: TestEval (354.68s)
FAIL

Ginkgo ran 1 suite in 5m55.161278125s

Test Suite Failed

@openshift-ci-robot
Copy link

Pipeline controller notification
This repository is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To trigger manually all jobs from second stage use /pipeline required command.

This repository is configured in: LGTM mode

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Nov 27, 2025
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Nov 27, 2025

Hello @theobarberbany! Some important instructions when contributing to openshift/api:
API design plays an important part in the user experience of OpenShift and as such API PRs are subject to a high level of scrutiny to ensure they follow our best practices. If you haven't already done so, please review the OpenShift API Conventions and ensure that your proposed changes are compliant. Following these conventions will help expedite the api review process for your PR.

@coderabbitai
Copy link

coderabbitai bot commented Nov 27, 2025

Important

Review skipped

Auto reviews are limited based on label configuration.

🚫 Excluded labels (none allowed) (1)
  • do-not-merge/work-in-progress

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

✨ Finishing touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci openshift-ci bot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Nov 27, 2025
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Nov 27, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign everettraven for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@theobarberbany theobarberbany force-pushed the test-evals branch 2 times, most recently from 99de6d4 to 06bca4c Compare November 27, 2025 13:54
@theobarberbany theobarberbany marked this pull request as draft November 27, 2025 14:15
This builds a basic go test suite that uses claude as a judge to review
the output of the /api-review command.

See tests/eval/DESIGN.md for more details.
@openshift-ci openshift-ci bot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Nov 27, 2025
@theobarberbany theobarberbany changed the title WIP: evals for api review command evals for api review command Dec 1, 2025
@theobarberbany theobarberbany marked this pull request as ready for review December 2, 2025 09:37
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Dec 2, 2025
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Dec 2, 2025

@theobarberbany: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/integration a999d82 link true /test integration

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants