refactor: git SHA-based Docker tags by default, versioned tags only on releases #1088

xingyaoww · 2025-11-06T20:31:12Z

Overview

This PR refactors the Docker image tagging strategy to use git commit SHAs for regular builds and semantic versions only for official releases. This ensures reproducibility while keeping the tagging system clean and maintainable.

Key Changes

1. Git SHA-Based Tagging (Default Behavior)

Regular builds (PR, main branch, manual triggers) now use git commit SHA for image tags:

22e00fb-python          # Commit SHA + variant
22e00fb-java            # Commit SHA + variant
main-python             # main branch builds also get this tag

Benefits:

✅ Reproducible: Every commit produces a unique, traceable image
✅ Simple: No confusion between package version and actual code version
✅ Git-native: Uses git as the source of truth

2. Semantic Version Tags Only on Git Tag Releases

Semantic versioned tags (e.g., v1.0.0_...) are only created when:

A git tag is pushed (any tag: 1.0.0, 1.0.0a5, 1.0.0rc1, etc.)
The workflow detects refs/tags/* trigger
The build includes --versioned-tag flag

Example workflow:

# Developer pushes a git tag
git tag 1.0.0
git push origin 1.0.0

# GitHub Actions automatically builds and tags:
# - 22e00fb-python (commit SHA tag - always included)
# - v1.0.0_nikolaik_s_python-nodejs_tag_python3.12-nodejs22_binary (versioned tag)

3. SDK_SHA and SDK_REF Environment Variable Overrides

For CI systems and submodule scenarios where git commands may not work correctly, you can now explicitly override git information:

# Priority order for SHA:
# 1. SDK_SHA (explicit override - NEW!)
# 2. GITHUB_SHA (GitHub Actions)
# 3. git rev-parse HEAD (local)

# Priority order for REF:
# 1. SDK_REF (explicit override - NEW!)
# 2. GITHUB_REF (GitHub Actions)
# 3. git symbolic-ref HEAD (local)

Use case - Git submodules:

# In parent repo (benchmarks) that vendors SDK as submodule
def get_submodule_commit():
    result = subprocess.run(
        ["git", "rev-parse", "--short=7", "HEAD:vendor/software-agent-sdk"],
        capture_output=True, text=True
    )
    return result.stdout.strip()

# Pass to SDK build
os.environ["SDK_SHA"] = get_submodule_commit()  # e.g., "a612c0a"
os.environ["SDK_REF"] = "refs/heads/main"

4. Target Suffix Simplification

Removed target suffixes from binary builds (the default) to keep tags clean:

Before:

22e00fb-python-binary          # Redundant suffix
22e00fb-python-source          # Source build

After:

22e00fb-python                 # Binary (default, no suffix)
22e00fb-python-source          # Source build (explicit suffix)

Workflow Changes

`.github/workflows/server.yml`

Triggers:

✅ Push to main → SHA-based tags only
✅ Pull requests → SHA-based tags only
✅ Any git tag push → SHA-based tags + versioned tags
✅ Manual dispatch → SHA-based tags only

Tag logic:

# Always build SHA-based tags
uv run build.py --build-ctx-only --arch amd64

# Add versioned tag when triggered by git tag
if [[ "${{ github.ref }}" == refs/tags/* ]]; then
    uv run build.py --build-ctx-only --arch amd64 --versioned-tag
fi

`.github/workflows/pypi-release.yml`

No changes - PyPI publishing remains completely separate from Docker image building. Publishing to PyPI does not trigger Docker builds.

API Changes

`build.py` Changes

Added:

SDK_SHA environment variable (higher priority than GITHUB_SHA)
SDK_REF environment variable (higher priority than GITHUB_REF)
--versioned-tag CLI flag (default: False)
BuildOptions.include_versioned_tag field (default: False)

Changed:

Separated _git_info() (returns git SHA/ref) from _package_version() (returns semantic version)
SHORT_SHA used for commit-based tags
PACKAGE_VERSION used only for versioned tags (when --versioned-tag is set)
Binary builds no longer include -binary suffix (except in versioned tags)

Removed:

BuildOptions.is_dev property (no longer needed)

Example Tag Outputs

Regular PR/Push (SHA-based only)

22e00fb-python-amd64
22e00fb-python-arm64
22e00fb-python           # Multi-arch manifest

Main Branch Push (SHA + main tags)

22e00fb-python
main-python              # Also tagged as "main"

Git Tag Release (SHA + versioned tags)

# Push git tag
git tag 1.0.0
git push origin 1.0.0

# Results in:
22e00fb-python
v1.0.0_nikolaik_s_python-nodejs_tag_python3.12-nodejs22_binary

Non-binary Target (with suffix)

22e00fb-python-source           # Source build
22e00fb-python-minimal          # Minimal build

Benefits of This Approach

🎯 Reproducibility: Every build is tied to an exact git commit
🧹 Cleanliness: No versioned tags polluting every build
📦 Separation of Concerns: PyPI releases separate from Docker builds
🏷️ Semantic Releases: Versioned tags only appear when you explicitly release (git tag)
🔧 CI Flexibility: Override SHA/REF for non-GitHub CI systems and submodules
🎨 Simplicity: Binary (default) builds have clean tags without redundant suffixes

Backward Compatibility

✅ Fully backward compatible:

Existing tag formats still work
SDK_SHA/SDK_REF only used when explicitly set
--versioned-tag defaults to False (opt-in)
No breaking changes to existing workflows

Testing

✅ All pre-commit checks pass (ruff, pycodestyle, pyright)
✅ New tests added for SDK_SHA and SDK_REF overrides
✅ 6 new test cases covering priority order and submodule scenarios
✅ Existing build.py functionality preserved

Migration Guide

For SDK Repository

No changes needed - the new behavior is automatic:

Regular builds use SHA tags
Git tag pushes include versioned tags

For Downstream Projects (Submodule Use Case)

If you vendor the SDK as a submodule and need to override git info:

# Extract submodule commit from parent repo
submodule_sha = subprocess.run(
    ["git", "rev-parse", "--short=7", "HEAD:path/to/sdk"],
    capture_output=True, text=True
).stdout.strip()

# Set overrides before building
os.environ["SDK_SHA"] = submodule_sha
os.environ["SDK_REF"] = "refs/heads/main"  # or actual ref

# Build as normal
subprocess.run(["uv", "run", "build.py", ...])

Visual Explanation

Regular Build Flow (PR/Main)

┌─────────────────────────────────────────┐
│ Push commit 22e00fb                     │
│ ↓                                       │
│ GitHub Actions triggered                │
│ ↓                                       │
│ build.py (no --versioned-tag)           │
│ ↓                                       │
│ Tags: 22e00fb-python, 22e00fb-java      │
└─────────────────────────────────────────┘

Release Build Flow (Git Tag)

┌─────────────────────────────────────────┐
│ git tag 1.0.0 && git push origin 1.0.0  │
│ ↓                                       │
│ GitHub Actions (refs/tags/1.0.0)        │
│ ↓                                       │
│ build.py --versioned-tag                │
│ ↓                                       │
│ Tags:                                   │
│   - 22e00fb-python (SHA tag)            │
│   - v1.0.0_..._binary (versioned tag)   │
└─────────────────────────────────────────┘

Submodule Override Flow

┌───────────────────────────────────────────────┐
│ benchmarks/ (parent repo)                     │
│ └── vendor/software-agent-sdk/ (submodule)   │
│                                               │
│ Problem: git commands in submodule return     │
│          parent repo info (detached HEAD)     │
│                                               │
│ Solution:                                     │
│ 1. Extract submodule SHA from parent:         │
│    git rev-parse HEAD:vendor/sdk → a612c0a   │
│ 2. Set SDK_SHA=a612c0a                       │
│ 3. Build → Tags use correct submodule SHA    │
└───────────────────────────────────────────────┘

Co-authored-by: openhands [email protected]

@xingyaoww can click here to continue refining the PR

Agent Server images for this PR

• GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server

Variants & Base Images

Variant	Architectures	Base Image	Docs / Tags
java	amd64, arm64	`eclipse-temurin:17-jdk`	Link
python	amd64, arm64	`nikolaik/python-nodejs:python3.12-nodejs22`	Link
golang	amd64, arm64	`golang:1.21-bookworm`	Link

Pull (multi-arch manifest)

# Each variant is a multi-arch manifest supporting both amd64 and arm64
docker pull ghcr.io/openhands/agent-server:a931bdc-python

Run

docker run -it --rm \
  -p 8000:8000 \
  --name agent-server-a931bdc-python \
  ghcr.io/openhands/agent-server:a931bdc-python

All tags pushed for this build

ghcr.io/openhands/agent-server:a931bdc-golang-amd64
ghcr.io/openhands/agent-server:a931bdc-golang_tag_1.21-bookworm-amd64
ghcr.io/openhands/agent-server:a931bdc-golang-arm64
ghcr.io/openhands/agent-server:a931bdc-golang_tag_1.21-bookworm-arm64
ghcr.io/openhands/agent-server:a931bdc-java-amd64
ghcr.io/openhands/agent-server:a931bdc-eclipse-temurin_tag_17-jdk-amd64
ghcr.io/openhands/agent-server:a931bdc-java-arm64
ghcr.io/openhands/agent-server:a931bdc-eclipse-temurin_tag_17-jdk-arm64
ghcr.io/openhands/agent-server:a931bdc-python-amd64
ghcr.io/openhands/agent-server:a931bdc-nikolaik_s_python-nodejs_tag_python3.12-nodejs22-amd64
ghcr.io/openhands/agent-server:a931bdc-python-arm64
ghcr.io/openhands/agent-server:a931bdc-nikolaik_s_python-nodejs_tag_python3.12-nodejs22-arm64
ghcr.io/openhands/agent-server:a931bdc-golang
ghcr.io/openhands/agent-server:a931bdc-java
ghcr.io/openhands/agent-server:a931bdc-python

About Multi-Architecture Support

Each variant tag (e.g., a931bdc-python) is a multi-arch manifest supporting both amd64 and arm64
Docker automatically pulls the correct architecture for your platform
Individual architecture tags (e.g., a931bdc-python-amd64) are also available if needed

This PR enhances the Docker image tagging system with two key improvements: 1. **SDK_VERSION_OVERRIDE Environment Variable** - Allows overriding the SDK package version with a git commit hash - Critical for git submodule contexts where `importlib.metadata` returns the wrong version (package version instead of actual commit) - Enables reproducible builds by using exact commit hashes instead of semantic versions - Backward compatible: only used when environment variable is set 2. **Target-Based Tag Suffixes** (replaces `-dev` suffix) - All tags now include the build target as suffix: `-binary`, `-source`, `-binary-minimal`, `-source-minimal` - More descriptive than previous `-dev` suffix (which only applied to source builds) - Makes tag meaning immediately clear without needing to check build config - Removes deprecated `is_dev` property **Why SDK_VERSION_OVERRIDE is Needed:** When the SDK is used as a git submodule (vendored), running git commands inside the submodule directory doesn't work as expected because: - The submodule is in detached HEAD state - Git context points to parent repo, not submodule repo - `importlib.metadata.version()` returns the package version (e.g., 1.0.0) instead of the actual commit being used This environment variable allows the parent build system to: 1. Extract the actual submodule commit hash 2. Pass it to the build process via SDK_VERSION_OVERRIDE 3. Get reproducible image tags: `a612c0a-swebench-12155-binary` **Example Tag Outputs:** Before: - `v1.0.0_ubuntu-22.04_binary` (not reproducible) - `v1.0.0_ubuntu-22.04_source-dev` (unclear what `-dev` means) After: - `a612c0a-swebench-12155-binary` (exact commit + instance ID + target) - `main-swebench-12155-source` (clear target indication) **API Changes:** - New: `SDK_VERSION_OVERRIDE` env var support in `_sdk_version()` - New: `BuildOptions.include_versioned_tag: bool = True` field - Changed: All tags now include `-{target}` suffix - Removed: `BuildOptions.is_dev` property (deprecated) **Backward Compatibility:** - SDK_VERSION_OVERRIDE only used when explicitly set - `include_versioned_tag` defaults to True (existing behavior) - Tag suffix change is non-breaking (new format is more descriptive) **Related:** - OpenHands/benchmarks PR#51: Uses this feature for SWE-Bench image builds - Complements existing GITHUB_REF override mechanism Co-authored-by: openhands <[email protected]>

This commit improves the tagging system for SWE-Bench Docker images to enable better reproducibility and clarity. ## Changes ### 1. Benchmarks Build System **benchmarks/swe_bench/build_images.py:** - Added `get_sdk_commit_hash()`: Extracts 7-char SDK submodule commit hash - Added `extract_instance_id()`: Parses SWE-Bench base images to extract instance IDs - Modified `main()`: Sets SDK_VERSION_OVERRIDE env var with SDK commit hash - Modified `build_one()`: - Generates custom tags: `swebench-{instance_id}` - Disables versioned tags via `include_versioned_tag=False` ### 2. SDK Submodule Update **vendor/software-agent-sdk:** Updated to commit 77d50e61 which includes: - `SDK_VERSION_OVERRIDE` environment variable support - `include_versioned_tag` option in BuildOptions - Target-based tag suffixes (replaces `-dev` suffix) - See: OpenHands/software-agent-sdk#1088 ### 3. Documentation **TAGGING_CHANGES.md:** Comprehensive documentation explaining: - Why these changes are needed (submodule git context issues) - Tag format comparison (before/after) - Benefits (reproducibility, usability, maintainability) - Implementation details and examples ## Tag Format ### Before ``` v1.0.0_docker.io_s_swebench_s_sweb.eval.x86_64.django_1776_django-12155_tag_latest_source-minimal-dev ``` - 137 characters - Package version (non-reproducible) - Unclear `-dev` suffix ### After ``` a612c0a-swebench-django-12155-source-minimal main-swebench-django-12155-source-minimal ``` - 84 characters (39% shorter) - Exact commit hash (reproducible) - Clear target indication ## Benefits 1. **Reproducibility**: Git commit hash ensures exact SDK version tracking 2. **Clarity**: Instance ID and target clearly visible in tag 3. **Consistency**: All builds use same suffix pattern 4. **Backward Compatible**: SDK changes only apply when explicitly enabled ## Related - SDK PR: OpenHands/software-agent-sdk#1088 - Issue: Improve SWE-Bench image build workflow Co-authored-by: openhands <[email protected]>

Binary is the most common/default build target, so we don't need to add a suffix for it. This keeps tags cleaner for the standard case. Tag examples: - Binary: `a612c0a-custom-tag` (no suffix) - Source: `a612c0a-custom-tag-source` - Binary-minimal: `a612c0a-custom-tag-binary-minimal` - Source-minimal: `a612c0a-custom-tag-source-minimal` This follows the principle that the default/common case shouldn't require extra labeling. Co-authored-by: openhands <[email protected]>

SHORT_SHA was incorrectly using git info from the calling repo (e.g., benchmarks) instead of the SDK repo. Now uses SDK_VERSION which respects SDK_VERSION_OVERRIDE, ensuring tags use the correct SDK commit hash. This fixes the issue where tags would show benchmarks commit hash instead of SDK commit hash when building from a vendored submodule. Co-authored-by: openhands <[email protected]>

Previously SDK_VERSION required manual override via SDK_VERSION_OVERRIDE env var. This was fragile and required external coordination. Now SDK_VERSION automatically: - Finds the SDK repo root using existing _default_sdk_project_root() - Gets git commit from that specific directory (not CWD) - Works correctly in submodule/vendored contexts - Falls back to package version only if git unavailable Benefits: - No manual environment variable management needed - More robust - always gets correct SDK commit regardless of CWD - Simpler for consumers - just works out of the box - Eliminates the SHORT_SHA bug (was using wrong repo's commit) This makes the tagging system fully automatic and eliminates a class of potential bugs where tags would show the wrong commit hash. Co-authored-by: openhands <[email protected]>

openhands-ai · 2025-11-06T21:13:13Z

Looks like there are a few issues preventing this PR from being merged!

GitHub Actions are failing:
- Agent Server

If you'd like me to help, just leave a comment, like

@OpenHands please fix the failing actions on PR #1088 at branch `feat/improve-docker-image-tagging`

Feel free to include any additional details that might help me get this PR into a better state.

_{^{You can manage your notification settings}}

xingyaoww · 2025-11-06T22:58:03Z

@OpenHands review diff against main, i think we should only tag version-based docker image when we are releasing a new version.

Can you review the workflows especially the one that publish to PyPI and see if we can add tagging "versioned" tag there - we can make a new job in that workflow that depends on server.yml and then additionally tag the sha-based image to version-based tag.

Help me clean up all existing logic of fetching package version for images tags -- instead we should go with a GIT_SHA centric logic.

openhands-ai · 2025-11-06T22:58:13Z

I'm on it! xingyaoww can track my progress at all-hands.dev

…or PyPI releases - Removed SDK_VERSION_OVERRIDE environment variable logic - Separated git SHA (SHORT_SHA) for commit-based tags from PACKAGE_VERSION for versioned tags - Changed include_versioned_tag default to False (only enable for releases) - Added --versioned-tag CLI flag to build.py for release builds - Added workflow_call trigger to server.yml to enable calling from other workflows - Added three new jobs to pypi-release.yml: 1. build-docker-images: calls server.yml to build SHA-based images 2. tag-versioned-images: pulls SHA images and tags with semantic version 3. create-manifest: creates multi-arch manifests for versioned tags This ensures: - Regular builds (PRs, main branch) only use git SHA-based tags for reproducibility - Release builds (PyPI publish) additionally tag images with semantic versions - No more confusion between package version and commit hash Co-authored-by: openhands <[email protected]>

openhands-ai · 2025-11-06T23:08:27Z

Summary of Changes

I've successfully refactored the Docker image tagging system to use git SHA for regular builds and semantic versions only for PyPI releases, as requested.

✅ Changes Completed

1. build.py Refactoring (`openhands-agent-server/openhands/agent_server/docker/build.py`)

Removed SDK_VERSION_OVERRIDE environment variable logic entirely
Separated concerns:
- SHORT_SHA (from git commit) → used for commit-based tags
- PACKAGE_VERSION (from package metadata) → used for versioned tags only
Changed default: include_versioned_tag now defaults to False (was True)
Added CLI flag: --versioned-tag to explicitly enable versioned tags for release builds
Cleaned up: Removed complex _git_info_for_repo() function, simplified to git SHA-centric approach

2. server.yml Workflow (`.github/workflows/server.yml`)

Added workflow_call trigger to enable calling from pypi-release.yml
No changes to build behavior: Continues to build SHA-based tags only (since --versioned-tag flag is not passed)

3. pypi-release.yml Workflow (`.github/workflows/pypi-release.yml`)

Added permissions: packages: write for Docker registry access
Added package version extraction: Reads version from openhands-sdk/pyproject.toml
Added 3 new jobs (depends on publish job):
1. build-docker-images: Calls server.yml to build SHA-based images
2. tag-versioned-images: Pulls SHA images and tags them with semantic versions (e.g., v1.0.0_...)
3. create-manifest: Creates multi-arch manifests for versioned tags

🎯 Key Benefits

Clear separation of concerns: Git SHA for regular builds, semantic version only for PyPI releases
Better reproducibility: All builds use git SHA from actual commit (no more submodule confusion)
Efficiency: Release workflow reuses SHA-based images instead of rebuilding
Cleaner architecture: Removed the SDK_VERSION_OVERRIDE workaround

📊 Image Tag Examples

Regular builds (PRs, main branch):

ghcr.io/openhands/agent-server:85e436d-python
ghcr.io/openhands/agent-server:85e436d-java
ghcr.io/openhands/agent-server:85e436d-golang

PyPI release builds (additionally tagged):

ghcr.io/openhands/agent-server:v1.0.0_python-nodejs_tag_python3.12-nodejs22_binary
ghcr.io/openhands/agent-server:v1.0.0_eclipse-temurin_tag_17-jdk_binary
ghcr.io/openhands/agent-server:v1.0.0_golang_tag_1.21-bookworm_binary

All changes have been committed and pushed to the feat/improve-docker-image-tagging branch. Pre-commit checks passed successfully.

_{View full conversation}

…oach) Changes: - server.yml now triggers on 'v*' git tags (e.g., v1.0.0, v1.0.0-beta.1) - When triggered by tag, automatically includes --versioned-tag flag - Simplified pypi-release.yml to only publish to PyPI (removed Docker logic) - Removed workflow_call trigger (no longer needed) Benefits: - Standard practice: git tag → versioned Docker images - Decoupled concerns: PyPI publishing vs Docker image building - Simpler workflow: no pull-and-retag complexity - More flexible: can trigger versioned builds independently Release workflow: 1. Run pypi-release.yml to publish packages to PyPI 2. Create and push git tag (e.g., git tag v1.0.0 && git push origin v1.0.0) 3. server.yml automatically builds both SHA and versioned Docker images Co-authored-by: openhands <[email protected]>

Changes: - server.yml now triggers on any tag (e.g., 1.0.0, 1.0.0a5, build-docker) - When triggered by tag, automatically includes --versioned-tag flag - Removed workflow_call trigger (not needed) - Reverted pypi-release.yml to original (no Docker logic needed there) Benefits: - Simpler: just push a git tag to get versioned Docker images - Matches existing tag pattern (1.0.0, 1.0.0a5, etc.) - Decoupled: PyPI publishing and Docker image building are independent Release workflow: 1. Run pypi-release.yml to publish packages to PyPI 2. Push git tag: git tag 1.0.0 && git push origin 1.0.0 3. server.yml automatically builds both SHA-based and versioned Docker images Co-authored-by: openhands <[email protected]>

Add SDK_SHA and SDK_REF as higher-priority environment variable overrides for git commit SHA and ref detection. These take precedence over GITHUB_SHA and GITHUB_REF, enabling non-GitHub CI systems and submodule scenarios to provide explicit git information. Priority order: 1. SDK_SHA / SDK_REF (explicit override for any CI system) 2. GITHUB_SHA / GITHUB_REF (GitHub Actions environment) 3. git commands (local development) This allows parent repos to extract submodule commit hashes and pass them to SDK builds, solving the detached HEAD issue in git submodules. Co-authored-by: openhands <[email protected]>

github-actions · 2025-11-06T23:27:14Z

Coverage Report •

File	Stmts	Miss	Cover	Missing
openhands-agent-server/openhands/agent_server/docker
build.py	350	277	20%	56, 58, 66, 68–69, 71–75, 77, 81, 84, 87–90, 92–94, 96, 98, 100, 102, 106–107, 122, 124–125, 127–128, 131–132, 134, 136–137, 140, 143–144, 148–149, 168–171, 176–177, 180–181, 194, 196–204, 216, 223–230, 234–236, 242, 246–254, 267, 269–271, 274–276, 288, 290–291, 293–298, 301, 344–346, 350, 354, 358, 362, 366–370, 372, 376–377, 380–381, 383–385, 388, 391–392, 395–397, 404–406, 408–412, 414, 418–421, 423, 427–428, 432–433, 436, 439–440, 443, 445–450, 452, 456–464, 469–473, 477, 485–487, 495–498, 500–501, 503–504, 506, 517–518, 520, 522–523, 526–528, 530, 532, 540, 544–546, 552, 556, 566–567, 569, 571, 576, 580, 582–590, 592–593, 595–598, 605–606, 611, 614, 619, 624, 629, 635, 640, 647–648, 653, 658, 664, 669, 678, 681–687, 690–692, 695, 708–716, 719–720, 724–727, 729, 738, 749, 752, 763–773, 775–776, 780
TOTAL	11725	5412	53%

xingyaoww · 2025-11-07T15:38:34Z

tests/agent_server/test_docker_build.py

@OpenHands, please review the diff against main for the openhands-agent-server/openhands/agent_server/docker/build.py file and make sure all tests in this test_docker_build.py file match the implementation (do not change the implementation).

I'm on it! xingyaoww can track my progress at all-hands.dev

Review Complete: All Tests Match Implementation ✅

I've completed a thorough review of the tests in tests/agent_server/test_docker_build.py against the implementation in openhands-agent-server/openhands/agent_server/docker/build.py.

Summary

Result: ✅ NO CHANGES NEEDED

All 6 tests correctly match the implementation and validate the SDK_SHA/SDK_REF priority feature introduced in PR #1088.

Test Coverage Validation

All tests properly verify the priority order for git information:

SHA Priority (correctly tested):

✅ SDK_SHA takes priority (test_git_info_priority_sdk_sha)

✅ GITHUB_SHA used when SDK_SHA not set (test_git_info_priority_github_sha)

✅ Empty SDK_SHA falls back to GITHUB_SHA (test_git_info_empty_sdk_sha_falls_back)

REF Priority (correctly tested):

✅ SDK_REF takes priority (test_git_info_priority_sdk_ref)

✅ GITHUB_REF used when SDK_REF not set (test_git_info_priority_github_ref)

Use Cases:

✅ Submodule scenario validated (test_git_info_submodule_scenario)

Quality Checks Passed

✅ All 6 tests pass (0.03s execution time)

✅ Pre-commit checks pass (ruff format, ruff lint, pycodestyle, pyright)

✅ Test isolation verified (tests pass individually and in different orders)

✅ Proper mocking (git commands not called when env vars set)

✅ Edge cases covered (empty string fallback behavior)

Key Findings

The implementation uses Python's or operator for priority:

git_sha = os.environ.get("SDK_SHA") or os.environ.get("GITHUB_SHA") git_ref = os.environ.get("SDK_REF") or os.environ.get("GITHUB_REF")

Tests correctly validate:

Priority order (SDK_* > GITHUB_* > git commands)

Empty string fallback behavior (empty strings are falsy)

Submodule use case (main feature from PR description)

Short SHA generation ([:7] slicing)

No changes were required or made to the codebase. The tests are well-written and accurately reflect the implementation behavior.

_{View full conversation}

Add 7 new test cases covering: - Short image names (no truncation) - Images without tags - Long images with tags (truncation) - Long images without tags (truncation) - Custom max_len parameter - Digest consistency and determinism - Edge case of exact max_len All tests validate: - Length constraints are respected - Digest suffix format (13 chars: '-' + 12 hex) - Identifiable parts (repo name/tag) are preserved - Deterministic behavior for same inputs Co-authored-by: openhands <[email protected]>

neubig

I'm not super-familiar with standards around how these should be named, but everything here seems reasonable to me so I'll approve!

.github/workflows/pypi-release.yml

) * Add GitHub workflow for building SWE-Bench images with Blacksmith caching - Create workflow that can be manually triggered via workflow_dispatch - Integrate Blacksmith caching for faster Docker builds - Configure workflow to push images to ghcr.io/openhands/eval-agent-server - Make --critic parameter optional in build_images.py for build-only usage - Fix .gitignore patterns for eval_outputs and builds directories This workflow follows Blacksmith documentation for Docker builds and allows building SWE-Bench evaluation images with configurable parameters like dataset, split, target, platforms, and concurrent workers. Closes #37 * Use Blacksmith's setup-docker-builder action for faster Docker layer caching Following the pattern from OpenHands/software-agent-sdk#990 and Blacksmith's official documentation (https://docs.blacksmith.sh/blacksmith-caching/docker-builds), this change replaces the standard docker/setup-buildx-action with useblacksmith/setup-docker-builder@v1. Key improvements: - Replaces docker/setup-buildx-action@v3 with useblacksmith/setup-docker-builder@v1 - Removes manual cache configuration (useblacksmith/cache@v6) - Blacksmith's Docker builder automatically manages Docker layer caching via NVMe-backed sticky disks - Provides 2x to 40x improvements in build times according to Blacksmith's customers - Since we only build amd64 images, we don't need the complex multi-platform matrix strategy This approach is recommended for workflows that use Docker commands directly (as opposed to using docker/build-push-action). Co-authored-by: openhands <[email protected]> * revert unneed stuff * simplify setup dependency * set eval-agent-server * fix line break * default to 10 for testing * run on all prs for debugging * Fix pyarrow build issue by forcing binary wheel installation The GitHub Actions workflow was failing because uv was trying to build pyarrow from source, which requires the Arrow C++ library and CMake. This change adds the --no-build-package pyarrow flag to force uv to use the pre-built binary wheel instead of attempting to build from source. Co-authored-by: openhands <[email protected]> * Pin Python version to 3.12 to fix pyarrow compatibility The root cause of the build failure was that uv was installing Python 3.14.0, which doesn't have binary wheels for pyarrow 21.0.0 yet. This caused uv to attempt building from source, which failed due to missing Arrow C++ libraries. Solution: Added .python-version file to pin Python to 3.12, which matches the project's target-version in pyproject.toml and has full binary wheel support for all dependencies. Co-authored-by: openhands <[email protected]> * Fix artifact upload naming to avoid invalid characters Use github.run_id instead of dataset/split names which contain slashes that are invalid in artifact names. Also added if-no-files-found: warn to provide better feedback if logs are missing. Co-authored-by: openhands <[email protected]> * Fix artifact upload by archiving logs to avoid invalid filename characters GitHub Actions artifact upload doesn't allow colons in filenames, but our log paths contain colons from Docker image tags (e.g., 'django-11999:latest'). Archive the entire builds directory into a tar.gz before upload to work around this restriction. Co-authored-by: openhands <[email protected]> * Fix Docker cache tag length exceeding 128 character limit Docker image tags have a maximum length of 128 characters. When building SWE-Bench images with long base image names (e.g., scikit-learn), the generated cache tags exceed this limit and cause build failures with: 'ERROR: failed to configure registry cache exporter: invalid reference format' Solution: Apply a patch to vendor/software-agent-sdk that hashes the base_image_slug when it would cause the final tag to exceed 128 characters. Uses SHA256 hash (first 12 chars) to create a shorter unique identifier while maintaining cache efficiency. The patch is applied during the workflow setup before installing dependencies. Co-authored-by: openhands <[email protected]> * Update patch with pre-commit formatting fixes Updated the patch to match the formatting requirements from ruff and other pre-commit checks. This ensures the patch applies cleanly and passes all linting/formatting checks. Co-authored-by: openhands <[email protected]> * checkout to v1.0.0 of sdk * update uv.lock * Revert "Fix Docker cache tag length exceeding 128 character limit" This reverts commit 3ba1e46. * Fix log file mixing issue by using ProcessPoolExecutor The build workflow was experiencing log file corruption and I/O errors due to concurrent builds writing to the wrong log files. This was caused by using ThreadPoolExecutor with contextlib.redirect_stdout/stderr, which only provides thread-local redirection of Python-level writes. The SDK's build() function spawns subprocesses and uses logger.info()/warning() to output build logs. Logger handlers write to process-wide file descriptors, not thread-local redirected streams, causing output from concurrent threads to: - Write to the wrong log files - Attempt writing to closed file handles - Result in ValueError('I/O operation on closed file.') Solution: Replace ThreadPoolExecutor with ProcessPoolExecutor to provide complete process-level isolation with separate stdout/stderr/logging per build. The additional overhead is negligible compared to Docker build time. Changes: - Import ProcessPoolExecutor instead of ThreadPoolExecutor - Move build_one_fn to module level (_build_with_logging) for pickle support - Update executor initialization to use ProcessPoolExecutor - Add explanatory comments about isolation requirements Co-authored-by: openhands <[email protected]> * Improve Docker image tagging for reproducibility This commit improves the tagging system for SWE-Bench Docker images to enable better reproducibility and clarity. ## Changes ### 1. Benchmarks Build System **benchmarks/swe_bench/build_images.py:** - Added `get_sdk_commit_hash()`: Extracts 7-char SDK submodule commit hash - Added `extract_instance_id()`: Parses SWE-Bench base images to extract instance IDs - Modified `main()`: Sets SDK_VERSION_OVERRIDE env var with SDK commit hash - Modified `build_one()`: - Generates custom tags: `swebench-{instance_id}` - Disables versioned tags via `include_versioned_tag=False` ### 2. SDK Submodule Update **vendor/software-agent-sdk:** Updated to commit 77d50e61 which includes: - `SDK_VERSION_OVERRIDE` environment variable support - `include_versioned_tag` option in BuildOptions - Target-based tag suffixes (replaces `-dev` suffix) - See: OpenHands/software-agent-sdk#1088 ### 3. Documentation **TAGGING_CHANGES.md:** Comprehensive documentation explaining: - Why these changes are needed (submodule git context issues) - Tag format comparison (before/after) - Benefits (reproducibility, usability, maintainability) - Implementation details and examples ## Tag Format ### Before ``` v1.0.0_docker.io_s_swebench_s_sweb.eval.x86_64.django_1776_django-12155_tag_latest_source-minimal-dev ``` - 137 characters - Package version (non-reproducible) - Unclear `-dev` suffix ### After ``` a612c0a-swebench-django-12155-source-minimal main-swebench-django-12155-source-minimal ``` - 84 characters (39% shorter) - Exact commit hash (reproducible) - Clear target indication ## Benefits 1. **Reproducibility**: Git commit hash ensures exact SDK version tracking 2. **Clarity**: Instance ID and target clearly visible in tag 3. **Consistency**: All builds use same suffix pattern 4. **Backward Compatible**: SDK changes only apply when explicitly enabled ## Related - SDK PR: OpenHands/software-agent-sdk#1088 - Issue: Improve SWE-Bench image build workflow Co-authored-by: openhands <[email protected]> * refactor: omit target suffix for binary builds (default case) Updated SDK submodule to bc25aa0d which omits the target suffix for binary builds since it's the default/common case. This keeps tags cleaner. Tag examples: - Binary: a612c0a-swebench-django-12155 (no suffix) - Source: a612c0a-swebench-django-12155-source - Source-minimal: a612c0a-swebench-django-12155-source-minimal Updated TAGGING_CHANGES.md to reflect this behavior with updated examples showing both binary and source-minimal formats. Co-authored-by: openhands <[email protected]> * fix: update SDK to use SDK_VERSION for commit tags Updates SDK submodule to 27f37dc0 which fixes an issue where SHORT_SHA was using git info from the benchmarks repo instead of the SDK repo. Now tags correctly use the SDK commit hash when SDK_VERSION_OVERRIDE is set, ensuring proper versioning in vendored/submodule contexts. Co-authored-by: openhands <[email protected]> * refactor: remove SDK_VERSION_OVERRIDE logic SDK now automatically detects its own commit hash, so we don't need to manually extract and override it. This simplifies the build script significantly: - Removed get_sdk_commit_hash() function - Removed SDK_VERSION_OVERRIDE env var setting - Removed unused imports (subprocess, os) - Updated documentation to reflect simpler approach The SDK's _sdk_version() now automatically finds the SDK repo root and gets the commit hash directly, regardless of whether it's used as a submodule or vendored dependency. Co-authored-by: openhands <[email protected]> * chore: update SDK to commit 85e436df Update SDK submodule to include automatic SDK_VERSION detection. SDK now auto-detects its own commit hash without requiring external override, making the tagging system fully automatic. Co-authored-by: openhands <[email protected]> * update agent-sdk version * improve custom tags for swebench image * Revert "update agent-sdk version" This reverts commit 8d8ed8c. * update sha * fix: update run_infer.py to use new SDK tag format - Replace SDK_VERSION with SHORT_SHA (renamed in SDK PR #1088) - Add extract_custom_tag() function to avoid circular import - Update get_agent_server_docker_image() to use new tag format: - Binary target: {SHORT_SHA}-{custom_tag} - Other targets: {SHORT_SHA}-{custom_tag}-{target} - Aligns with SDK's git commit-based tagging strategy Co-authored-by: openhands <[email protected]> * refactor: deduplicate extract_custom_tag by importing from run_infer Remove duplicate implementation of extract_custom_tag in build_images.py and import it from run_infer.py instead. This avoids code duplication and ensures both modules use the same implementation. Co-authored-by: openhands <[email protected]> * docs: clarify SHORT_SHA source in run_infer.py Add comment explaining that SHORT_SHA is computed from the benchmarks repo's git commit (via git rev-parse HEAD in cwd), not the SDK submodule. This makes it clear that images are tagged with the benchmarks repo commit for reproducibility and traceability. Co-authored-by: openhands <[email protected]> * update sdk * refactor * remove tagging changes * bump commit * simplify build script * bump version * bump * bump * refactor build util into shared file * simplify build on the fly logic * remove targets and platform * Add automatic comment to issue #81 on successful build This adds a new step to the build-and-push workflow that: - Posts a comment to issue #81 when the build completes successfully - Includes dataset name, split, SDK version, and workflow run link - Lists all built image tags in a collapsible markdown section Co-authored-by: openhands <[email protected]> * Fix SDK URL and add workflow trigger information - Corrected SDK repository URL from All-Hands-AI/agent-sdk to OpenHands/software-agent-sdk - Added 'Triggered by' field to comment to show workflow trigger source - Updated .openhands/microagents/repo.md with correct SDK URL Co-authored-by: openhands <[email protected]> * Update .gitignore to properly allow .openhands/microagents/ Changed .openhands/ to .openhands/* so that negation patterns work correctly Co-authored-by: openhands <[email protected]> * Add error handling to skip comment when no images are built The comment step now checks if manifest.jsonl files exist and contain data before attempting to post a comment. This prevents posting comments with '0 images' when builds complete successfully but produce no output (e.g., during PR testing or when the build step is skipped). Co-authored-by: openhands <[email protected]> * Fix manifest file path detection using find command The previous check using 'builds/*/manifest.jsonl' only looked one level deep, but the actual path is 'builds/princeton-nlp/SWE-bench_Verified/test/manifest.jsonl' which is three levels deep. Using 'find' command now correctly locates manifest files at any depth within the builds directory. Tested with actual artifact from run #19182998503 containing 10 images. Co-authored-by: openhands <[email protected]> * bump sdk * increase n work and n limit * Show only one tag per image in issue comment Each image has multiple tags (base tag + detailed tag with hash). Now showing only the first (cleaner) tag per image to reduce clutter in the issue comment, making it easier to read. Co-authored-by: openhands <[email protected]> * bump sdk commit * increase to 500 limit and 32 concurrency * disable rebuild on every push * Fix workflow summary mismatch: use manifest.jsonl instead of summary.json The new builder in build_utils.py only writes manifest.jsonl, not summary.json. This commit updates the workflow to: - Remove summary.json from artifact upload path - Generate build summary from manifest.jsonl instead of summary.json - Display total/successful/failed counts and list failed builds Co-authored-by: openhands <[email protected]> * Remove redundant 'Upload build manifest' step The 'Archive build logs' step already packages the entire builds/ directory (including manifest.jsonl files) into build-logs.tar.gz, so a separate step to upload manifest.jsonl is redundant. Co-authored-by: openhands <[email protected]> * Update .github/workflows/build-swe-bench-images.yml * Update .github/workflows/build-swe-bench-images.yml --------- Co-authored-by: openhands <[email protected]>

* Add GitHub workflow for building SWE-Bench images with Blacksmith caching - Create workflow that can be manually triggered via workflow_dispatch - Integrate Blacksmith caching for faster Docker builds - Configure workflow to push images to ghcr.io/openhands/eval-agent-server - Make --critic parameter optional in build_images.py for build-only usage - Fix .gitignore patterns for eval_outputs and builds directories This workflow follows Blacksmith documentation for Docker builds and allows building SWE-Bench evaluation images with configurable parameters like dataset, split, target, platforms, and concurrent workers. Closes #37 * Use Blacksmith's setup-docker-builder action for faster Docker layer caching Following the pattern from OpenHands/software-agent-sdk#990 and Blacksmith's official documentation (https://docs.blacksmith.sh/blacksmith-caching/docker-builds), this change replaces the standard docker/setup-buildx-action with useblacksmith/setup-docker-builder@v1. Key improvements: - Replaces docker/setup-buildx-action@v3 with useblacksmith/setup-docker-builder@v1 - Removes manual cache configuration (useblacksmith/cache@v6) - Blacksmith's Docker builder automatically manages Docker layer caching via NVMe-backed sticky disks - Provides 2x to 40x improvements in build times according to Blacksmith's customers - Since we only build amd64 images, we don't need the complex multi-platform matrix strategy This approach is recommended for workflows that use Docker commands directly (as opposed to using docker/build-push-action). Co-authored-by: openhands <[email protected]> * revert unneed stuff * simplify setup dependency * set eval-agent-server * fix line break * default to 10 for testing * run on all prs for debugging * Fix pyarrow build issue by forcing binary wheel installation The GitHub Actions workflow was failing because uv was trying to build pyarrow from source, which requires the Arrow C++ library and CMake. This change adds the --no-build-package pyarrow flag to force uv to use the pre-built binary wheel instead of attempting to build from source. Co-authored-by: openhands <[email protected]> * Pin Python version to 3.12 to fix pyarrow compatibility The root cause of the build failure was that uv was installing Python 3.14.0, which doesn't have binary wheels for pyarrow 21.0.0 yet. This caused uv to attempt building from source, which failed due to missing Arrow C++ libraries. Solution: Added .python-version file to pin Python to 3.12, which matches the project's target-version in pyproject.toml and has full binary wheel support for all dependencies. Co-authored-by: openhands <[email protected]> * Fix artifact upload naming to avoid invalid characters Use github.run_id instead of dataset/split names which contain slashes that are invalid in artifact names. Also added if-no-files-found: warn to provide better feedback if logs are missing. Co-authored-by: openhands <[email protected]> * Fix artifact upload by archiving logs to avoid invalid filename characters GitHub Actions artifact upload doesn't allow colons in filenames, but our log paths contain colons from Docker image tags (e.g., 'django-11999:latest'). Archive the entire builds directory into a tar.gz before upload to work around this restriction. Co-authored-by: openhands <[email protected]> * Fix Docker cache tag length exceeding 128 character limit Docker image tags have a maximum length of 128 characters. When building SWE-Bench images with long base image names (e.g., scikit-learn), the generated cache tags exceed this limit and cause build failures with: 'ERROR: failed to configure registry cache exporter: invalid reference format' Solution: Apply a patch to vendor/software-agent-sdk that hashes the base_image_slug when it would cause the final tag to exceed 128 characters. Uses SHA256 hash (first 12 chars) to create a shorter unique identifier while maintaining cache efficiency. The patch is applied during the workflow setup before installing dependencies. Co-authored-by: openhands <[email protected]> * Update patch with pre-commit formatting fixes Updated the patch to match the formatting requirements from ruff and other pre-commit checks. This ensures the patch applies cleanly and passes all linting/formatting checks. Co-authored-by: openhands <[email protected]> * checkout to v1.0.0 of sdk * update uv.lock * Revert "Fix Docker cache tag length exceeding 128 character limit" This reverts commit 3ba1e46. * Fix log file mixing issue by using ProcessPoolExecutor The build workflow was experiencing log file corruption and I/O errors due to concurrent builds writing to the wrong log files. This was caused by using ThreadPoolExecutor with contextlib.redirect_stdout/stderr, which only provides thread-local redirection of Python-level writes. The SDK's build() function spawns subprocesses and uses logger.info()/warning() to output build logs. Logger handlers write to process-wide file descriptors, not thread-local redirected streams, causing output from concurrent threads to: - Write to the wrong log files - Attempt writing to closed file handles - Result in ValueError('I/O operation on closed file.') Solution: Replace ThreadPoolExecutor with ProcessPoolExecutor to provide complete process-level isolation with separate stdout/stderr/logging per build. The additional overhead is negligible compared to Docker build time. Changes: - Import ProcessPoolExecutor instead of ThreadPoolExecutor - Move build_one_fn to module level (_build_with_logging) for pickle support - Update executor initialization to use ProcessPoolExecutor - Add explanatory comments about isolation requirements Co-authored-by: openhands <[email protected]> * Improve Docker image tagging for reproducibility This commit improves the tagging system for SWE-Bench Docker images to enable better reproducibility and clarity. ## Changes ### 1. Benchmarks Build System **benchmarks/swe_bench/build_images.py:** - Added `get_sdk_commit_hash()`: Extracts 7-char SDK submodule commit hash - Added `extract_instance_id()`: Parses SWE-Bench base images to extract instance IDs - Modified `main()`: Sets SDK_VERSION_OVERRIDE env var with SDK commit hash - Modified `build_one()`: - Generates custom tags: `swebench-{instance_id}` - Disables versioned tags via `include_versioned_tag=False` ### 2. SDK Submodule Update **vendor/software-agent-sdk:** Updated to commit 77d50e61 which includes: - `SDK_VERSION_OVERRIDE` environment variable support - `include_versioned_tag` option in BuildOptions - Target-based tag suffixes (replaces `-dev` suffix) - See: OpenHands/software-agent-sdk#1088 ### 3. Documentation **TAGGING_CHANGES.md:** Comprehensive documentation explaining: - Why these changes are needed (submodule git context issues) - Tag format comparison (before/after) - Benefits (reproducibility, usability, maintainability) - Implementation details and examples ## Tag Format ### Before ``` v1.0.0_docker.io_s_swebench_s_sweb.eval.x86_64.django_1776_django-12155_tag_latest_source-minimal-dev ``` - 137 characters - Package version (non-reproducible) - Unclear `-dev` suffix ### After ``` a612c0a-swebench-django-12155-source-minimal main-swebench-django-12155-source-minimal ``` - 84 characters (39% shorter) - Exact commit hash (reproducible) - Clear target indication ## Benefits 1. **Reproducibility**: Git commit hash ensures exact SDK version tracking 2. **Clarity**: Instance ID and target clearly visible in tag 3. **Consistency**: All builds use same suffix pattern 4. **Backward Compatible**: SDK changes only apply when explicitly enabled ## Related - SDK PR: OpenHands/software-agent-sdk#1088 - Issue: Improve SWE-Bench image build workflow Co-authored-by: openhands <[email protected]> * refactor: omit target suffix for binary builds (default case) Updated SDK submodule to bc25aa0d which omits the target suffix for binary builds since it's the default/common case. This keeps tags cleaner. Tag examples: - Binary: a612c0a-swebench-django-12155 (no suffix) - Source: a612c0a-swebench-django-12155-source - Source-minimal: a612c0a-swebench-django-12155-source-minimal Updated TAGGING_CHANGES.md to reflect this behavior with updated examples showing both binary and source-minimal formats. Co-authored-by: openhands <[email protected]> * fix: update SDK to use SDK_VERSION for commit tags Updates SDK submodule to 27f37dc0 which fixes an issue where SHORT_SHA was using git info from the benchmarks repo instead of the SDK repo. Now tags correctly use the SDK commit hash when SDK_VERSION_OVERRIDE is set, ensuring proper versioning in vendored/submodule contexts. Co-authored-by: openhands <[email protected]> * refactor: remove SDK_VERSION_OVERRIDE logic SDK now automatically detects its own commit hash, so we don't need to manually extract and override it. This simplifies the build script significantly: - Removed get_sdk_commit_hash() function - Removed SDK_VERSION_OVERRIDE env var setting - Removed unused imports (subprocess, os) - Updated documentation to reflect simpler approach The SDK's _sdk_version() now automatically finds the SDK repo root and gets the commit hash directly, regardless of whether it's used as a submodule or vendored dependency. Co-authored-by: openhands <[email protected]> * chore: update SDK to commit 85e436df Update SDK submodule to include automatic SDK_VERSION detection. SDK now auto-detects its own commit hash without requiring external override, making the tagging system fully automatic. Co-authored-by: openhands <[email protected]> * update agent-sdk version * improve custom tags for swebench image * Revert "update agent-sdk version" This reverts commit 8d8ed8c. * update sha * fix: update run_infer.py to use new SDK tag format - Replace SDK_VERSION with SHORT_SHA (renamed in SDK PR #1088) - Add extract_custom_tag() function to avoid circular import - Update get_agent_server_docker_image() to use new tag format: - Binary target: {SHORT_SHA}-{custom_tag} - Other targets: {SHORT_SHA}-{custom_tag}-{target} - Aligns with SDK's git commit-based tagging strategy Co-authored-by: openhands <[email protected]> * refactor: deduplicate extract_custom_tag by importing from run_infer Remove duplicate implementation of extract_custom_tag in build_images.py and import it from run_infer.py instead. This avoids code duplication and ensures both modules use the same implementation. Co-authored-by: openhands <[email protected]> * docs: clarify SHORT_SHA source in run_infer.py Add comment explaining that SHORT_SHA is computed from the benchmarks repo's git commit (via git rev-parse HEAD in cwd), not the SDK submodule. This makes it clear that images are tagged with the benchmarks repo commit for reproducibility and traceability. Co-authored-by: openhands <[email protected]> * update sdk * refactor * remove tagging changes * bump commit * simplify build script * bump version * bump * bump * refactor build util into shared file * simplify build on the fly logic * remove targets and platform * Add automatic comment to issue #81 on successful build This adds a new step to the build-and-push workflow that: - Posts a comment to issue #81 when the build completes successfully - Includes dataset name, split, SDK version, and workflow run link - Lists all built image tags in a collapsible markdown section Co-authored-by: openhands <[email protected]> * Fix SDK URL and add workflow trigger information - Corrected SDK repository URL from All-Hands-AI/agent-sdk to OpenHands/software-agent-sdk - Added 'Triggered by' field to comment to show workflow trigger source - Updated .openhands/microagents/repo.md with correct SDK URL Co-authored-by: openhands <[email protected]> * Update .gitignore to properly allow .openhands/microagents/ Changed .openhands/ to .openhands/* so that negation patterns work correctly Co-authored-by: openhands <[email protected]> * Add error handling to skip comment when no images are built The comment step now checks if manifest.jsonl files exist and contain data before attempting to post a comment. This prevents posting comments with '0 images' when builds complete successfully but produce no output (e.g., during PR testing or when the build step is skipped). Co-authored-by: openhands <[email protected]> * Fix manifest file path detection using find command The previous check using 'builds/*/manifest.jsonl' only looked one level deep, but the actual path is 'builds/princeton-nlp/SWE-bench_Verified/test/manifest.jsonl' which is three levels deep. Using 'find' command now correctly locates manifest files at any depth within the builds directory. Tested with actual artifact from run #19182998503 containing 10 images. Co-authored-by: openhands <[email protected]> * bump sdk * increase n work and n limit * Show only one tag per image in issue comment Each image has multiple tags (base tag + detailed tag with hash). Now showing only the first (cleaner) tag per image to reduce clutter in the issue comment, making it easier to read. Co-authored-by: openhands <[email protected]> * bump sdk commit * increase to 500 limit and 32 concurrency * disable rebuild on every push * Fix workflow summary mismatch: use manifest.jsonl instead of summary.json The new builder in build_utils.py only writes manifest.jsonl, not summary.json. This commit updates the workflow to: - Remove summary.json from artifact upload path - Generate build summary from manifest.jsonl instead of summary.json - Display total/successful/failed counts and list failed builds Co-authored-by: openhands <[email protected]> * Remove redundant 'Upload build manifest' step The 'Archive build logs' step already packages the entire builds/ directory (including manifest.jsonl files) into build-logs.tar.gz, so a separate step to upload manifest.jsonl is redundant. Co-authored-by: openhands <[email protected]> * bump sdk to v1.1 * support remote runtime & bump ver again * fix target type * bump sdk * check image exists before launching remote runtime job * trying fixing docker build trigger * fix typo * tweak * tweak * drop default * sleep after failure * check target image existence before build --------- Co-authored-by: openhands <[email protected]>

…ture, and proper Ctrl+C cleanup (#93) * Add GitHub workflow for building SWE-Bench images with Blacksmith caching - Create workflow that can be manually triggered via workflow_dispatch - Integrate Blacksmith caching for faster Docker builds - Configure workflow to push images to ghcr.io/openhands/eval-agent-server - Make --critic parameter optional in build_images.py for build-only usage - Fix .gitignore patterns for eval_outputs and builds directories This workflow follows Blacksmith documentation for Docker builds and allows building SWE-Bench evaluation images with configurable parameters like dataset, split, target, platforms, and concurrent workers. Closes #37 * Use Blacksmith's setup-docker-builder action for faster Docker layer caching Following the pattern from OpenHands/software-agent-sdk#990 and Blacksmith's official documentation (https://docs.blacksmith.sh/blacksmith-caching/docker-builds), this change replaces the standard docker/setup-buildx-action with useblacksmith/setup-docker-builder@v1. Key improvements: - Replaces docker/setup-buildx-action@v3 with useblacksmith/setup-docker-builder@v1 - Removes manual cache configuration (useblacksmith/cache@v6) - Blacksmith's Docker builder automatically manages Docker layer caching via NVMe-backed sticky disks - Provides 2x to 40x improvements in build times according to Blacksmith's customers - Since we only build amd64 images, we don't need the complex multi-platform matrix strategy This approach is recommended for workflows that use Docker commands directly (as opposed to using docker/build-push-action). Co-authored-by: openhands <[email protected]> * revert unneed stuff * simplify setup dependency * set eval-agent-server * fix line break * default to 10 for testing * run on all prs for debugging * Fix pyarrow build issue by forcing binary wheel installation The GitHub Actions workflow was failing because uv was trying to build pyarrow from source, which requires the Arrow C++ library and CMake. This change adds the --no-build-package pyarrow flag to force uv to use the pre-built binary wheel instead of attempting to build from source. Co-authored-by: openhands <[email protected]> * Pin Python version to 3.12 to fix pyarrow compatibility The root cause of the build failure was that uv was installing Python 3.14.0, which doesn't have binary wheels for pyarrow 21.0.0 yet. This caused uv to attempt building from source, which failed due to missing Arrow C++ libraries. Solution: Added .python-version file to pin Python to 3.12, which matches the project's target-version in pyproject.toml and has full binary wheel support for all dependencies. Co-authored-by: openhands <[email protected]> * Fix artifact upload naming to avoid invalid characters Use github.run_id instead of dataset/split names which contain slashes that are invalid in artifact names. Also added if-no-files-found: warn to provide better feedback if logs are missing. Co-authored-by: openhands <[email protected]> * Fix artifact upload by archiving logs to avoid invalid filename characters GitHub Actions artifact upload doesn't allow colons in filenames, but our log paths contain colons from Docker image tags (e.g., 'django-11999:latest'). Archive the entire builds directory into a tar.gz before upload to work around this restriction. Co-authored-by: openhands <[email protected]> * Fix Docker cache tag length exceeding 128 character limit Docker image tags have a maximum length of 128 characters. When building SWE-Bench images with long base image names (e.g., scikit-learn), the generated cache tags exceed this limit and cause build failures with: 'ERROR: failed to configure registry cache exporter: invalid reference format' Solution: Apply a patch to vendor/software-agent-sdk that hashes the base_image_slug when it would cause the final tag to exceed 128 characters. Uses SHA256 hash (first 12 chars) to create a shorter unique identifier while maintaining cache efficiency. The patch is applied during the workflow setup before installing dependencies. Co-authored-by: openhands <[email protected]> * Update patch with pre-commit formatting fixes Updated the patch to match the formatting requirements from ruff and other pre-commit checks. This ensures the patch applies cleanly and passes all linting/formatting checks. Co-authored-by: openhands <[email protected]> * checkout to v1.0.0 of sdk * update uv.lock * Revert "Fix Docker cache tag length exceeding 128 character limit" This reverts commit 3ba1e46. * Fix log file mixing issue by using ProcessPoolExecutor The build workflow was experiencing log file corruption and I/O errors due to concurrent builds writing to the wrong log files. This was caused by using ThreadPoolExecutor with contextlib.redirect_stdout/stderr, which only provides thread-local redirection of Python-level writes. The SDK's build() function spawns subprocesses and uses logger.info()/warning() to output build logs. Logger handlers write to process-wide file descriptors, not thread-local redirected streams, causing output from concurrent threads to: - Write to the wrong log files - Attempt writing to closed file handles - Result in ValueError('I/O operation on closed file.') Solution: Replace ThreadPoolExecutor with ProcessPoolExecutor to provide complete process-level isolation with separate stdout/stderr/logging per build. The additional overhead is negligible compared to Docker build time. Changes: - Import ProcessPoolExecutor instead of ThreadPoolExecutor - Move build_one_fn to module level (_build_with_logging) for pickle support - Update executor initialization to use ProcessPoolExecutor - Add explanatory comments about isolation requirements Co-authored-by: openhands <[email protected]> * Improve Docker image tagging for reproducibility This commit improves the tagging system for SWE-Bench Docker images to enable better reproducibility and clarity. ## Changes ### 1. Benchmarks Build System **benchmarks/swe_bench/build_images.py:** - Added `get_sdk_commit_hash()`: Extracts 7-char SDK submodule commit hash - Added `extract_instance_id()`: Parses SWE-Bench base images to extract instance IDs - Modified `main()`: Sets SDK_VERSION_OVERRIDE env var with SDK commit hash - Modified `build_one()`: - Generates custom tags: `swebench-{instance_id}` - Disables versioned tags via `include_versioned_tag=False` ### 2. SDK Submodule Update **vendor/software-agent-sdk:** Updated to commit 77d50e61 which includes: - `SDK_VERSION_OVERRIDE` environment variable support - `include_versioned_tag` option in BuildOptions - Target-based tag suffixes (replaces `-dev` suffix) - See: OpenHands/software-agent-sdk#1088 ### 3. Documentation **TAGGING_CHANGES.md:** Comprehensive documentation explaining: - Why these changes are needed (submodule git context issues) - Tag format comparison (before/after) - Benefits (reproducibility, usability, maintainability) - Implementation details and examples ## Tag Format ### Before ``` v1.0.0_docker.io_s_swebench_s_sweb.eval.x86_64.django_1776_django-12155_tag_latest_source-minimal-dev ``` - 137 characters - Package version (non-reproducible) - Unclear `-dev` suffix ### After ``` a612c0a-swebench-django-12155-source-minimal main-swebench-django-12155-source-minimal ``` - 84 characters (39% shorter) - Exact commit hash (reproducible) - Clear target indication ## Benefits 1. **Reproducibility**: Git commit hash ensures exact SDK version tracking 2. **Clarity**: Instance ID and target clearly visible in tag 3. **Consistency**: All builds use same suffix pattern 4. **Backward Compatible**: SDK changes only apply when explicitly enabled ## Related - SDK PR: OpenHands/software-agent-sdk#1088 - Issue: Improve SWE-Bench image build workflow Co-authored-by: openhands <[email protected]> * refactor: omit target suffix for binary builds (default case) Updated SDK submodule to bc25aa0d which omits the target suffix for binary builds since it's the default/common case. This keeps tags cleaner. Tag examples: - Binary: a612c0a-swebench-django-12155 (no suffix) - Source: a612c0a-swebench-django-12155-source - Source-minimal: a612c0a-swebench-django-12155-source-minimal Updated TAGGING_CHANGES.md to reflect this behavior with updated examples showing both binary and source-minimal formats. Co-authored-by: openhands <[email protected]> * fix: update SDK to use SDK_VERSION for commit tags Updates SDK submodule to 27f37dc0 which fixes an issue where SHORT_SHA was using git info from the benchmarks repo instead of the SDK repo. Now tags correctly use the SDK commit hash when SDK_VERSION_OVERRIDE is set, ensuring proper versioning in vendored/submodule contexts. Co-authored-by: openhands <[email protected]> * refactor: remove SDK_VERSION_OVERRIDE logic SDK now automatically detects its own commit hash, so we don't need to manually extract and override it. This simplifies the build script significantly: - Removed get_sdk_commit_hash() function - Removed SDK_VERSION_OVERRIDE env var setting - Removed unused imports (subprocess, os) - Updated documentation to reflect simpler approach The SDK's _sdk_version() now automatically finds the SDK repo root and gets the commit hash directly, regardless of whether it's used as a submodule or vendored dependency. Co-authored-by: openhands <[email protected]> * chore: update SDK to commit 85e436df Update SDK submodule to include automatic SDK_VERSION detection. SDK now auto-detects its own commit hash without requiring external override, making the tagging system fully automatic. Co-authored-by: openhands <[email protected]> * update agent-sdk version * improve custom tags for swebench image * Revert "update agent-sdk version" This reverts commit 8d8ed8c. * update sha * fix: update run_infer.py to use new SDK tag format - Replace SDK_VERSION with SHORT_SHA (renamed in SDK PR #1088) - Add extract_custom_tag() function to avoid circular import - Update get_agent_server_docker_image() to use new tag format: - Binary target: {SHORT_SHA}-{custom_tag} - Other targets: {SHORT_SHA}-{custom_tag}-{target} - Aligns with SDK's git commit-based tagging strategy Co-authored-by: openhands <[email protected]> * refactor: deduplicate extract_custom_tag by importing from run_infer Remove duplicate implementation of extract_custom_tag in build_images.py and import it from run_infer.py instead. This avoids code duplication and ensures both modules use the same implementation. Co-authored-by: openhands <[email protected]> * docs: clarify SHORT_SHA source in run_infer.py Add comment explaining that SHORT_SHA is computed from the benchmarks repo's git commit (via git rev-parse HEAD in cwd), not the SDK submodule. This makes it clear that images are tagged with the benchmarks repo commit for reproducibility and traceability. Co-authored-by: openhands <[email protected]> * update sdk * refactor * remove tagging changes * bump commit * simplify build script * bump version * bump * bump * refactor build util into shared file * simplify build on the fly logic * remove targets and platform * Add automatic comment to issue #81 on successful build This adds a new step to the build-and-push workflow that: - Posts a comment to issue #81 when the build completes successfully - Includes dataset name, split, SDK version, and workflow run link - Lists all built image tags in a collapsible markdown section Co-authored-by: openhands <[email protected]> * Fix SDK URL and add workflow trigger information - Corrected SDK repository URL from All-Hands-AI/agent-sdk to OpenHands/software-agent-sdk - Added 'Triggered by' field to comment to show workflow trigger source - Updated .openhands/microagents/repo.md with correct SDK URL Co-authored-by: openhands <[email protected]> * Update .gitignore to properly allow .openhands/microagents/ Changed .openhands/ to .openhands/* so that negation patterns work correctly Co-authored-by: openhands <[email protected]> * Add error handling to skip comment when no images are built The comment step now checks if manifest.jsonl files exist and contain data before attempting to post a comment. This prevents posting comments with '0 images' when builds complete successfully but produce no output (e.g., during PR testing or when the build step is skipped). Co-authored-by: openhands <[email protected]> * Fix manifest file path detection using find command The previous check using 'builds/*/manifest.jsonl' only looked one level deep, but the actual path is 'builds/princeton-nlp/SWE-bench_Verified/test/manifest.jsonl' which is three levels deep. Using 'find' command now correctly locates manifest files at any depth within the builds directory. Tested with actual artifact from run #19182998503 containing 10 images. Co-authored-by: openhands <[email protected]> * bump sdk * increase n work and n limit * Show only one tag per image in issue comment Each image has multiple tags (base tag + detailed tag with hash). Now showing only the first (cleaner) tag per image to reduce clutter in the issue comment, making it easier to read. Co-authored-by: openhands <[email protected]> * bump sdk commit * increase to 500 limit and 32 concurrency * disable rebuild on every push * Fix workflow summary mismatch: use manifest.jsonl instead of summary.json The new builder in build_utils.py only writes manifest.jsonl, not summary.json. This commit updates the workflow to: - Remove summary.json from artifact upload path - Generate build summary from manifest.jsonl instead of summary.json - Display total/successful/failed counts and list failed builds Co-authored-by: openhands <[email protected]> * Remove redundant 'Upload build manifest' step The 'Archive build logs' step already packages the entire builds/ directory (including manifest.jsonl files) into build-logs.tar.gz, so a separate step to upload manifest.jsonl is redundant. Co-authored-by: openhands <[email protected]> * bump sdk to v1.1 * support remote runtime & bump ver again * fix target type * bump sdk * check image exists before launching remote runtime job * trying fixing docker build trigger * fix typo * tweak * tweak * drop default * sleep after failure * check target image existence before build * misc improvements * Improve multiprocessing logging and add metadata persistence - Replace PID-based worker logs with instance-specific logs - Add reset_logger_for_multiprocessing() for cleaner per-instance logging - Logs saved to <eval_output_dir>/logs/instance_{instance_id}.log - Single INFO message to console with tail hint, then WARNING+ only - Remove redundant _child_init() function - Auto-save metadata.json to eval_output_dir on initialization Co-authored-by: openhands <[email protected]> * Add stdout/stderr redirection for instance-specific logging - Add redirect_stdout_stderr() context manager to capture all output - Redirect stdout/stderr in _process_one_mp to instance log files - Captures SDK visualizations, print statements, and all console output - All output now goes to instance_{instance_id}.log files Co-authored-by: openhands <[email protected]> * Add proper KeyboardInterrupt handling with process cleanup - Explicitly terminate worker processes on KeyboardInterrupt - Access ProcessPoolExecutor._processes to call terminate() on each worker - Add comprehensive test suite to validate cleanup behavior - Test both normal interrupt and immediate interrupt scenarios Tests verify: - All worker processes are terminated after Ctrl+C - No zombie processes remain - Cleanup works even with early interrupts Co-authored-by: openhands <[email protected]> * rename output file * Refactor: simplify pool cleanup with helper method Extract repetitive pool cleanup logic into _cleanup_pool() helper method. Before: - 54 lines of try-except-except-else-finally with duplicated cleanup code - Inline repetition of cancel/terminate/shutdown logic in each handler After: - 32 lines with clean try-except-except structure - Single _cleanup_pool() method handles all cleanup logic - More maintainable and easier to read Tests still pass ✅ Co-authored-by: openhands <[email protected]> --------- Co-authored-by: openhands <[email protected]>

xingyaoww mentioned this pull request Nov 6, 2025

Add GitHub workflow to auto-build SWE-Bench images and push to GHCR OpenHands/benchmarks#51

Merged

xingyaoww marked this pull request as ready for review November 6, 2025 20:58

xingyaoww and others added 3 commits November 6, 2025 15:58

Merge branch 'main' into feat/improve-docker-image-tagging

1697a65

openhands-agent added 3 commits November 6, 2025 23:12

revery pupi file

be35fad

xingyaoww changed the title ~~feat: improve Docker image tagging for reproducibility~~ refactor: git SHA-based Docker tags by default, versioned tags only on releases Nov 7, 2025

xingyaoww added 2 commits November 7, 2025 10:15

Merge branch 'main' into feat/improve-docker-image-tagging

1fa2a0f

refactor base slug system

d2b037c

xingyaoww commented Nov 7, 2025

View reviewed changes

neubig self-requested a review November 7, 2025 15:52

xingyaoww requested review from enyst, juanmichelini and simonrosenberg November 7, 2025 15:52

neubig approved these changes Nov 7, 2025

View reviewed changes

.github/workflows/pypi-release.yml Outdated Show resolved Hide resolved

xingyaoww merged commit 204d3a4 into main Nov 7, 2025
24 checks passed

xingyaoww deleted the feat/improve-docker-image-tagging branch November 7, 2025 16:53

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

refactor: git SHA-based Docker tags by default, versioned tags only on releases #1088

refactor: git SHA-based Docker tags by default, versioned tags only on releases #1088

Uh oh!

xingyaoww commented Nov 6, 2025 •

edited by github-actions bot

Loading

Uh oh!

openhands-ai bot commented Nov 6, 2025

Uh oh!

xingyaoww commented Nov 6, 2025

Uh oh!

openhands-ai bot commented Nov 6, 2025

Uh oh!

openhands-ai bot commented Nov 6, 2025

Uh oh!

github-actions bot commented Nov 6, 2025 •

edited

Loading

Uh oh!

xingyaoww Nov 7, 2025

Uh oh!

openhands-ai bot Nov 7, 2025

Uh oh!

openhands-ai bot Nov 7, 2025

Uh oh!

neubig left a comment

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

refactor: git SHA-based Docker tags by default, versioned tags only on releases #1088

refactor: git SHA-based Docker tags by default, versioned tags only on releases #1088

Uh oh!

Conversation

xingyaoww commented Nov 6, 2025 • edited by github-actions bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Overview

Key Changes

1. Git SHA-Based Tagging (Default Behavior)

2. Semantic Version Tags Only on Git Tag Releases

3. SDK_SHA and SDK_REF Environment Variable Overrides

4. Target Suffix Simplification

Workflow Changes

.github/workflows/server.yml

.github/workflows/pypi-release.yml

API Changes

build.py Changes

Example Tag Outputs

Regular PR/Push (SHA-based only)

Main Branch Push (SHA + main tags)

Git Tag Release (SHA + versioned tags)

Non-binary Target (with suffix)

Benefits of This Approach

Backward Compatibility

Testing

Migration Guide

For SDK Repository

For Downstream Projects (Submodule Use Case)

Related

Visual Explanation

Regular Build Flow (PR/Main)

Release Build Flow (Git Tag)

Submodule Override Flow

Uh oh!

openhands-ai bot commented Nov 6, 2025

Uh oh!

xingyaoww commented Nov 6, 2025

Uh oh!

openhands-ai bot commented Nov 6, 2025

Uh oh!

openhands-ai bot commented Nov 6, 2025

Summary of Changes

✅ Changes Completed

1. build.py Refactoring (openhands-agent-server/openhands/agent_server/docker/build.py)

2. server.yml Workflow (.github/workflows/server.yml)

3. pypi-release.yml Workflow (.github/workflows/pypi-release.yml)

🎯 Key Benefits

📊 Image Tag Examples

Uh oh!

github-actions bot commented Nov 6, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

xingyaoww Nov 7, 2025

Choose a reason for hiding this comment

Uh oh!

openhands-ai bot Nov 7, 2025

Choose a reason for hiding this comment

Uh oh!

openhands-ai bot Nov 7, 2025

Choose a reason for hiding this comment

Review Complete: All Tests Match Implementation ✅

Summary

Test Coverage Validation

Quality Checks Passed

Key Findings

Uh oh!

neubig left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

xingyaoww commented Nov 6, 2025 •

edited by github-actions bot

Loading

`.github/workflows/server.yml`

`.github/workflows/pypi-release.yml`

`build.py` Changes

1. build.py Refactoring (`openhands-agent-server/openhands/agent_server/docker/build.py`)

2. server.yml Workflow (`.github/workflows/server.yml`)

3. pypi-release.yml Workflow (`.github/workflows/pypi-release.yml`)

github-actions bot commented Nov 6, 2025 •

edited

Loading