Conversation


@simonrosenberg simonrosenberg commented Nov 4, 2025

This PR adds a Tool Pydantic validator that handles cases where LLMs (like GLM-4) return array/object values
as JSON strings instead of native JSON arrays/objects, e.g.
`<parameter=view_range>"[1, 100]"` instead of
`<parameter=view_range>[1, 100]`.


Agent Server images for this PR

GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server

Variants & Base Images

| Variant | Architectures | Base Image | Docs / Tags |
|---------|---------------|------------|-------------|
| java | amd64, arm64 | eclipse-temurin:17-jdk | Link |
| python | amd64, arm64 | nikolaik/python-nodejs:python3.12-nodejs22 | Link |
| golang | amd64, arm64 | golang:1.21-bookworm | Link |

Pull (multi-arch manifest)

```shell
# Each variant is a multi-arch manifest supporting both amd64 and arm64
docker pull ghcr.io/openhands/agent-server:3444f22-python
```

Run

```shell
docker run -it --rm \
  -p 8000:8000 \
  --name agent-server-3444f22-python \
  ghcr.io/openhands/agent-server:3444f22-python
```

All tags pushed for this build

ghcr.io/openhands/agent-server:3444f22-golang-amd64
ghcr.io/openhands/agent-server:v1.0.0a5_golang_tag_1.21-bookworm_binary-amd64
ghcr.io/openhands/agent-server:3444f22-golang-arm64
ghcr.io/openhands/agent-server:v1.0.0a5_golang_tag_1.21-bookworm_binary-arm64
ghcr.io/openhands/agent-server:3444f22-java-amd64
ghcr.io/openhands/agent-server:v1.0.0a5_eclipse-temurin_tag_17-jdk_binary-amd64
ghcr.io/openhands/agent-server:3444f22-java-arm64
ghcr.io/openhands/agent-server:v1.0.0a5_eclipse-temurin_tag_17-jdk_binary-arm64
ghcr.io/openhands/agent-server:3444f22-python-amd64
ghcr.io/openhands/agent-server:v1.0.0a5_nikolaik_s_python-nodejs_tag_python3.12-nodejs22_binary-amd64
ghcr.io/openhands/agent-server:3444f22-python-arm64
ghcr.io/openhands/agent-server:v1.0.0a5_nikolaik_s_python-nodejs_tag_python3.12-nodejs22_binary-arm64
ghcr.io/openhands/agent-server:3444f22-golang
ghcr.io/openhands/agent-server:3444f22-java
ghcr.io/openhands/agent-server:3444f22-python

About Multi-Architecture Support

  • Each variant tag (e.g., 3444f22-python) is a multi-arch manifest supporting both amd64 and arm64
  • Docker automatically pulls the correct architecture for your platform
  • Individual architecture tags (e.g., 3444f22-python-amd64) are also available if needed

@simonrosenberg simonrosenberg marked this pull request as ready for review November 4, 2025 10:32
@simonrosenberg
Collaborator Author

@OpenHands /codereview-roasted read the diff and tell me if _decode_json_strings can be improved


openhands-ai bot commented Nov 4, 2025

I'm on it! simonrosenberg can track my progress at all-hands.dev


openhands-ai bot commented Nov 4, 2025

Summary

I've completed a critical code review of the _decode_json_strings function in PR #1017 as requested. No code changes were made - this was a review-only task.

Review Summary

Overall Assessment: 🟡 Acceptable - The function solves a real problem (GLM-4 returning JSON-encoded strings instead of native arrays/objects) but has some complexity issues that could be improved.

Key Findings

Critical Issues:

  1. Lines 145-148: Fragile Union type detection - checks for typing.Union and types.UnionType separately, treating a normal case as a special case
  2. Lines 163-166: Double-checking after json.loads() - parsing JSON then immediately verifying it's a list/dict adds unnecessary overhead

Improvement Opportunities:

  1. Lines 150-158: Data flow branching on is_union creates unnecessary complexity - could handle both unions and simple types uniformly using get_args()
  2. Line 129: list(data.items()) creates unnecessary copy unless modification during iteration is a real concern
  3. Lines 129-136: Multiple early exits could be consolidated for better readability

Core Insight

The function does too much type introspection to detect unions. Instead of asking "is this a union?" and branching, it should just ask "what types could this field accept?" using get_args() uniformly. This would eliminate special-case handling and reduce the function by ~10 lines while improving clarity.

Verdict: ✅ Worth merging after addressing the Union type handling brittleness and removing the redundant isinstance check after json.loads().


openhands-agent and others added 3 commits November 4, 2025 10:41
- Remove fragile Union type detection that checked typing.Union and types.UnionType separately
- Use get_args() uniformly for both union and simple types, eliminating branching logic
- Consolidate early exit conditions into single check for better readability
- Remove unnecessary list() copy when iterating over data.items()
- Remove unused imports: types module and Union from typing
- Improve code comments to clarify json.loads() return types
- Reduce function from 46 lines to 34 lines (-26%) while maintaining functionality

This addresses code review feedback about treating normal cases as special cases
and eliminates brittleness around Python version differences in union handling.

Co-authored-by: openhands <[email protected]>
…correctly

This commit addresses two critical issues identified in code review:

1. **Annotated type handling**: The validator now properly unwraps Annotated types
   before checking for list/dict, ensuring fields like `Annotated[list[str], Field(...)]`
   are correctly decoded from JSON strings.

2. **Field alias support**: Changed from using __annotations__ to model_fields,
   which properly handles field aliases and inherited fields. The validator now
   checks both field names and their aliases when looking up values in the input data.

3. **Union type detection fix**: Fixed a critical bug where we were checking the
   origin of type arguments instead of the type itself. For example, with `list[str]`,
   we were incorrectly extracting `str` instead of `list`.

Changes:
- Import Annotated, Union, and types module
- Use cls.model_fields instead of cls.__annotations__
- Add logic to unwrap Annotated types before type checking
- Fix Union type detection to check the type's origin, not its args
- Properly handle both old-style Union and new-style (|) union syntax

The validator now correctly handles:
- `Annotated[list[str], Field(description='...')]`
- Fields with aliases: `my_list: list[int] = Field(alias='myList')`
- Union types: `list[str] | None`
- Inherited fields from parent classes

Co-authored-by: openhands <[email protected]>
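The Annotated unwrapping and alias lookup described in this commit message can be sketched like so (simplified relative to the SDK code; note that Pydantic v2's `model_fields` already strips `Annotated` metadata in most cases, so the explicit unwrap is mainly defensive):

```python
import json
from typing import Annotated, get_args, get_origin

from pydantic import BaseModel, Field, model_validator


def _unwrap_annotated(annotation):
    """Annotated[list[str], Field(...)] -> list[str]; other types unchanged."""
    if get_origin(annotation) is Annotated:
        return get_args(annotation)[0]
    return annotation


class Example(BaseModel):
    tags: Annotated[list[str], Field(description="demo")] = []
    my_list: list[int] = Field(default_factory=list, alias="myList")

    @model_validator(mode="before")
    @classmethod
    def _decode_json_strings(cls, data):
        if not isinstance(data, dict):
            return data
        # model_fields covers inherited fields and carries alias
        # information, unlike __annotations__.
        for name, field in cls.model_fields.items():
            key = name if name in data else field.alias
            if key is None or key not in data:
                continue
            value = data[key]
            annotation = _unwrap_annotated(field.annotation)
            expects = (get_origin(annotation) or annotation) in (list, dict)
            if isinstance(value, str) and expects:
                try:
                    data[key] = json.loads(value)
                except ValueError:
                    pass  # let normal field validation report the error
        return data


ex = Example.model_validate({"tags": '["a", "b"]', "myList": "[1, 2]"})
```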
@simonrosenberg
Collaborator Author

Integration tests ran locally for glm-4.6:

```json
{
  "model_name": "litellm_proxy_openrouter_z_ai_glm_4.6",
  "run_suffix": "litellm_proxy_openrouter_z_ai_glm_4.6_glm46_run_N7_20251104_121354",
  "llm_config": {
    "model": "litellm_proxy/openrouter/z-ai/glm-4.6"
  },
  "timestamp": "2025-11-04T12:15:31.274341",
  "test_instances": [
    {
      "instance_id": "t07_interactive_commands",
      "test_result": {
        "success": true,
        "reason": "Interactive Python script setup completed. Agent should execute the script with inputs 'John' and '25' and find the secret number: 707"
      },
      "cost": 0.009419649999999998,
      "error_message": null
    },
    {
      "instance_id": "t03_jupyter_write_file",
      "test_result": {
        "success": true,
        "reason": "Successfully created file with content: hello world"
      },
      "cost": 0.0092353,
      "error_message": null
    },
    {
      "instance_id": "t04_git_staging",
      "test_result": {
        "success": true,
        "reason": "Successfully committed changes with message: 'Add initial hello world script'"
      },
      "cost": 0.009107899999999999,
      "error_message": null
    },
    {
      "instance_id": "t06_github_pr_browsing",
      "test_result": {
        "success": true,
        "reason": "Agent's final answer contains information about the PR content"
      },
      "cost": 0.024863,
      "error_message": null
    },
    {
      "instance_id": "t02_add_bash_hello",
      "test_result": {
        "success": true,
        "reason": "Successfully created and executed script: hello"
      },
      "cost": 0.013697899999999999,
      "error_message": null
    },
    {
      "instance_id": "t01_fix_simple_typo",
      "test_result": {
        "success": true,
        "reason": "Successfully fixed all typos"
      },
      "cost": 0.0200777,
      "error_message": null
    },
    {
      "instance_id": "t05_simple_browsing",
      "test_result": {
        "success": true,
        "reason": "Agent successfully found the answer! Matched pattern: (?i)openhands is all you need. Response contained the expected content about OpenHands."
      },
      "cost": 0.015120499999999999,
      "error_message": null
    }
  ],
  "total_tests": 7,
  "successful_tests": 7,
  "success_rate": 1.0,
  "total_cost": 0.10152195,
  "eval_note": "glm46_run",
  "artifact_url": null,
  "status": "completed"
}
```

@simonrosenberg simonrosenberg self-assigned this Nov 4, 2025
@simonrosenberg simonrosenberg added the integration-test Runs the integration tests and comments the results label Nov 4, 2025

github-actions bot commented Nov 4, 2025

Hi! I started running the integration tests on your PR. You will receive a comment with the results shortly.


github-actions bot commented Nov 4, 2025

🧪 Integration Tests Results

Overall Success Rate: 100.0%
Total Cost: $0.78
Models Tested: 3
Timestamp: 2025-11-04 11:21:34 UTC

📁 Detailed Logs & Artifacts

Click the links below to access detailed agent/LLM logs showing the complete reasoning process for each model. On the GitHub Actions page, scroll down to the 'Artifacts' section to download the logs.

📊 Summary

| Model | Success Rate | Tests Passed | Total Tests | Cost |
|-------|--------------|--------------|-------------|------|
| litellm_proxy_deepseek_deepseek_chat | 100.0% | 7/7 | 7 | $0.03 |
| litellm_proxy_gpt_5_mini_2025_08_07 | 100.0% | 7/7 | 7 | $0.03 |
| litellm_proxy_claude_sonnet_4_5_20250929 | 100.0% | 7/7 | 7 | $0.72 |

📋 Detailed Results

litellm_proxy_deepseek_deepseek_chat

  • Success Rate: 100.0% (7/7)
  • Total Cost: $0.03
  • Run Suffix: litellm_proxy_deepseek_deepseek_chat_dd61596_deepseek_run_N7_20251104_111820

litellm_proxy_gpt_5_mini_2025_08_07

  • Success Rate: 100.0% (7/7)
  • Total Cost: $0.03
  • Run Suffix: litellm_proxy_gpt_5_mini_2025_08_07_dd61596_gpt5_mini_run_N7_20251104_111814

litellm_proxy_claude_sonnet_4_5_20250929

  • Success Rate: 100.0% (7/7)
  • Total Cost: $0.72
  • Run Suffix: litellm_proxy_claude_sonnet_4_5_20250929_dd61596_sonnet_run_N7_20251104_111820

@simonrosenberg
Collaborator Author

@neubig And the MiniMax-M2 integration tests ran locally:

```json
{
  "model_name": "anthropic_MiniMax_M2",
  "run_suffix": "anthropic_MiniMax_M2_minimax_run_v2_N7_20251104_123831",
  "llm_config": {
    "model": "anthropic/MiniMax-M2"
  },
  "timestamp": "2025-11-04T12:40:33.529421",
  "test_instances": [
    {
      "instance_id": "t07_interactive_commands",
      "test_result": {
        "success": true,
        "reason": "Interactive Python script setup completed. Agent should execute the script with inputs 'John' and '25' and find the secret number: 707"
      },
      "cost": 0.0,
      "error_message": null
    },
    {
      "instance_id": "t04_git_staging",
      "test_result": {
        "success": true,
        "reason": "Successfully committed changes with message: 'Add hello.py - initial hello world script'"
      },
      "cost": 0.0,
      "error_message": null
    },
    {
      "instance_id": "t06_github_pr_browsing",
      "test_result": {
        "success": true,
        "reason": "Agent's final answer contains information about the PR content"
      },
      "cost": 0.0,
      "error_message": null
    },
    {
      "instance_id": "t05_simple_browsing",
      "test_result": {
        "success": true,
        "reason": "Agent successfully found the answer! Matched pattern: (?i)the answer is openhands is all you need. Response contained the expected content about OpenHands."
      },
      "cost": 0.0,
      "error_message": null
    },
    {
      "instance_id": "t02_add_bash_hello",
      "test_result": {
        "success": true,
        "reason": "Successfully created and executed script: hello"
      },
      "cost": 0.0,
      "error_message": null
    },
    {
      "instance_id": "t01_fix_simple_typo",
      "test_result": {
        "success": true,
        "reason": "Successfully fixed all typos"
      },
      "cost": 0.0,
      "error_message": null
    },
    {
      "instance_id": "t03_jupyter_write_file",
      "test_result": {
        "success": true,
        "reason": "Successfully created file with content: hello world"
      },
      "cost": 0.0,
      "error_message": null
    }
  ],
  "total_tests": 7,
  "successful_tests": 7,
  "success_rate": 1.0,
  "total_cost": 0.0,
  "eval_note": "minimax_run_v2",
  "artifact_url": null,
  "status": "completed"
}
```


@neubig neubig left a comment


I agree that it'd be good to have a unit test to make sure we don't accidentally regress this later.

@simonrosenberg simonrosenberg changed the title from "Add glm 4.6 model 3" to "Add tool validator to enable tool calls for glm-4.6" Nov 4, 2025
…ator

This commit adds 16 unit tests to verify the _decode_json_strings validator
works correctly for all edge cases:

**Basic functionality:**
- JSON strings decoded to native lists/dicts
- Native lists/dicts pass through unchanged
- Regular string fields not affected by JSON decoding

**Advanced type handling:**
- Annotated types (e.g., Annotated[list[str], Field(...)])
- Field aliases (e.g., Field(alias='myList'))
- Optional/Union types (e.g., list[str] | None)
- Nested structures (e.g., list[list[int]])

**Error handling:**
- Invalid JSON strings rejected with ValidationError
- JSON strings with wrong types rejected
- Empty collections handled correctly

**Edge cases:**
- Unicode characters in JSON strings
- Extra whitespace in JSON strings
- Mixed native and JSON string values in same model

All tests pass and validate that the validator correctly handles both
JSON-encoded strings from models like GLM-4 and regular native objects
from other models.

Co-authored-by: openhands <[email protected]>
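A few of the cases listed above, sketched as standalone tests against a stand-in model (the real suite targets the SDK's Schema class; names here are hypothetical):

```python
import json

from pydantic import BaseModel, ValidationError, model_validator


class DemoAction(BaseModel):
    items: list[int] = []

    @model_validator(mode="before")
    @classmethod
    def _decode_json_strings(cls, data):
        # Simplified version of the validator under test.
        if isinstance(data, dict):
            for key, value in data.items():
                if isinstance(value, str):
                    try:
                        parsed = json.loads(value)
                    except ValueError:
                        continue
                    if isinstance(parsed, (list, dict)):
                        data[key] = parsed
        return data


def test_json_string_decoded():
    assert DemoAction.model_validate({"items": "[1, 2, 3]"}).items == [1, 2, 3]


def test_native_list_passes_through():
    assert DemoAction.model_validate({"items": [4, 5]}).items == [4, 5]


def test_wrong_json_type_rejected():
    # '"hello"' parses to a str, not a list, so normal validation rejects it.
    try:
        DemoAction.model_validate({"items": '"hello"'})
    except ValidationError:
        return
    raise AssertionError("expected ValidationError")
```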

github-actions bot commented Nov 4, 2025

Coverage

Coverage Report

| File | Stmts | Miss | Cover | Missing |
|------|-------|------|-------|---------|
| openhands-sdk/openhands/sdk/tool/schema.py | 139 | 44 | 68% | 23–30, 36–38, 148, 166–167, 186–187, 190–192, 195, 214, 227–228, 230–231, 233–236, 239, 243–245, 247–248, 250, 257, 301–305, 307–308 |
| TOTAL | 11659 | 5370 | 53% | |

@simonrosenberg simonrosenberg requested review from enyst and neubig and removed request for xingyaoww November 4, 2025 13:05
@simonrosenberg
Collaborator Author

@OpenHands /codereview-roasted read the diff. I wonder if we can do this in tool.action_from_arguments inside agent.py instead of in this low-level data structure?
Making the change here could affect other parts of the system in unintended ways.
Lay out the pros and cons of this solution vs. the alternative.


openhands-ai bot commented Nov 5, 2025

I'm on it! simonrosenberg can track my progress at all-hands.dev


openhands-ai bot commented Nov 5, 2025

Looks like there are a few issues preventing this PR from being merged!

  • GitHub Actions are failing:
    • Agent Server

If you'd like me to help, just leave a comment, like

@OpenHands please fix the failing actions on PR #1017 at branch `add-glm-4.6-model-3`

Feel free to include any additional details that might help me get this PR into a better state.



openhands-ai bot commented Nov 5, 2025

Summary

I've completed a comprehensive architectural code review of PR #1017 as requested. No code changes were made - this was a pure analysis task.

Your Question

You asked whether the JSON string decoding fix should be:

  1. Current approach: Low-level in Schema base class (tool/schema.py) via Pydantic validator
  2. Alternative approach: High-level in agent.py inside tool.action_from_arguments() as preprocessing

My Answer

The current low-level solution in schema.py is the correct architectural choice, despite your valid concern about broader scope.

Key Reasoning:

✅ Current Solution (Schema validator) PROS:

  • Single source of truth - no code duplication
  • Complete coverage: works for agent tool calls, MCP tools, tests, and direct instantiation
  • Idiomatic Pydantic pattern using @model_validator(mode="before")
  • Future-proof: new code paths automatically benefit
  • Clean separation of concerns: data validation in data layer

❌ Alternative Solution (Agent preprocessing) CONS:

  • Would require duplicating logic in multiple places (agent.py, mcp/tool.py, future integrations)
  • Incomplete coverage: MCP tools have separate action_from_arguments() override
  • Violates DRY principle
  • Easy to miss when adding new LLM integrations
  • Creates inconsistent behavior based on how Actions are instantiated

Addressing Your Concern

Your concern that "making the change here would potentially affect other part of the system in unintended ways" is valid, but mitigated by:

  1. Defensive implementation: Only acts on strings, only for list/dict fields, gracefully handles parse failures
  2. Comprehensive test coverage: 257 lines of tests covering edge cases
  3. No breaking changes: Correct code works as before; incorrect code gets auto-fixed (improvement!)
  4. Narrow condition: Regular string fields are NOT affected, only collection types

The "Good Taste" Perspective

The current solution eliminates special cases rather than creating them. Instead of scattering "LLM quirk handling" across multiple locations, it solves the data normalization problem once in the data validation layer where it belongs.


Recommendation: Keep the current implementation. The broader scope is a feature, not a bug - it ensures consistent behavior across the entire SDK.


@simonrosenberg
Collaborator Author

@xingyaoww wdyt?

@xingyaoww
Collaborator

@simonrosenberg, how about we create a helper function and use it in one place (i.e., where you json.loads the result returned by the model)? The json.loads result is then passed to action_from_arguments, so we still only need to modify one place. @OpenHands what do you think? /codereview-roasted


openhands-ai bot commented Nov 5, 2025

I'm on it! xingyaoww can track my progress at all-hands.dev


openhands-ai bot commented Nov 5, 2025

Summary

I've completed a code review analysis of PR #1017 in response to @simonrosenberg's question about whether the JSON string decoding logic should be moved from the low-level Schema._decode_json_strings() Pydantic validator to a helper function in agent.py.

My Answer: Keep the Current Pydantic Validator Approach

Key Reasoning:

  1. Type Safety is Critical: The Pydantic validator has access to field type annotations and knows exactly which fields expect list or dict types. A helper function in agent.py would be type-blind and would need to rely on fragile heuristics (like checking if a string starts with [ or {) to guess which strings should be decoded.

  2. Eliminates Edge Cases Elegantly: Following Linus's "good taste" principle, the validator eliminates the special case at the right boundary - where external untrusted data (LLM responses) becomes validated domain objects. The data structure is responsible for its own validation/normalization.

  3. No Duplication of Type Knowledge: The helper function approach would either:

    • Try to decode ALL strings (breaks legitimate string fields)
    • Use heuristics (fragile and error-prone)
    • Need to duplicate field type information (violates DRY)
  4. Safe & Defensive Implementation: The current validator:

    • Only runs on dict inputs (not existing Schema instances)
    • Only decodes strings for fields explicitly typed as list or dict
    • Silently preserves original value if JSON parsing fails
    • Benefits all Action creation paths (agent.py, tests, MCP tools)

Addressing the "Unintended Impact" Concern:

The validator only affects Schema creation from dicts, which is exactly where this LLM quirk manifests. It's actually beneficial that tests and other integrations automatically get this fix if they encounter the same issue.

Key Insight

The Pydantic validator approach is data-structure-driven validation (elegant, type-safe), while the helper function approach would be heuristic-driven transformation (fragile, duplicates type knowledge). The validator follows the principle that data structures should validate themselves at boundaries.

No code changes were made - this was a review-only analysis providing technical feedback on architectural decisions.


@enyst
Collaborator

enyst commented Nov 5, 2025

> @simonrosenberg, how about we create a helper function and use it in one place (i.e., where you json.loads the result returned by the model)? The json.loads result is then passed to action_from_arguments, so we still only need to modify one place.

@OpenHands what do you think? /codereview-roasted on this suggestion!

Didn't we have a json.loads wrapper of our own, for exactly this kind of purpose, i.e. malformed LLM calls? And nothing else; we were only using it for parsing LLM calls.

We did in V0, a looong time ago. (V0.00... 😅)


openhands-ai bot commented Nov 5, 2025

I'm on it! enyst can track my progress at all-hands.dev


openhands-ai bot commented Nov 5, 2025

Summary of my work and recommendation

  • I reviewed the diff and the current _decode_json_strings pre-validator in Schema.
  • Short answer to the question: Yes—prefer a dedicated helper used at a single boundary (right after json.loads of the model’s tool_call.arguments) rather than a Schema-level pre-validator, and definitely do not globally override json.loads.
  • Rationale:
    • Keeps the fix scoped to the LLM integration boundary (small blast radius, easier to reason about and revert).
    • Avoids coupling core Schema validation to a provider-specific quirk.
    • Prevents surprising behavior for non-LLM code paths that also instantiate Actions.

Comparison of approaches (brief):

  • Schema-level validator (current): Works everywhere but is a global behavior change; adds complexity (Union/Annotated handling) into core types and can have unintended side effects.
  • Agent boundary helper (recommended): Centralized, explicit normalization only for LLM tool calls; minimal risk to other parts of the system.
  • ToolDefinition.action_from_arguments variant: Also viable, but MCPToolDefinition overrides this method, so you must ensure both paths use the helper to avoid divergence.

Implementation sketch (if you want me to proceed):

  • Add normalize_llm_arguments(arguments, action_type) that:
    • Inspects action_type.model_fields (respect aliases, unwrap Annotated/Union).
    • If a field expects list/dict and the value is a str, attempt a single json.loads; accept only if result is list/dict.
  • Call it in Agent._get_action_event right after json.loads(tool_call.arguments) and before tool.action_from_arguments(...) (or in ToolDefinition.action_from_arguments and MCPToolDefinition.action_from_arguments as a shared utility).
  • Tests: Move/adjust existing tests away from Schema-level behavior to cover the helper via the agent/tool path.

No code changes were made; this was an analysis-only response. If you agree with the direction, I can implement the helper and refactor the call site(s).



blacksmith-sh bot commented Nov 10, 2025

[Automatic Post]: This PR seems to be currently waiting for review. @neubig, @enyst, @xingyaoww, could you please take a look when you have a chance?


Labels

integration-test Runs the integration tests and comments the results
