Description
Problem Statement
When using ToolSimulator to test agent resilience (not just workflow correctness), there's no way to intercept tool calls before or after the LLM generates a response. This makes it impossible to programmatically inject faults like rate limits, timeouts, or partial outages.
We tried several approaches:
- Putting failure instructions in the state description ("return a 429 error on call number 4"). The LLM ignores this consistently due to its helpfulness bias, even with strong prompting. Even when the state explicitly says "MUST return error" and "if you skip the error the test fails," the LLM produces a successful response. This is also unreliable due to LLM non-determinism: on rare occasions it might comply, but you can't depend on it for repeatable testing.
- Wrapping the function with a decorator before registration. This doesn't work because the simulator only uses the function for its name, signature, and docstring. At call time, `_create_tool_wrapper` replaces the function entirely with LLM inference, so the decorator never executes.
Additionally, when tools are called concurrently by the agent, there's no way to coordinate fault behavior across parallel calls using the state-based approach. The LLM generating each tool response operates independently and has no awareness of what other concurrent calls are doing or returning.
The only working approach is subclassing `ToolSimulator` and overriding `_call_tool`, which couples the test code to a private method.
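For reference, the subclass workaround looks roughly like the sketch below. The `ToolSimulator` stub and the `_call_tool` signature here are illustrative stand-ins, not the SDK's real API:

```python
import random

class ToolSimulator:
    """Minimal stand-in for the SDK class (illustration only)."""
    def _call_tool(self, tool_name, parameters):
        return {"result": f"simulated {tool_name} output"}

class FaultInjectingSimulator(ToolSimulator):
    """Workaround: override the private _call_tool to inject faults."""
    def __init__(self, fault_rate=0.3, seed=None):
        self.fault_rate = fault_rate
        self._rng = random.Random(seed)

    def _call_tool(self, tool_name, parameters):
        # Short-circuit with a fault before simulation would run.
        if self._rng.random() < self.fault_rate:
            return {"error": {"code": "QuotaExceeded", "retryAfterSeconds": 2}}
        return super()._call_tool(tool_name, parameters)
```

This works, but any rename or refactor of `_call_tool` silently breaks the test suite, which is the coupling the proposal below avoids.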
Proposed Solution
Add optional `pre_call_hook` and `post_call_hook` parameters to `ToolSimulator.__init__`:
```python
simulator = ToolSimulator(
    model=model,
    pre_call_hook=my_fault_injector,
    post_call_hook=my_response_modifier,
)
```

The `pre_call_hook` is called before the LLM generates a response. It receives the tool name, parameters, state key, and previous call history. If it returns a non-None dict, that dict is returned as the tool response, short-circuiting the LLM call. If it returns None, normal simulation proceeds. The fault response should still be cached via `state_registry.cache_tool_call` so subsequent calls see the failure in their context.
```python
import random

def my_fault_injector(tool_name, parameters, state_key, previous_calls):
    if random.random() < 0.3:
        return {"error": {"code": "QuotaExceeded", "retryAfterSeconds": 2}}
    return None
```

The `post_call_hook` is called after the LLM generates a response but before it is cached. It receives the same context plus the response dict, and returns a (possibly modified) response.
```python
def my_response_modifier(tool_name, parameters, state_key, response):
    response["_simulated_latency_ms"] = random.randint(50, 500)
    return response
```

This is a small change (two optional parameters on `__init__` and a few lines in `_call_tool`), fully backward compatible, and follows the same hook pattern the SDK already uses at the agent level.
Use Case
- Rate limiting: randomly return 429 errors to test agent retry logic
- Partial failures: specific tools fail intermittently while others work
- Timeouts: hard cutoff after N calls to test graceful degradation
- Response modification: inject latency metadata, add missing fields, corrupt responses for robustness testing
- Chaos testing: compose multiple fault types to simulate degraded API conditions
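Cross-call coordination and hard cutoffs can both be expressed with a single stateful pre-call hook (hook signature as proposed above; `make_outage_hook` is a hypothetical helper):

```python
import threading

def make_outage_hook(fail_after=4):
    """Pre-call hook factory: every tool call fails once the shared call
    counter exceeds fail_after. The lock coordinates concurrent calls --
    something the state-based LLM approach cannot do."""
    lock = threading.Lock()
    count = {"n": 0}

    def hook(tool_name, parameters, state_key, previous_calls):
        with lock:
            count["n"] += 1
            if count["n"] > fail_after:
                return {"error": {"code": "ServiceUnavailable", "tool": tool_name}}
        return None  # before the cutoff, normal simulation proceeds

    return hook
```

Composing fault types is then just chaining such hooks and returning the first non-None result.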
Alternative Solutions
- State-based LLM instructions: unreliable, LLM ignores error instructions
- Subclassing `ToolSimulator` and overriding `_call_tool`: works, but couples to a private method
Additional Context
Related to issue #114 (Chaos/Resiliency Evaluation of Agents). The hooks proposed here would provide the low-level extension point that a higher-level chaos testing library (as described in #114) could build on.
We have a working implementation using the subclass approach and are happy to contribute a PR for the hooks.