feat: Detect platform-side inference errors#332
statxc wants to merge 6 commits into ridgesai:main
Conversation
…d for provider failures
@camfairchild Could you please review the PR? I'd appreciate any feedback.
inference_gateway/main.py
Outdated
def is_non_halting_error(status_code: int) -> bool:
    return status_code in NON_HALTING_ERROR_CODES
Would be better termed "platform error" or something.
By "halting error" I meant one that isn't caught properly and halts the process
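For illustration, the helper under discussion might look like this after the suggested rename (a sketch only: PLATFORM_ERROR_CODES is a hypothetical name, with the status codes taken from the PR description):

```python
# Hypothetical constant: the PR description lists 500/502/503/504 as
# provider failures the agent can't control, plus -1 as a sentinel
# for internal provider errors.
PLATFORM_ERROR_CODES = {500, 502, 503, 504, -1}

def is_platform_error(status_code: int) -> bool:
    """True for provider-side failures that are not the agent's fault."""
    return status_code in PLATFORM_ERROR_CODES
```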
Looks good otherwise. Thank you
@camfairchild Thanks for your feedback. I updated the name to
…d embedding and edge case tests
@ibraheem-abe Could you please review this PR? Any feedback is welcome.
Please give me any feedback.
@statxc We want to change this so only that specific test is retried instead of the entire thing
Hey @ibraheem-abe, thanks for the feedback! I think a single-run retry is out of scope for this PR. #331 was about detecting platform-side inference errors and stopping early, and this PR addresses that. Implementing retries from our side is difficult: the platform state machine is one-way (once a run reaches …).

As an interim fix, I could update the SQL view so error 3050 doesn't fail the entire evaluation - failed runs would score 0 and the other results would be preserved. But I think the actual retry behavior is better handled in a separate PR.
…ailing entire evaluation

Based on ridgesai#332 by @statxc which detects platform-side inference errors. Changes the behavior so that when an evaluation run hits the inference error threshold, only that specific run is retried (up to 2 times) instead of marking the entire evaluation as failed.

Flow:
1. Agent finishes → validator checks /api/usage for inference errors
2. If errors >= threshold and retries remaining:
   - Reset error counter via POST /api/reset-inference-errors
   - Re-run only this specific problem (not the whole evaluation)
3. If errors >= threshold and retries exhausted:
   - Mark as PLATFORM_TOO_MANY_INFERENCE_ERRORS (3050)

New additions on top of ridgesai#332:
- ErrorHashMap.reset_inference_errors() method
- POST /api/reset-inference-errors gateway endpoint
- Retry loop in _run_evaluation_run() with MAX_SINGLE_RUN_RETRIES=2
- Tests for reset behavior

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
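The retry flow above can be sketched roughly as follows (a dependency-injected sketch: run_single_problem, get_usage, and reset_errors stand in for validator internals and are not taken from the actual diff):

```python
MAX_SINGLE_RUN_RETRIES = 2  # from the commit message

def run_with_retries(run_single_problem, get_usage, reset_errors, run_id):
    """Retry one evaluation run when it trips the inference error threshold."""
    for attempt in range(MAX_SINGLE_RUN_RETRIES + 1):
        run_single_problem(run_id)
        usage = get_usage(run_id)  # e.g. GET /api/usage
        if usage["inference_errors"] < usage["max_inference_errors"]:
            return "ok"  # under threshold: score the run normally
        if attempt < MAX_SINGLE_RUN_RETRIES:
            reset_errors(run_id)  # e.g. POST /api/reset-inference-errors
    # Retries exhausted: mark only this run as a platform error (3050),
    # leaving the rest of the evaluation intact.
    return "PLATFORM_TOO_MANY_INFERENCE_ERRORS"
```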
@ibraheem-abe Sorry to bother you again. I'd appreciate it if you could approve this PR if there are no problems.
@statxc
CryptoAardvark left a comment
Hi @statxc, hope you don't mind a drive-by review, this PR caught my eye and the design is really thoughtful, especially the separation of platform errors from agent-caused ones and the -1 sentinel for network failures. I had two small observations I wanted to share in case they're useful. Please feel free to disregard anything that doesn't fit the project's conventions. 🙏
inference_gateway/error_hash_map.py
Outdated
if now - self.last_cleanup_at > ERROR_HASH_MAP_CLEANUP_INTERVAL_SECONDS:
    self.error_hash_map = {
        k: v for k, v in self.error_hash_map.items()
        if now - v.last_accessed_at < ERROR_HASH_MAP_CLEANUP_INTERVAL_SECONDS
    }
Suggestion(non-blocking): since this map is accessed from concurrent FastAPI requests, the dict reassignment here (and the check-then-increment pattern in main.py) might benefit from a threading.Lock. Concurrent requests for the same evaluation_run_id could slip past the threshold check before blocking kicks in.
Good catch on the race condition. The fix would actually need to be an asyncio.Lock in the request handler (around the check-await-increment sequence in main.py), not a threading.Lock in the hash map.
The individual methods are already atomic in asyncio. Same pattern exists in CostHashMap for cost limits. Worst case is one extra error slipping past the threshold, so I'll leave this as-is for now.
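For reference, the pattern being weighed here, an asyncio.Lock held across the check-then-increment sequence, could look roughly like this (ErrorCounter is a toy stand-in, not the project's ErrorHashMap):

```python
import asyncio

class ErrorCounter:
    """Toy stand-in for a per-run error map behind a FastAPI handler."""
    def __init__(self, limit: int):
        self.counts: dict[str, int] = {}
        self.limit = limit
        self.lock = asyncio.Lock()  # serialises check + increment

    async def record_error(self, run_id: str) -> bool:
        # Holding the lock across the check and the increment prevents
        # two concurrent requests from both slipping past the threshold.
        async with self.lock:
            if self.counts.get(run_id, 0) >= self.limit:
                return False  # over threshold: caller would return 503
            self.counts[run_id] = self.counts.get(run_id, 0) + 1
            return True

async def main():
    counter = ErrorCounter(limit=2)
    results = await asyncio.gather(*[counter.record_error("run-1") for _ in range(5)])
    print(results.count(True))  # prints 2: only `limit` increments succeed

asyncio.run(main())
```

Without the lock, the worst case (as noted above) is one extra error slipping past the threshold, which is why leaving it as-is is a defensible trade-off.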
inference_gateway/error_hash_map.py
Outdated
self._cleanup()

if uuid in self.error_hash_map:
    self.error_hash_map[uuid].last_accessed_at = time.time()
Suggestion (non-blocking): get_inference_errors refreshes last_accessed_at on reads. If /api/usage polls this for a completed run, it could keep the entry alive past the 60s TTL. I think it might be worth moving the touch out of the get_* path so reads stay pure.
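One way this suggestion could be applied, keeping reads pure and refreshing the TTL only on writes (class and method names here are illustrative, not the project's actual ones):

```python
import time

ENTRY_TTL_SECONDS = 60  # TTL figure taken from the review comment

class _Entry:
    def __init__(self):
        self.errors = 0
        self.last_accessed_at = time.time()

class InferenceErrorMap:
    """Illustrative stand-in for the gateway's per-run error map."""
    def __init__(self):
        self.entries = {}

    def get_inference_errors(self, run_id):
        # Pure read: no last_accessed_at refresh, so polling /api/usage
        # can no longer keep an entry alive past its TTL.
        entry = self.entries.get(run_id)
        return entry.errors if entry else 0

    def increment(self, run_id):
        # Writes (and only writes) refresh the TTL.
        entry = self.entries.setdefault(run_id, _Entry())
        entry.errors += 1
        entry.last_accessed_at = time.time()
```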
…tform-inference-error-detection
@camfairchild Could you please merge this PR since it already has two approvals? |
Detect platform-side inference errors so agents aren't penalized for provider failures
Closes #331
Problem
When an AI provider goes down or returns server errors (500, 502, etc.), the agent's inference() calls return None. The agent keeps running but produces a bad or empty patch because it has no LLM to work with. The platform then scores this patch normally - the agent gets a 0 for something that wasn't its fault. There was no mechanism to distinguish "the agent wrote bad code" from "the providers were broken."
Solution
Track platform-side inference errors per evaluation run and flag the run as a platform error when the count exceeds a configurable threshold.
Platform errors are provider failures that the agent can't control:
500 Internal Server Error
502 Bad Gateway
503 Service Unavailable
504 Gateway Timeout
-1 Internal provider error

Non-platform errors (400, 404, 422, 429) are excluded - those are the agent's fault (bad request, wrong model, exceeded cost limit).
What changed
inference_gateway/error_hash_map.py (new)
New ErrorHashMap class that tracks inference error counts per evaluation_run_id, with the same auto-cleanup pattern as the existing CostHashMap.

inference_gateway/config.py
Added MAX_INFERENCE_ERRORS_PER_EVALUATION_RUN (defaults to 5 if not set in .env). Existing deployments won't break.

inference_gateway/main.py
Returns 503 once the threshold is hit. Extended /api/usage to include inference_errors and max_inference_errors. Added logger.warning() when errors are counted and when the threshold blocks a request.

models/evaluation_run.py
Added PLATFORM_TOO_MANY_INFERENCE_ERRORS = 3050 in the 3xxx platform error range.

validator/main.py
Queries /api/usage on the inference gateway with a 10s timeout. If errors exceed the limit, marks the run as a platform error (3050) instead of scoring the patch. Also wired up the extra field in EvaluationRunException handling - it was designed but never passed through. Now agent_logs are included when reporting platform errors.

tests/test_inference_error_tracking.py (new)
Covers ErrorHashMap unit behavior, platform error classification, error code validation, and integration tests against both inference and embedding gateway endpoints.

How it works end-to-end
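The validator-side decision described above can be sketched like this (fetch_usage stands in for the real HTTP call to /api/usage, which the PR says uses a 10s timeout; only the classification logic is shown):

```python
PLATFORM_TOO_MANY_INFERENCE_ERRORS = 3050  # 3xxx = platform error range

def classify_run(fetch_usage, run_id):
    """Return 3050 when the run tripped the inference error threshold,
    or None when the patch should be scored normally."""
    usage = fetch_usage(run_id)  # real code: GET /api/usage with a 10s timeout
    if usage["inference_errors"] >= usage["max_inference_errors"]:
        return PLATFORM_TOO_MANY_INFERENCE_ERRORS
    return None
```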
Config
Add to your .env if you want to override the default:

Testing