feat: Detect platform-side inference errors#332
statxc wants to merge 6 commits into ridgesai:main
Conversation
…d for provider failures
@camfairchild Could you please review the PR? I'd appreciate any feedback.
inference_gateway/main.py
Outdated
def is_non_halting_error(status_code: int) -> bool:
    return status_code in NON_HALTING_ERROR_CODES
Would be better termed "platform error" or something.
By "halting error" I meant one that isn't caught properly and halts the process
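For illustration, the helper under discussion might look like this after the suggested rename (a sketch only: PLATFORM_ERROR_CODES is a hypothetical name, with the status codes taken from the PR description):

```python
# Hypothetical constant: the PR description lists 500/502/503/504 as
# provider failures the agent can't control, plus -1 as a sentinel
# for internal provider errors.
PLATFORM_ERROR_CODES = {500, 502, 503, 504, -1}

def is_platform_error(status_code: int) -> bool:
    """True for provider-side failures that are not the agent's fault."""
    return status_code in PLATFORM_ERROR_CODES
```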
Looks good otherwise. Thank you
@camfairchild Thanks for your feedback. I updated the name to
…d embedding and edge case tests
@ibraheem-abe Could you please review this PR? Any feedback is welcome.
Please give me any feedback.
@statxc We want to change this so only that specific test is retried instead of the entire thing
Hey @ibraheem-abe, thanks for the feedback! I think a single-run retry is out of scope for this PR. #331 was about detecting platform-side inference errors and stopping early, and this PR addresses that. Implementing retries from our side is difficult: the platform state machine is one-way (once a run reaches …).

As an interim fix, I could update the SQL view so error 3050 doesn't fail the entire evaluation - failed runs would score 0 and the other results would be preserved. But I think the actual retry behavior is better handled in a separate PR.
…ailing entire evaluation

Based on ridgesai#332 by @statxc which detects platform-side inference errors. Changes the behavior so that when an evaluation run hits the inference error threshold, only that specific run is retried (up to 2 times) instead of marking the entire evaluation as failed.

Flow:
1. Agent finishes → validator checks /api/usage for inference errors
2. If errors >= threshold and retries remaining:
   - Reset error counter via POST /api/reset-inference-errors
   - Re-run only this specific problem (not the whole evaluation)
3. If errors >= threshold and retries exhausted:
   - Mark as PLATFORM_TOO_MANY_INFERENCE_ERRORS (3050)

New additions on top of ridgesai#332:
- ErrorHashMap.reset_inference_errors() method
- POST /api/reset-inference-errors gateway endpoint
- Retry loop in _run_evaluation_run() with MAX_SINGLE_RUN_RETRIES=2
- Tests for reset behavior

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
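The retry flow above can be sketched roughly as follows (a dependency-injected sketch: run_single_problem, get_usage, and reset_errors stand in for validator internals and are not taken from the actual diff):

```python
MAX_SINGLE_RUN_RETRIES = 2  # from the commit message

def run_with_retries(run_single_problem, get_usage, reset_errors, run_id):
    """Retry one evaluation run when it trips the inference error threshold."""
    for attempt in range(MAX_SINGLE_RUN_RETRIES + 1):
        run_single_problem(run_id)
        usage = get_usage(run_id)  # e.g. GET /api/usage
        if usage["inference_errors"] < usage["max_inference_errors"]:
            return "ok"  # under threshold: score the run normally
        if attempt < MAX_SINGLE_RUN_RETRIES:
            reset_errors(run_id)  # e.g. POST /api/reset-inference-errors
    # Retries exhausted: mark only this run as a platform error (3050),
    # leaving the rest of the evaluation intact.
    return "PLATFORM_TOO_MANY_INFERENCE_ERRORS"
```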
@ibraheem-abe Sorry to bother you again. I'd appreciate it if you could approve this PR if there are no problems.
@statxc
CryptoAardvark left a comment
Hi @statxc, hope you don't mind a drive-by review, this PR caught my eye and the design is really thoughtful, especially the separation of platform errors from agent-caused ones and the -1 sentinel for network failures. I had two small observations I wanted to share in case they're useful. Please feel free to disregard anything that doesn't fit the project's conventions. 🙏
inference_gateway/error_hash_map.py
Outdated
if now - self.last_cleanup_at > ERROR_HASH_MAP_CLEANUP_INTERVAL_SECONDS:
    self.error_hash_map = {
        k: v for k, v in self.error_hash_map.items()
        if now - v.last_accessed_at < ERROR_HASH_MAP_CLEANUP_INTERVAL_SECONDS
    }
Suggestion(non-blocking): since this map is accessed from concurrent FastAPI requests, the dict reassignment here (and the check-then-increment pattern in main.py) might benefit from a threading.Lock. Concurrent requests for the same evaluation_run_id could slip past the threshold check before blocking kicks in.
Good catch on the race condition. The fix would actually need to be an asyncio.Lock in the request handler (around the check-await-increment sequence in main.py), not a threading.Lock in the hash map.
The individual methods are already atomic in asyncio. Same pattern exists in CostHashMap for cost limits. Worst case is one extra error slipping past the threshold, so I'll leave this as-is for now.
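For reference, the pattern being weighed here, an asyncio.Lock held across the check-then-increment sequence, could look roughly like this (ErrorCounter is a toy stand-in, not the project's ErrorHashMap):

```python
import asyncio

class ErrorCounter:
    """Toy stand-in for a per-run error map behind a FastAPI handler."""
    def __init__(self, limit: int):
        self.counts: dict[str, int] = {}
        self.limit = limit
        self.lock = asyncio.Lock()  # serialises check + increment

    async def record_error(self, run_id: str) -> bool:
        # Holding the lock across the check and the increment prevents
        # two concurrent requests from both slipping past the threshold.
        async with self.lock:
            if self.counts.get(run_id, 0) >= self.limit:
                return False  # over threshold: caller would return 503
            self.counts[run_id] = self.counts.get(run_id, 0) + 1
            return True

async def main():
    counter = ErrorCounter(limit=2)
    results = await asyncio.gather(*[counter.record_error("run-1") for _ in range(5)])
    print(results.count(True))  # prints 2: only `limit` increments succeed

asyncio.run(main())
```

Without the lock, the worst case (as noted above) is one extra error slipping past the threshold, which is why leaving it as-is is a defensible trade-off.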
inference_gateway/error_hash_map.py
Outdated
self._cleanup()

if uuid in self.error_hash_map:
    self.error_hash_map[uuid].last_accessed_at = time.time()
Suggestion (non-blocking): get_inference_errors refreshes last_accessed_at on reads. If /api/usage polls this for a completed run, it could keep the entry alive past the 60s TTL. I think it might be worth moving the touch out of the get_* path so reads stay pure.
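One way this suggestion could be applied, keeping reads pure and refreshing the TTL only on writes (class and method names here are illustrative, not the project's actual ones):

```python
import time

ENTRY_TTL_SECONDS = 60  # TTL figure taken from the review comment

class _Entry:
    def __init__(self):
        self.errors = 0
        self.last_accessed_at = time.time()

class InferenceErrorMap:
    """Illustrative stand-in for the gateway's per-run error map."""
    def __init__(self):
        self.entries = {}

    def get_inference_errors(self, run_id):
        # Pure read: no last_accessed_at refresh, so polling /api/usage
        # can no longer keep an entry alive past its TTL.
        entry = self.entries.get(run_id)
        return entry.errors if entry else 0

    def increment(self, run_id):
        # Writes (and only writes) refresh the TTL.
        entry = self.entries.setdefault(run_id, _Entry())
        entry.errors += 1
        entry.last_accessed_at = time.time()
```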
…tform-inference-error-detection
@camfairchild Could you please merge this PR since it already has two approvals? |
Detect platform-side inference errors so agents aren't penalized for provider failures
Closes #331
Problem
When an AI provider goes down or returns server errors (500, 502, etc.), the agent's inference() calls return None. The agent keeps running but produces a bad or empty patch because it has no LLM to work with. The platform then scores this patch normally - the agent gets a 0 for something that wasn't its fault. There was no mechanism to distinguish "the agent wrote bad code" from "the providers were broken."
Solution
Track platform-side inference errors per evaluation run and flag the run as a platform error when the count exceeds a configurable threshold.
Platform errors are provider failures that the agent can't control:
500 Internal Server Error
502 Bad Gateway
503 Service Unavailable
504 Gateway Timeout
-1 Internal provider error

Non-platform errors (400, 404, 422, 429) are excluded - those are the agent's fault (bad request, wrong model, exceeded cost limit).
What changed
inference_gateway/error_hash_map.py (new)
New ErrorHashMap class that tracks inference error counts per evaluation_run_id, with the same auto-cleanup pattern as the existing CostHashMap.

inference_gateway/config.py
Added MAX_INFERENCE_ERRORS_PER_EVALUATION_RUN (defaults to 5 if not set in .env). Existing deployments won't break.

inference_gateway/main.py
Returns 503 once the threshold is hit. Extended /api/usage to include inference_errors and max_inference_errors. Added logger.warning() when errors are counted and when the threshold blocks a request.

models/evaluation_run.py
Added PLATFORM_TOO_MANY_INFERENCE_ERRORS = 3050 in the 3xxx platform error range.

validator/main.py
Queries /api/usage on the inference gateway with a 10s timeout. If errors exceed the limit, marks the run as a platform error (3050) instead of scoring the patch. Also wired up the extra field in EvaluationRunException handling - it was designed but never passed through. Now agent_logs are included when reporting platform errors.

tests/test_inference_error_tracking.py (new)
Covers ErrorHashMap unit behavior, platform error classification, error code validation, and integration tests against both inference and embedding gateway endpoints.

How it works end-to-end
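The validator-side decision described above can be sketched like this (fetch_usage stands in for the real HTTP call to /api/usage, which the PR says uses a 10s timeout; only the classification logic is shown):

```python
PLATFORM_TOO_MANY_INFERENCE_ERRORS = 3050  # 3xxx = platform error range

def classify_run(fetch_usage, run_id):
    """Return 3050 when the run tripped the inference error threshold,
    or None when the patch should be scored normally."""
    usage = fetch_usage(run_id)  # real code: GET /api/usage with a 10s timeout
    if usage["inference_errors"] >= usage["max_inference_errors"]:
        return PLATFORM_TOO_MANY_INFERENCE_ERRORS
    return None
```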
Config
Add to your .env if you want to override the default:

Testing