fix(extract_json): tolerate non-strict model JSON (e.g. DeepSeek)#325

Closed
cnndabbler wants to merge 1 commit into VectifyAI:main from cnndabbler:pr/extract-json-robustness

Problem

extract_json() assumes the entire model response is JSON and returns {} on any
JSONDecodeError. Callers then access keys directly (e.g.
toc_detector_single_page does json_content['toc_detected']), so a single
response with stray prose or code fences around the JSON raises KeyError and aborts
the whole index build with 'Processing failed'.
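
A minimal sketch of the failure mode described above, not the repository's actual code. `strict_extract_json` is a hypothetical stand-in that only mirrors the described behavior (parse the whole response, return {} on failure), and the sample response is invented for illustration.

```python
import json


def strict_extract_json(response: str) -> dict:
    """Hypothetical stand-in: parse the whole response, or return {} on failure."""
    try:
        return json.loads(response)
    except json.JSONDecodeError:
        return {}


# An invented response where the model wraps valid JSON in prose.
wrapped = 'Here is the result: {"toc_detected": "yes"} Hope that helps.'

json_content = strict_extract_json(wrapped)  # falls through to {}

try:
    json_content["toc_detected"]  # direct key access, as in the caller
    failure = None
except KeyError as exc:
    failure = repr(exc)  # the error that would abort the run
```

The JSON payload is present in the response, but the strict parse discards it, and the direct key access turns that into an exception.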

This reproduces intermittently on models that don't always return bare JSON, e.g.
deepseek/deepseek-v4-flash via LiteLLM. OpenAI and GLM responses happened to match
the strict path, so the bug wasn't caught.

Fix

  • extract_json: add a balanced-brace fallback (_extract_balanced_json) that pulls
    the first {...}/[...] object out of the raw response when direct parsing fails —
    handles models that wrap JSON in prose or fences.
  • toc_detector_single_page: use .get('toc_detected', 'no') so one unparseable page
    can't crash the run.
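
The two changes above can be sketched roughly as follows. This is an illustrative approximation, not the code in this PR: `first_balanced_json` and `extract_json_sketch` are hypothetical names, and the real `_extract_balanced_json` may differ in how it treats strings, nesting, or multiple candidates.

```python
import json
from typing import Any, Optional


def first_balanced_json(text: str) -> Optional[Any]:
    """Return the first parseable {...} or [...] value embedded in text, else None."""
    pairs = {"{": "}", "[": "]"}
    for start, ch in enumerate(text):
        if ch not in pairs:
            continue
        stack = []          # expected closing brackets, innermost last
        in_string = False   # brackets inside JSON strings must not count
        escaped = False
        for end in range(start, len(text)):
            c = text[end]
            if in_string:
                if escaped:
                    escaped = False
                elif c == "\\":
                    escaped = True
                elif c == '"':
                    in_string = False
                continue
            if c == '"':
                in_string = True
            elif c in pairs:
                stack.append(pairs[c])
            elif c in "}]":
                if not stack or stack.pop() != c:
                    break  # mismatched closer: try the next opening bracket
                if not stack:
                    try:
                        return json.loads(text[start:end + 1])
                    except json.JSONDecodeError:
                        break  # balanced but not JSON: try the next opener
    return None


def extract_json_sketch(response: str) -> Any:
    """Strict parse first; fall back to the balanced scan; {} if both fail."""
    try:
        return json.loads(response)
    except json.JSONDecodeError:
        found = first_balanced_json(response)
        return found if found is not None else {}


fence = "`" * 3  # built at runtime so this example contains no nested fence
wrapped = f'Here you go:\n{fence}json\n{{"toc_detected": "yes", "note": "a }} inside"}}\n{fence}'

recovered = extract_json_sketch(wrapped)
# Defaulted access, mirroring the toc_detector_single_page change.
unparseable = extract_json_sketch("I could not decide.").get("toc_detected", "no")
```

Tracking string state keeps a brace inside a string value from closing the object early. The defaulted `.get` only helps when the parsed value is a dict, so a caller that might receive a list would need its own check.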

No behavior change for responses that already parse. Verified end-to-end on a 39-page
PDF with deepseek/deepseek-v4-flash (previously failed at TOC detection, now completes
at 100% accuracy).

🤖 Generated with Claude Code

extract_json() assumed the whole response is JSON and returned {} on any
parse failure, which then KeyError-crashed callers (toc_detector_single_page)
mid-index-build on models that wrap JSON in prose/fences. Add a balanced-brace
fallback that pulls the first {...}/[...] object out of the raw response, and
default toc_detector's key access so a single bad page can't abort the run.

Repros on deepseek/deepseek-v4-flash; OpenAI/glm happened to match the strict
path. Fixes intermittent 'Processing failed' on long PDFs.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>