Filed from PR #325 on VectifyAI/PageIndex, which I am closing because I am moving my fork to a private repo. The change is captured below (description + full diff) so it stays available for review. Happy to open a fresh PR if a maintainer wants to take it.
Problem
extract_json() assumes the entire model response is JSON and returns {} on any
JSONDecodeError. Callers then access keys directly (e.g.
toc_detector_single_page does json_content['toc_detected']), so a single
response with stray prose/code-fences around the JSON raises KeyError and aborts
the whole index build with Processing failed.
This reproduces intermittently on models that don't return bare JSON — e.g.
deepseek/deepseek-v4-flash via LiteLLM. OpenAI/GLM happened to match the strict
path, so it wasn't caught.
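The failure mode can be reproduced in isolation. The sketch below is a simplification, not the project's code: `strict_extract_json` is a hypothetical stand-in that only mimics the behavior described above (parse the whole response, return {} on JSONDecodeError), and the sample response is invented for illustration.

```python
import json
import logging


def strict_extract_json(content):
    """Hypothetical stand-in for the pre-fix behavior: parse the whole string or return {}."""
    try:
        return json.loads(content)
    except json.JSONDecodeError:
        logging.error("Failed to parse JSON even after cleanup")
        return {}


# Invented example of a response that wraps valid JSON in prose.
response = 'Here is my answer: {"toc_detected": "yes"}'
parsed = strict_extract_json(response)

try:
    parsed['toc_detected']  # direct key access, as the caller did before the fix
    failure = None
except KeyError as exc:
    failure = exc

# The .get() form from the fix degrades to 'no' instead of raising.
safe_value = parsed.get('toc_detected', 'no')
```

With direct key access, the empty dict turns a parse failure on one page into an exception for the whole run; the `.get()` form only loses the answer for that page.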
Fix
extract_json: add a balanced-brace fallback (_extract_balanced_json) that pulls
the first {...}/[...] object out of the raw response when direct parsing fails —
handles models that wrap JSON in prose or fences.
toc_detector_single_page: use .get('toc_detected', 'no') so one unparseable page
can't crash the run.
No behavior change for responses that already parse. Verified end-to-end on a 39-page
PDF with deepseek/deepseek-v4-flash (previously failed at TOC detection, now completes
at 100% accuracy).
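To see what the fallback does with wrapped responses, the helper can be exercised on its own. The function body below is copied from the diff in this issue with its indentation restored; the sample strings are made up for illustration and are not captured model output.

```python
import json


def _extract_balanced_json(text):
    """Find and parse the first balanced {...} or [...] object in text."""
    for open_ch, close_ch in (('{', '}'), ('[', ']')):
        start = text.find(open_ch)
        if start == -1:
            continue
        depth = 0
        in_str = False
        esc = False
        for i in range(start, len(text)):
            ch = text[i]
            if in_str:
                # Inside a JSON string: only track escapes and the closing quote.
                if esc:
                    esc = False
                elif ch == '\\':
                    esc = True
                elif ch == '"':
                    in_str = False
                continue
            if ch == '"':
                in_str = True
            elif ch == open_ch:
                depth += 1
            elif ch == close_ch:
                depth -= 1
                if depth == 0:
                    try:
                        return json.loads(text[start:i + 1])
                    except json.JSONDecodeError:
                        break
    return None


fence = "`" * 3  # built at runtime to keep a literal fence out of this example
fenced = 'Sure, here is the result:\n' + fence + 'json\n{"toc_detected": "yes"}\n' + fence
tricky = 'Answer: {"note": "a } inside a string", "toc_detected": "no"}'
array_case = 'Items: [{"a": 1}, {"a": 2}]'

from_fenced = _extract_balanced_json(fenced)      # JSON wrapped in prose and a fence
from_tricky = _extract_balanced_json(tricky)      # closing brace inside a string is skipped
from_array = _extract_balanced_json(array_case)   # '{' is searched before '['
from_nothing = _extract_balanced_json('no json here')
```

Because '{' is searched before '[', a response whose top-level value is an array of objects yields the first inner object rather than the array. That does not affect the toc_detected case, but it may matter for callers that expect a list.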
🤖 Generated with Claude Code
Full diff (from the closed PR)
diff --git a/pageindex/page_index.py b/pageindex/page_index.py
index 9004309..fe462e2 100644
--- a/pageindex/page_index.py
+++ b/pageindex/page_index.py
@@ -118,8 +118,8 @@ def toc_detector_single_page(content, model=None):
     response = llm_completion(model=model, prompt=prompt)
     # print('response', response)
-    json_content = extract_json(response)
-    return json_content['toc_detected']
+    json_content = extract_json(response)
+    return json_content.get('toc_detected', 'no')
 def check_if_toc_extraction_is_complete(content, toc, model=None):
diff --git a/pageindex/utils.py b/pageindex/utils.py
index f00ccf3..5e8bced 100644
--- a/pageindex/utils.py
+++ b/pageindex/utils.py
@@ -96,6 +96,44 @@ def get_json_content(response):
     return json_content
+def _extract_balanced_json(text):
+    """Find and parse the first balanced {...} or [...] object in text.
+
+    Robustness fallback for models (e.g. DeepSeek) that occasionally wrap the
+    JSON in prose or code fences instead of returning it bare. Returns the
+    parsed object, or None if nothing parseable is found.
+    """
+    for open_ch, close_ch in (('{', '}'), ('[', ']')):
+        start = text.find(open_ch)
+        if start == -1:
+            continue
+        depth = 0
+        in_str = False
+        esc = False
+        for i in range(start, len(text)):
+            ch = text[i]
+            if in_str:
+                if esc:
+                    esc = False
+                elif ch == '\\':
+                    esc = True
+                elif ch == '"':
+                    in_str = False
+                continue
+            if ch == '"':
+                in_str = True
+            elif ch == open_ch:
+                depth += 1
+            elif ch == close_ch:
+                depth -= 1
+                if depth == 0:
+                    try:
+                        return json.loads(text[start:i + 1])
+                    except json.JSONDecodeError:
+                        break
+    return None
+
+
 def extract_json(content):
     try:
         # First, try to extract JSON enclosed within ```json and ```
@@ -122,7 +160,12 @@ def extract_json(content):
             # Remove any trailing commas before closing brackets/braces
             json_content = json_content.replace(',]', ']').replace(',}', '}')
             return json.loads(json_content)
-        except:
+        except json.JSONDecodeError:
+            # Last resort: pull the first balanced JSON object out of the raw
+            # response (handles models that add prose/fences around the JSON).
+            obj = _extract_balanced_json(content)
+            if obj is not None:
+                return obj
             logging.error("Failed to parse JSON even after cleanup")
             return {}
     except Exception as e: