Bug/Fix: extract_json crashes index build on non-strict model JSON (e.g. DeepSeek) #326

@cnndabbler

Filed from PR #325 on VectifyAI/PageIndex, which I am closing because I am moving my fork to a private repo. The change is captured below (description + full diff) so it stays available for review. Happy to open a fresh PR if a maintainer wants to take it.


Problem

extract_json() assumes the entire model response is JSON and returns {} on any
JSONDecodeError. Callers then index the result directly (e.g.
toc_detector_single_page does json_content['toc_detected']), so a single
response with stray prose or code fences around the JSON raises KeyError and aborts
the whole index build with "Processing failed".

This reproduces intermittently on models that don't return bare JSON — e.g.
deepseek/deepseek-v4-flash via LiteLLM. OpenAI/GLM happened to match the strict
path, so it wasn't caught.
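
The failure mode can be reproduced without PageIndex or an LLM. The sketch below is a minimal illustration, not the project's code: strict_extract is a hypothetical stand-in for the behavior described above (parse the whole response, return {} on JSONDecodeError), and both response strings are made up.

```python
import json


def strict_extract(response):
    """Hypothetical stand-in for the pre-fix behavior described above:
    parse the whole response as JSON and return {} when that fails."""
    try:
        return json.loads(response)
    except json.JSONDecodeError:
        return {}


# Made-up responses: one bare, one wrapped in prose.
bare = '{"toc_detected": "yes"}'
wrapped = 'Here is the result:\n{"toc_detected": "yes"}\nLet me know if you need more.'

print(strict_extract(bare)['toc_detected'])  # yes

parsed = strict_extract(wrapped)  # {} because of the surrounding prose
try:
    parsed['toc_detected']
    raised = False
except KeyError:
    raised = True  # the same exception type that aborts the index build

print(parsed, raised)  # {} True
```

The JSON payload is identical in both strings; only the surrounding prose differs, which is why the crash depends on the model rather than on the document.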

Fix

  • extract_json: add a balanced-brace fallback (_extract_balanced_json) that pulls
    the first {...}/[...] object out of the raw response when direct parsing fails —
    handles models that wrap JSON in prose or fences.
  • toc_detector_single_page: use .get('toc_detected', 'no') so one unparseable page
    can't crash the run.
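
The effect of the second bullet can be sketched in isolation. The snippet assumes three already-parsed page results, where the empty dict stands in for a page whose response could not be parsed at all; toc_flag and parsed_pages are hypothetical names used only for illustration.

```python
def toc_flag(json_content):
    """Mirror of the patched return statement: a missing key reads as 'no'."""
    return json_content.get('toc_detected', 'no')


# Hypothetical parsed results for three pages. The middle entry represents
# a response for which extract_json fell through and returned {}.
parsed_pages = [{'toc_detected': 'no'}, {}, {'toc_detected': 'yes'}]

flags = [toc_flag(page) for page in parsed_pages]
print(flags)  # ['no', 'no', 'yes']
```

The trade-off is that an unparseable page is treated as having no TOC instead of stopping the run. The logging.error call in extract_json still records the parse failure, so the miss is visible in the logs.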

No behavior change for responses that already parse. Verified end-to-end on a 39-page
PDF with deepseek/deepseek-v4-flash (previously failed at TOC detection, now completes
at 100% accuracy).

🤖 Generated with Claude Code

Full diff (from the closed PR)
diff --git a/pageindex/page_index.py b/pageindex/page_index.py
index 9004309..fe462e2 100644
--- a/pageindex/page_index.py
+++ b/pageindex/page_index.py
@@ -118,8 +118,8 @@ def toc_detector_single_page(content, model=None):
 
     response = llm_completion(model=model, prompt=prompt)
     # print('response', response)
-    json_content = extract_json(response)    
-    return json_content['toc_detected']
+    json_content = extract_json(response)
+    return json_content.get('toc_detected', 'no')
 
 
 def check_if_toc_extraction_is_complete(content, toc, model=None):
diff --git a/pageindex/utils.py b/pageindex/utils.py
index f00ccf3..5e8bced 100644
--- a/pageindex/utils.py
+++ b/pageindex/utils.py
@@ -96,6 +96,44 @@ def get_json_content(response):
     return json_content
          
 
+def _extract_balanced_json(text):
+    """Find and parse the first balanced {...} or [...] object in text.
+
+    Robustness fallback for models (e.g. DeepSeek) that occasionally wrap the
+    JSON in prose or code fences instead of returning it bare. Returns the
+    parsed object, or None if nothing parseable is found.
+    """
+    for open_ch, close_ch in (('{', '}'), ('[', ']')):
+        start = text.find(open_ch)
+        if start == -1:
+            continue
+        depth = 0
+        in_str = False
+        esc = False
+        for i in range(start, len(text)):
+            ch = text[i]
+            if in_str:
+                if esc:
+                    esc = False
+                elif ch == '\\':
+                    esc = True
+                elif ch == '"':
+                    in_str = False
+                continue
+            if ch == '"':
+                in_str = True
+            elif ch == open_ch:
+                depth += 1
+            elif ch == close_ch:
+                depth -= 1
+                if depth == 0:
+                    try:
+                        return json.loads(text[start:i + 1])
+                    except json.JSONDecodeError:
+                        break
+    return None
+
+
 def extract_json(content):
     try:
         # First, try to extract JSON enclosed within ```json and ```
@@ -122,7 +160,12 @@ def extract_json(content):
             # Remove any trailing commas before closing brackets/braces
             json_content = json_content.replace(',]', ']').replace(',}', '}')
             return json.loads(json_content)
-        except:
+        except json.JSONDecodeError:
+            # Last resort: pull the first balanced JSON object out of the raw
+            # response (handles models that add prose/fences around the JSON).
+            obj = _extract_balanced_json(content)
+            if obj is not None:
+                return obj
             logging.error("Failed to parse JSON even after cleanup")
             return {}
     except Exception as e:
