Bug/Fix: extract_json crashes index build on non-strict model JSON (e.g. DeepSeek)

> Filed from PR #325 on `VectifyAI/PageIndex`, which I am closing because I am moving my fork to a private repo. The change is captured below (description + full diff) so it stays available for review. Happy to open a fresh PR if a maintainer wants to take it.

---

## Problem

`extract_json()` assumes the entire model response is JSON and returns `{}` whenever
parsing fails. Callers then index the result directly (e.g.
`toc_detector_single_page` does `json_content['toc_detected']`), so a single
response with stray prose or code fences around the JSON raises `KeyError` and aborts
the whole index build with `Processing failed`.

This reproduces intermittently on models that don't return bare JSON, e.g.
`deepseek/deepseek-v4-flash` via LiteLLM. Responses from OpenAI and GLM models
happened to match the strict path, so the bug went unnoticed.
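
A minimal sketch of the failure mode, for anyone who wants to reproduce it without a model call. `strict_extract_json` is a hypothetical, simplified stand-in for the pre-fix behavior (parse the whole string or return `{}`); it is not the real `extract_json`, and the sample response is made up for illustration.

```python
import json
import logging


def strict_extract_json(content):
    """Hypothetical stand-in: parse the whole string as JSON or return {}."""
    try:
        return json.loads(content)
    except json.JSONDecodeError:
        logging.error("Failed to parse JSON even after cleanup")
        return {}


# Illustrative response: valid JSON, but wrapped in prose.
response = 'Sure, here is the result:\n{"toc_detected": "yes"}\nLet me know if you need more.'

json_content = strict_extract_json(response)

try:
    # Direct key access, the same pattern toc_detector_single_page used.
    json_content['toc_detected']
    crashed = False
except KeyError:
    crashed = True

print(json_content, crashed)
```

The parser swallows the decode error and hands back an empty dict, so the failure only surfaces later as a `KeyError` at the call site.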

## Fix

- `extract_json`: add a balanced-brace fallback (`_extract_balanced_json`) that, when
  direct parsing fails, pulls the first balanced `{...}` (or, failing that, `[...]`)
  value out of the raw response. This handles models that wrap JSON in prose or fences.
- `toc_detector_single_page`: use `.get('toc_detected', 'no')` so one unparseable page
  is treated as "no TOC" instead of crashing the run.

No behavior change for responses that already parse. Verified end-to-end on a 39-page
PDF with `deepseek/deepseek-v4-flash` (previously failed at TOC detection, now completes
at 100% accuracy).

🤖 Generated with [Claude Code](https://claude.com/claude-code)


<details>
<summary><b>Full diff</b> (from the closed PR)</summary>

```diff
diff --git a/pageindex/page_index.py b/pageindex/page_index.py
index 9004309..fe462e2 100644
--- a/pageindex/page_index.py
+++ b/pageindex/page_index.py
@@ -118,8 +118,8 @@ def toc_detector_single_page(content, model=None):
 
     response = llm_completion(model=model, prompt=prompt)
     # print('response', response)
-    json_content = extract_json(response)    
-    return json_content['toc_detected']
+    json_content = extract_json(response)
+    return json_content.get('toc_detected', 'no')
 
 
 def check_if_toc_extraction_is_complete(content, toc, model=None):
diff --git a/pageindex/utils.py b/pageindex/utils.py
index f00ccf3..5e8bced 100644
--- a/pageindex/utils.py
+++ b/pageindex/utils.py
@@ -96,6 +96,44 @@ def get_json_content(response):
     return json_content
          
 
+def _extract_balanced_json(text):
+    """Find and parse the first balanced {...} or [...] object in text.
+
+    Robustness fallback for models (e.g. DeepSeek) that occasionally wrap the
+    JSON in prose or code fences instead of returning it bare. Returns the
+    parsed object, or None if nothing parseable is found.
+    """
+    for open_ch, close_ch in (('{', '}'), ('[', ']')):
+        start = text.find(open_ch)
+        if start == -1:
+            continue
+        depth = 0
+        in_str = False
+        esc = False
+        for i in range(start, len(text)):
+            ch = text[i]
+            if in_str:
+                if esc:
+                    esc = False
+                elif ch == '\\':
+                    esc = True
+                elif ch == '"':
+                    in_str = False
+                continue
+            if ch == '"':
+                in_str = True
+            elif ch == open_ch:
+                depth += 1
+            elif ch == close_ch:
+                depth -= 1
+                if depth == 0:
+                    try:
+                        return json.loads(text[start:i + 1])
+                    except json.JSONDecodeError:
+                        break
+    return None
+
+
 def extract_json(content):
     try:
         # First, try to extract JSON enclosed within ```json and ```
@@ -122,7 +160,12 @@ def extract_json(content):
             # Remove any trailing commas before closing brackets/braces
             json_content = json_content.replace(',]', ']').replace(',}', '}')
             return json.loads(json_content)
-        except:
+        except json.JSONDecodeError:
+            # Last resort: pull the first balanced JSON object out of the raw
+            # response (handles models that add prose/fences around the JSON).
+            obj = _extract_balanced_json(content)
+            if obj is not None:
+                return obj
             logging.error("Failed to parse JSON even after cleanup")
             return {}
     except Exception as e:

```

</details>
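
One note for whoever picks this up: as the diff reads, `_extract_balanced_json` looks for `{` before `[`, so a prose-wrapped top-level array such as `Result: [{"a": 1}, {"b": 2}]` would yield its first inner object rather than the list. If that matters for other callers, a possible variant is sketched below. It is not part of the PR: `extract_first_json_value` is a hypothetical name, and it swaps the manual brace counter for the standard library's `json.JSONDecoder.raw_decode`, which parses one JSON value at a given index and ignores trailing text.

```python
import json


def extract_first_json_value(text):
    """Hypothetical variant: start from whichever of '{' or '[' appears first."""
    starts = [i for i in (text.find('{'), text.find('[')) if i != -1]
    if not starts:
        return None
    start = min(starts)
    try:
        # raw_decode returns (parsed_value, end_index); trailing prose is ignored.
        obj, _end = json.JSONDecoder().raw_decode(text, start)
        return obj
    except json.JSONDecodeError:
        return None


print(extract_first_json_value('Result: [{"a": 1}, {"b": 2}] done'))
print(extract_first_json_value('x {"k": [1, 2]} y'))
```

Unlike the fallback in the diff, this variant gives up if the first opener does not begin valid JSON, so it is a trade-off, not a drop-in improvement.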

