Skip to content

_call_claude_p_judge: strip markdown code fences from envelope.result before json.loads #1065

@akaszubski

Description

@akaszubski

Bug

_call_claude_p_judge in scripts/extract_and_label_intent_corpus.py calls json.loads(envelope["result"].strip()) directly, but Claude often wraps JSON responses in markdown code fences:

```
"result": "```json\n{"intent":"implement","confidence":0.95}\n```"
```

json.loads on the fenced string raises JSONDecodeError, the function returns None, and the prompt is silently dropped as judge_failure.

Reproduction (post-#1064 fix)

```bash
$ python3 -c "
import sys; sys.path.insert(0, 'scripts')
from extract_and_label_intent_corpus import _call_claude_p_judge
print(_call_claude_p_judge('implement pagination', model='claude-haiku-4-5-20251001'))
"
None
```

Manual subprocess call confirms claude -p works (returncode=0, is_error=false), but the result field contains:

```
"result":"```json\n{"intent":"implement","confidence":0.95}\n```"
```

Why existing tests didn't catch it

All unit tests provide clean unfenced JSON in their mocked envelopes:
```python
fake_envelope = {"result": '{"intent": "implement", "confidence": 0.9}', ...}
```
So the markdown-fence behavior was never exercised at the unit-test layer.

Fix options (1-line each)

Option A: Strip markdown fences defensively before parsing:

```python
raw = envelope.get("result")
if not isinstance(raw, str):
return None
raw = raw.strip()

Strip markdown code fences if present

if raw.startswith("```"):
raw = raw.split("\n", 1)[1] if "\n" in raw else raw
if raw.endswith("```"):
raw = raw.rsplit("```", 1)[0]
raw = raw.strip()
try:
parsed = json.loads(raw)
except json.JSONDecodeError:
return None
```

Option B: Strengthen the system prompt + use `--json-schema`:

```python
cmd = [
"claude", "-p",
"--output-format", "json",
"--json-schema", '{"type":"object","properties":{"intent":{"type":"string","enum":[...13 classes]},"confidence":{"type":"number"}},"required":["intent","confidence"]}',
"--model", model,
"--max-turns", "1",
"--system-prompt", _JUDGE_SYSTEM_PROMPT,
]
```

Then read `envelope["structured_output"]` instead of parsing `envelope["result"]`. Eliminates the markdown-fence problem entirely because Anthropic enforces the schema server-side.

Recommendation

Option B is more robust (server-side schema enforcement, no client-side regex parsing) but adds a longer command line. Option A is the minimal fix.

For now: Option A as a quick fix, possibly migrate to Option B in a follow-up.

Regression test

Add a test where the mocked envelope has `"result": "```json\n{...}\n```"` and assert `_call_claude_p_judge` returns the parsed dict (not None).

Severity

HIGH — the corpus refresh (the original use case for the refactor) cannot complete without this fix. Combined with #1064, this is the second silent failure in the refactor.

Related

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't working

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions