Bug
_call_claude_p_judge in scripts/extract_and_label_intent_corpus.py calls json.loads(envelope["result"].strip()) directly, but Claude often wraps JSON responses in markdown code fences:
```
"result": "```json\n{"intent":"implement","confidence":0.95}\n```"
```
json.loads on the fenced string raises JSONDecodeError, the function returns None, and the prompt is silently dropped as judge_failure.
Reproduction (post-#1064 fix)
```bash
$ python3 -c "
import sys; sys.path.insert(0, 'scripts')
from extract_and_label_intent_corpus import _call_claude_p_judge
print(_call_claude_p_judge('implement pagination', model='claude-haiku-4-5-20251001'))
"
None
```
Manual subprocess call confirms claude -p works (returncode=0, is_error=false), but the result field contains:
```
"result":"```json\n{"intent":"implement","confidence":0.95}\n```"
```
Why existing tests didn't catch it
All unit tests provide clean unfenced JSON in their mocked envelopes:
```python
fake_envelope = {"result": '{"intent": "implement", "confidence": 0.9}', ...}
```
So the markdown-fence behavior was never exercised at the unit-test layer.
Fix options (1-line each)
Option A: Strip markdown fences defensively before parsing:
```python
raw = envelope.get("result")
if not isinstance(raw, str):
return None
raw = raw.strip()
Strip markdown code fences if present
if raw.startswith("```"):
raw = raw.split("\n", 1)[1] if "\n" in raw else raw
if raw.endswith("```"):
raw = raw.rsplit("```", 1)[0]
raw = raw.strip()
try:
parsed = json.loads(raw)
except json.JSONDecodeError:
return None
```
Option B: Strengthen the system prompt + use `--json-schema`:
```python
cmd = [
"claude", "-p",
"--output-format", "json",
"--json-schema", '{"type":"object","properties":{"intent":{"type":"string","enum":[...13 classes]},"confidence":{"type":"number"}},"required":["intent","confidence"]}',
"--model", model,
"--max-turns", "1",
"--system-prompt", _JUDGE_SYSTEM_PROMPT,
]
```
Then read `envelope["structured_output"]` instead of parsing `envelope["result"]`. Eliminates the markdown-fence problem entirely because Anthropic enforces the schema server-side.
Recommendation
Option B is more robust (server-side schema enforcement, no client-side regex parsing) but adds a longer command line. Option A is the minimal fix.
For now: Option A as a quick fix, possibly migrate to Option B in a follow-up.
Regression test
Add a test where the mocked envelope has `"result": "```json\n{...}\n```"` and assert `_call_claude_p_judge` returns the parsed dict (not None).
Severity
HIGH — the corpus refresh (the original use case for the refactor) cannot complete without this fix. Combined with #1064, this is the second silent failure in the refactor.
Related
Bug
_call_claude_p_judgeinscripts/extract_and_label_intent_corpus.pycallsjson.loads(envelope["result"].strip())directly, but Claude often wraps JSON responses in markdown code fences:```
"result": "```json\n{"intent":"implement","confidence":0.95}\n```"
```
json.loadson the fenced string raisesJSONDecodeError, the function returnsNone, and the prompt is silently dropped asjudge_failure.Reproduction (post-#1064 fix)
```bash
$ python3 -c "
import sys; sys.path.insert(0, 'scripts')
from extract_and_label_intent_corpus import _call_claude_p_judge
print(_call_claude_p_judge('implement pagination', model='claude-haiku-4-5-20251001'))
"
None
```
Manual subprocess call confirms claude -p works (returncode=0, is_error=false), but the result field contains:
```
"result":"```json\n{"intent":"implement","confidence":0.95}\n```"
```
Why existing tests didn't catch it
All unit tests provide clean unfenced JSON in their mocked envelopes:
```python
fake_envelope = {"result": '{"intent": "implement", "confidence": 0.9}', ...}
```
So the markdown-fence behavior was never exercised at the unit-test layer.
Fix options (1-line each)
Option A: Strip markdown fences defensively before parsing:
```python
raw = envelope.get("result")
if not isinstance(raw, str):
return None
raw = raw.strip()
Strip markdown code fences if present
if raw.startswith("```"):
raw = raw.split("\n", 1)[1] if "\n" in raw else raw
if raw.endswith("```"):
raw = raw.rsplit("```", 1)[0]
raw = raw.strip()
try:
parsed = json.loads(raw)
except json.JSONDecodeError:
return None
```
Option B: Strengthen the system prompt + use `--json-schema`:
```python
cmd = [
"claude", "-p",
"--output-format", "json",
"--json-schema", '{"type":"object","properties":{"intent":{"type":"string","enum":[...13 classes]},"confidence":{"type":"number"}},"required":["intent","confidence"]}',
"--model", model,
"--max-turns", "1",
"--system-prompt", _JUDGE_SYSTEM_PROMPT,
]
```
Then read `envelope["structured_output"]` instead of parsing `envelope["result"]`. Eliminates the markdown-fence problem entirely because Anthropic enforces the schema server-side.
Recommendation
Option B is more robust (server-side schema enforcement, no client-side regex parsing) but adds a longer command line. Option A is the minimal fix.
For now: Option A as a quick fix, possibly migrate to Option B in a follow-up.
Regression test
Add a test where the mocked envelope has `"result": "```json\n{...}\n```"` and assert `_call_claude_p_judge` returns the parsed dict (not None).
Severity
HIGH — the corpus refresh (the original use case for the refactor) cannot complete without this fix. Combined with #1064, this is the second silent failure in the refactor.
Related