Description
When attempting to follow the batch tutorial at https://github.com/google/langextract/blob/main/docs/examples/batch_api_example.md, an error occurs when reading the batch predictions, with the following traceback:
```
Traceback (most recent call last):
  File "/opt/venv/lib/python3.12/site-packages/langextract/core/format_handler.py", line 176, in parse_output
    parsed = json.loads(content)
             ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/json/__init__.py", line 346, in loads
    return _default_decoder.decode(s)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/json/decoder.py", line 338, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/json/decoder.py", line 354, in raw_decode
    obj, end = self.scan_once(s, idx)
               ^^^^^^^^^^^^^^^^^^^^^^
json.decoder.JSONDecodeError: Unterminated string starting at: line 7 column 16 (char 163)

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/venv/lib/python3.12/site-packages/langextract/resolver.py", line 260, in resolve
    extraction_data = self.format_handler.parse_output(
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.12/site-packages/langextract/core/format_handler.py", line 182, in parse_output
    raise exceptions.FormatParseError(msg) from e
langextract.core.exceptions.FormatParseError: Failed to parse JSON content: Unterminated string starting at: line 7 column 16 (char 163)

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/ds/etl.py", line 33, in <module>
    etl_switch()
  File "/ds/etl.py", line 26, in etl_switch
    numeric_extraction.prepare_extract_intro_call_numbers()
  File "/ds/scripts/numeric_extraction.py", line 221, in prepare_extract_intro_call_numbers
    results = lx.extract(
              ^^^^^^^^^^^
  File "/opt/venv/lib/python3.12/site-packages/langextract/__init__.py", line 55, in extract
    return extract_func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.12/site-packages/langextract/extraction.py", line 358, in extract
    return list(result)
           ^^^^^^^^^^^^
  File "/opt/venv/lib/python3.12/site-packages/langextract/annotation.py", line 255, in annotate_documents
    yield from self._annotate_documents_single_pass(
  File "/opt/venv/lib/python3.12/site-packages/langextract/annotation.py", line 388, in _annotate_documents_single_pass
    resolved_extractions = resolver.resolve(
                           ^^^^^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.12/site-packages/langextract/resolver.py", line 271, in resolve
    raise ResolverParsingError(str(e)) from e
langextract.resolver.ResolverParsingError: Failed to parse JSON content: Unterminated string starting at: line 7 column 16 (char 163)
```
It looks like the parser isn't set up to read JSON Lines (JSONL) files and instead expects a single regular JSON document.
Expected Behavior
I would expect the parser to recognize that it is reading a JSONL file rather than regular JSON, and to parse the predictions line by line.
Actual Behavior
The parser appears to read the entire JSONL predictions file as a single regular JSON document and fails with the error above.
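The mismatch can be demonstrated in isolation with the standard library alone (a minimal sketch; the two-record payload below is illustrative, not the actual batch output):

```python
import json

# Two records in JSONL form: one complete JSON object per line.
jsonl_content = '{"extraction": "a"}\n{"extraction": "b"}\n'

# Parsing the whole payload as a single JSON document fails,
# because a second top-level value follows the first one.
try:
    json.loads(jsonl_content)
except json.JSONDecodeError as e:
    print(f"whole-file parse failed: {e}")

# Parsing line by line succeeds.
records = [json.loads(line) for line in jsonl_content.splitlines() if line.strip()]
print(records)
```

This is the same failure mode as in the traceback: `json.loads` is being handed a multi-record JSONL payload.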
Steps to Reproduce the Issue
After pre-processing the examples, the prompt, and the incoming inference data, I run the following code with langextract 1.1.0:
```python
# Configure batch settings
batch_config = {
    "enabled": True,
    "threshold": 10,
    "poll_interval": 30,
    "timeout": 3600,
    "enable_caching": True,
    "retention_days": 30,
}

# Run the extraction
results = lx.extract(
    text_or_documents=documents,
    prompt_description=prompt_examples.get('prompt'),
    examples=examples,
    model_id=prompt_examples.get('model'),
    batch_length=1000,
    language_model_params={
        "vertexai": True,
        "project": "your-project-here",
        "location": "us-central1",
        "batch": batch_config,
    },
)
```
The batch job is created correctly, runs through, and the predictions file is created, but then langextract fails when reading it.
Proposed Solution
You can take the JSONL file and correctly pull out the predictions by reading it line by line; for example, this snippet works:
```python
import json

processed_records = []
with open('lang_extract_batch_output.jsonl', 'r') as f:
    for i, line in enumerate(f):
        try:
            record = json.loads(line)
            processed_records.append(record)
        except json.JSONDecodeError:
            print(f'Skipping record {i} due to bad parse')
```
Using this, each line was placed into the `processed_records` list correctly.
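A fix inside langextract could keep the existing single-document path and only fall back to per-line parsing when it fails. The sketch below is a hypothetical helper, not the actual `format_handler` API; the function name and fallback behavior are assumptions for illustration:

```python
import json

def parse_predictions(content: str) -> list:
    """Hypothetical helper: try whole-document JSON first, then fall back to JSONL."""
    try:
        # Regular JSON: the whole payload is one document.
        return [json.loads(content)]
    except json.JSONDecodeError:
        pass  # Likely a JSONL payload; parse it record by record.

    records = []
    for i, line in enumerate(content.splitlines()):
        line = line.strip()
        if not line:
            continue  # Skip blank lines between records.
        try:
            records.append(json.loads(line))
        except json.JSONDecodeError:
            print(f'Skipping record {i} due to bad parse')
    return records
```

This keeps existing regular-JSON behavior intact while accepting the JSONL output that Vertex AI batch prediction produces.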