batch langextract: langextract.resolver.ResolverParsingError: Failed to parse JSON content: Unterminated string starting at #287


Describe the overall issue and situation

When attempting to follow the batch tutorial at https://github.com/google/langextract/blob/main/docs/examples/batch_api_example.md, the following error occurs when langextract reads the batch predictions:

```
Traceback (most recent call last):
  File "/opt/venv/lib/python3.12/site-packages/langextract/core/format_handler.py", line 176, in parse_output
    parsed = json.loads(content)
             ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/json/__init__.py", line 346, in loads
    return _default_decoder.decode(s)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/json/decoder.py", line 338, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/json/decoder.py", line 354, in raw_decode
    obj, end = self.scan_once(s, idx)
               ^^^^^^^^^^^^^^^^^^^^^^
json.decoder.JSONDecodeError: Unterminated string starting at: line 7 column 16 (char 163)

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/venv/lib/python3.12/site-packages/langextract/resolver.py", line 260, in resolve
    extraction_data = self.format_handler.parse_output(
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.12/site-packages/langextract/core/format_handler.py", line 182, in parse_output
    raise exceptions.FormatParseError(msg) from e
langextract.core.exceptions.FormatParseError: Failed to parse JSON content: Unterminated string starting at: line 7 column 16 (char 163)

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/ds/etl.py", line 33, in <module>
    etl_switch()
  File "/ds/etl.py", line 26, in etl_switch
    numeric_extraction.prepare_extract_intro_call_numbers()
  File "/ds/scripts/numeric_extraction.py", line 221, in prepare_extract_intro_call_numbers
    results = lx.extract(
              ^^^^^^^^^^^
  File "/opt/venv/lib/python3.12/site-packages/langextract/__init__.py", line 55, in extract
    return extract_func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.12/site-packages/langextract/extraction.py", line 358, in extract
    return list(result)
           ^^^^^^^^^^^^
  File "/opt/venv/lib/python3.12/site-packages/langextract/annotation.py", line 255, in annotate_documents
    yield from self._annotate_documents_single_pass(
  File "/opt/venv/lib/python3.12/site-packages/langextract/annotation.py", line 388, in _annotate_documents_single_pass
    resolved_extractions = resolver.resolve(
                           ^^^^^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.12/site-packages/langextract/resolver.py", line 271, in resolve
    raise ResolverParsingError(str(e)) from e
langextract.resolver.ResolverParsingError: Failed to parse JSON content: Unterminated string starting at: line 7 column 16 (char 163)
```

It looks like the output parser isn't set up to read JSON Lines (JSONL) files properly and instead expects regular JSON.
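For illustration, here is a generic sketch of the failure mode (this is not langextract's actual code path, and the two-record payload is made up): feeding a whole JSONL payload to `json.loads` fails, while parsing it line by line succeeds.

```python
import json

# Hypothetical two-record JSONL payload, for illustration only.
jsonl_payload = '{"prediction": "a"}\n{"prediction": "b"}\n'

try:
    json.loads(jsonl_payload)  # treats the blob as a single JSON value
except json.JSONDecodeError as e:
    # The exact message depends on the payload; here it is
    # "Extra data: line 2 column 1 (char 20)".
    print(f"whole-file parse fails: {e}")

# Per-line parsing handles the same payload without issue.
records = [json.loads(line) for line in jsonl_payload.splitlines() if line.strip()]
print(records)  # [{'prediction': 'a'}, {'prediction': 'b'}]
```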

Expected Behavior
I would expect the parser to recognize that it is reading a JSONL file rather than regular JSON and parse the predictions line by line.

Actual Behavior
The parser appears to treat the JSONL file as a single JSON document and fails with the errors above.

Steps to Reproduce the Issue
After pre-processing the examples, the prompt, and the incoming inference, I attempt to run the following code with langextract 1.1.0:
```python
# Configure batch settings
batch_config = {
    "enabled": True,
    "threshold": 10,
    "poll_interval": 30,
    "timeout": 3600,
    "enable_caching": True,
    "retention_days": 30,
}

# Running Extraction
results = lx.extract(
    text_or_documents=documents,
    prompt_description=prompt_examples.get('prompt'),
    examples=examples,
    model_id=prompt_examples.get('model'),
    batch_length=1000,
    language_model_params={
        "vertexai": True,
        "project": "your-project-here",
        "location": "us-central1",
        "batch": batch_config,
    },
)
```

The batch job is created correctly, runs through, and the predictions file is created, but then langextract fails when reading it.

Proposed Solution
You can grab the JSONL file and correctly pull out the predictions by reading it line by line; for example, this snippet works:

```python
import json

processed_records = []
with open('lang_extract_batch_output.jsonl', 'r') as f:
    for i, line in enumerate(f):
        try:
            record = json.loads(line)
            processed_records.append(record)
        except json.JSONDecodeError:
            print(f'Skipping record {i} due to bad parse')
```

Using this, each line was placed into the `processed_records` list correctly.
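A slightly more defensive variant of the same idea (the blank-line handling and the `bad_lines` bookkeeping are additions for illustration, not something the tutorial specifies) keeps the unparseable lines around so truncated records can be inspected afterwards:

```python
import json

processed_records, bad_lines = [], []
with open('lang_extract_batch_output.jsonl', 'r') as f:
    for i, line in enumerate(f):
        if not line.strip():
            continue  # tolerate blank lines between records
        try:
            processed_records.append(json.loads(line))
        except json.JSONDecodeError as e:
            bad_lines.append((i, str(e)))  # keep failures for later inspection

print(f"parsed {len(processed_records)} records, {len(bad_lines)} failures")
```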
