Description
When attempting to follow the batch tutorial at https://github.com/google/langextract/blob/main/docs/examples/batch_api_example.md, an error occurs when reading the batch predictions, with the following traceback:
```
Traceback (most recent call last):
  File "/opt/venv/lib/python3.12/site-packages/langextract/core/format_handler.py", line 176, in parse_output
    parsed = json.loads(content)
             ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/json/__init__.py", line 346, in loads
    return _default_decoder.decode(s)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/json/decoder.py", line 338, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/json/decoder.py", line 354, in raw_decode
    obj, end = self.scan_once(s, idx)
               ^^^^^^^^^^^^^^^^^^^^^^
json.decoder.JSONDecodeError: Unterminated string starting at: line 7 column 16 (char 163)

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/venv/lib/python3.12/site-packages/langextract/resolver.py", line 260, in resolve
    extraction_data = self.format_handler.parse_output(
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.12/site-packages/langextract/core/format_handler.py", line 182, in parse_output
    raise exceptions.FormatParseError(msg) from e
langextract.core.exceptions.FormatParseError: Failed to parse JSON content: Unterminated string starting at: line 7 column 16 (char 163)

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/ds/etl.py", line 33, in <module>
    etl_switch()
  File "/ds/etl.py", line 26, in etl_switch
    numeric_extraction.prepare_extract_intro_call_numbers()
  File "/ds/scripts/numeric_extraction.py", line 221, in prepare_extract_intro_call_numbers
    results = lx.extract(
              ^^^^^^^^^^^
  File "/opt/venv/lib/python3.12/site-packages/langextract/__init__.py", line 55, in extract
    return extract_func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.12/site-packages/langextract/extraction.py", line 358, in extract
    return list(result)
           ^^^^^^^^^^^^
  File "/opt/venv/lib/python3.12/site-packages/langextract/annotation.py", line 255, in annotate_documents
    yield from self._annotate_documents_single_pass(
  File "/opt/venv/lib/python3.12/site-packages/langextract/annotation.py", line 388, in _annotate_documents_single_pass
    resolved_extractions = resolver.resolve(
                           ^^^^^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.12/site-packages/langextract/resolver.py", line 271, in resolve
    raise ResolverParsingError(str(e)) from e
langextract.resolver.ResolverParsingError: Failed to parse JSON content: Unterminated string starting at: line 7 column 16 (char 163)
```
It looks like the parser isn't set up to read JSON Lines (JSONL) files and instead expects a single regular JSON document.
Expected Behavior
I would expect the parser to recognize that it is reading a JSONL file rather than regular JSON, and to parse the predictions line by line.
Actual Behavior
The parser appears to read the entire JSONL predictions file as a single regular JSON document and fails with the error above.
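The mismatch can be demonstrated in isolation with the standard library alone (a minimal sketch; the two-record payload below is illustrative, not the actual batch output):

```python
import json

# Two records in JSONL form: one complete JSON object per line.
jsonl_content = '{"extraction": "a"}\n{"extraction": "b"}\n'

# Parsing the whole payload as a single JSON document fails,
# because a second top-level value follows the first one.
try:
    json.loads(jsonl_content)
except json.JSONDecodeError as e:
    print(f"whole-file parse failed: {e}")

# Parsing line by line succeeds.
records = [json.loads(line) for line in jsonl_content.splitlines() if line.strip()]
print(records)
```

This is the same failure mode as in the traceback: `json.loads` is being handed a multi-record JSONL payload.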
Steps to Reproduce the Issue
After pre-processing the examples, the prompt, and the incoming inference data, I run the following code with langextract 1.1.0:
```python
# Configure batch settings
batch_config = {
    "enabled": True,
    "threshold": 10,
    "poll_interval": 30,
    "timeout": 3600,
    "enable_caching": True,
    "retention_days": 30,
}

# Run the extraction
results = lx.extract(
    text_or_documents=documents,
    prompt_description=prompt_examples.get('prompt'),
    examples=examples,
    model_id=prompt_examples.get('model'),
    batch_length=1000,
    language_model_params={
        "vertexai": True,
        "project": "your-project-here",
        "location": "us-central1",
        "batch": batch_config,
    },
)
```
The batch job is created correctly, runs through, and the predictions file is created, but then langextract fails when reading it.
Proposed Solution
You can take the JSONL file and correctly pull out the predictions by reading it line by line; for example, this snippet works:
```python
import json

processed_records = []
with open('lang_extract_batch_output.jsonl', 'r') as f:
    for i, line in enumerate(f):
        try:
            record = json.loads(line)
            processed_records.append(record)
        except json.JSONDecodeError:
            print(f'Skipping record {i} due to bad parse')
```
Using this, each line was placed into the `processed_records` list correctly.
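A fix inside langextract could keep the existing single-document path and only fall back to per-line parsing when it fails. The sketch below is a hypothetical helper, not the actual `format_handler` API; the function name and fallback behavior are assumptions for illustration:

```python
import json

def parse_predictions(content: str) -> list:
    """Hypothetical helper: try whole-document JSON first, then fall back to JSONL."""
    try:
        # Regular JSON: the whole payload is one document.
        return [json.loads(content)]
    except json.JSONDecodeError:
        pass  # Likely a JSONL payload; parse it record by record.

    records = []
    for i, line in enumerate(content.splitlines()):
        line = line.strip()
        if not line:
            continue  # Skip blank lines between records.
        try:
            records.append(json.loads(line))
        except json.JSONDecodeError:
            print(f'Skipping record {i} due to bad parse')
    return records
```

This keeps existing regular-JSON behavior intact while accepting the JSONL output that Vertex AI batch prediction produces.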