Skip to content

⚡ Optimize JSON Lines (_parse_jsonl) bulk loading#118

Open
wjohns989 wants to merge 1 commit into
mainfrom
perf/jsonl-parser-optimization-13904058837291850027
Open

⚡ Optimize JSON Lines (_parse_jsonl) bulk loading#118
wjohns989 wants to merge 1 commit into
mainfrom
perf/jsonl-parser-optimization-13904058837291850027

Conversation

@wjohns989

Copy link
Copy Markdown
Owner

💡 What: Replaced the line-by-line json.loads() loop inside _parse_jsonl with a fast-path attempt that wraps the file content in a single JSON array (e.g. [{}, {}]) and decodes it via a single C-optimized json.loads call. It falls back gracefully to the original looping logic if decoding fails.

🎯 Why: The previous approach incurred significant overhead due to executing the json.loads function call for every line in the file. Bulk processing offloads the parsing completely to the underlying optimized JSON C extension.

📊 Measured Improvement: In a benchmark using 10,000 JSONL lines, the baseline execution time was 0.051 seconds, while the new bulk-parsing implementation took 0.029 seconds. This represents an approximate 43% performance improvement on pure parsing time.


PR created automatically by Jules for task 13904058837291850027 started by @wjohns989

Replaced the line-by-line `json.loads()` loop with a single bulk
parse attempt by wrapping lines in a JSON array. Falls back
gracefully to the loop if bulk loading fails.

Co-authored-by: wjohns989 <56205870+wjohns989@users.noreply.github.com>
@google-labs-jules

Copy link
Copy Markdown

👋 Jules, reporting for duty! I'm here to lend a hand with this pull request.

When you start a review, I'll add a 👀 emoji to each comment to let you know I've read it. I'll focus on feedback directed at me and will do my best to stay out of conversations between you and other bots or reviewers to keep the noise down.

I'll push a commit with your requested changes shortly after. Please note there might be a delay between these steps, but rest assured I'm on the job!

For more direct control, you can switch me to Reactive Mode. When this mode is on, I will only act on comments where you specifically mention me with @jules. You can find this option in the Pull Request section of your global Jules UI settings. You can always switch back!

New to Jules? Learn more at jules.google/docs.


For security, I will only act on instructions from the user who triggered this task.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: dd2120aaf3

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".



def _parse_jsonl(path: Path) -> str:
raw_content = path.read_text(encoding="utf-8", errors="replace").strip()

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Stream JSONL input instead of loading whole file

Avoid reading the entire JSONL file into memory before parsing. path.read_text(...).strip() forces a full copy of the source, and the fast path then allocates another large bulk_json string; for large chat/session exports this can multiply peak memory and raise MemoryError or kill the ingestion process. The previous implementation streamed line-by-line and did not have this failure mode, so this change introduces a scalability regression in _parse_jsonl.

Useful? React with 👍 / 👎.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request refactors the _parse_jsonl function in muninn/ingestion/parser.py to introduce a bulk-parsing fast path and applies various formatting improvements across the parser module. Feedback indicates that the new JSONL parsing logic introduces a memory regression by loading entire files into memory multiple times, which could lead to significant spikes for large files. It is recommended to iterate over the file handle directly and refactor duplicated processing logic into a helper function to improve both resource efficiency and maintainability.

Comment on lines +192 to 230
raw_content = path.read_text(encoding="utf-8", errors="replace").strip()
if not raw_content:
return ""

raw_lines = [line.strip() for line in raw_content.splitlines() if line.strip()]
if not raw_lines:
return ""

lines: List[str] = []
with path.open("r", encoding="utf-8", errors="replace") as handle:
for raw in handle:
raw_line = raw.strip()
if not raw_line:
continue
try:
payload = json.loads(raw_line)
except json.JSONDecodeError:
lines.append(raw_line)
continue

# Fast path: bulk parse
bulk_json = "[" + ",".join(raw_lines) + "]"
try:
payloads = json.loads(bulk_json)
for payload in payloads:
extracted: List[str] = []
_extract_chat_lines(payload, extracted)
if extracted:
lines.extend(extracted)
else:
lines.append(json.dumps(payload, ensure_ascii=False, sort_keys=True))
return _truncate_output("\n".join(lines))
except json.JSONDecodeError:
pass

# Fallback path: loop
for raw_line in raw_lines:
try:
payload = json.loads(raw_line)
extracted = []
_extract_chat_lines(payload, extracted)
if extracted:
lines.extend(extracted)
else:
lines.append(json.dumps(payload, ensure_ascii=False, sort_keys=True))
except json.JSONDecodeError:
lines.append(raw_line)

return _truncate_output("\n".join(lines))

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The _parse_jsonl implementation introduces a significant memory regression and contains duplicated logic.

  1. Memory Efficiency: By using path.read_text().splitlines(), the entire file is loaded into memory as a string, then as a list of strings, and finally joined into a third large string for bulk parsing. For a 100MB file (the maximum allowed), this can result in a ~300MB peak memory spike per worker. Iterating over the file handle directly to build raw_lines avoids the first intermediate string spike.
  2. Maintainability: The logic for processing a JSON payload (extracting chat lines or falling back to json.dumps) is duplicated between the fast path and the fallback path. Refactoring this into a local helper function improves clarity and ensures consistency.

This refactor aligns with the SOTA+ philosophy of 'Production-Grade' and 'Quality Over Speed' by optimizing resource usage while improving code structure.

    with path.open("r", encoding="utf-8", errors="replace") as handle:
        raw_lines = [line.strip() for line in handle if line.strip()]

    if not raw_lines:
        return ""

    lines: List[str] = []

    def process_payload(payload):
        extracted: List[str] = []
        _extract_chat_lines(payload, extracted)
        if extracted:
            lines.extend(extracted)
        else:
            lines.append(json.dumps(payload, ensure_ascii=False, sort_keys=True))

    # Fast path: bulk parse
    bulk_json = "[" + ",".join(raw_lines) + "]"
    try:
        payloads = json.loads(bulk_json)
        for payload in payloads:
            process_payload(payload)
        return _truncate_output("\n".join(lines))
    except json.JSONDecodeError:
        pass

    # Fallback path: line-by-line loop
    for raw_line in raw_lines:
        try:
            payload = json.loads(raw_line)
            process_payload(payload)
        except json.JSONDecodeError:
            lines.append(raw_line)

    return _truncate_output("\n".join(lines))
References
  1. Adherence to SOTA+ standards of precision and quality, ensuring production-grade robustness and optimal resource usage. (link)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant