Skip to content

feat(ingestion): streaming JSON tokenizer for high-volume ingestion (…#552

Open
vaishnavidesai09 wants to merge 1 commit into
StellarFlow-Network:mainfrom
vaishnavidesai09:issue-500-streaming-json
Open

feat(ingestion): streaming JSON tokenizer for high-volume ingestion (…#552
vaishnavidesai09 wants to merge 1 commit into
StellarFlow-Network:mainfrom
vaishnavidesai09:issue-500-streaming-json

Conversation

@vaishnavidesai09

Copy link
Copy Markdown

feat(ingestion): streaming JSON tokenizer for high-volume ingestion (#500)

Closes #500

Problem

Large nested JSON payloads were loaded entirely into memory before any
price attribute could be extracted, causing spikes in RSS during
high-volume ingestion bursts.

Solution

Added iter_price_events_from_stream and build_segments_from_stream to
src/ingestion/parser.py. Both functions use ijson's SAX-style event
emitter (ijson.parse) to walk the token stream incrementally — only the
current candidate frame dict is assembled at any point, so memory usage
is O(frame) rather than O(document).

API

from ingestion.parser import iter_price_events_from_stream, build_segments_from_stream

# file-like object or raw bytes
for asset_id, price, ts, seq, flags in iter_price_events_from_stream(response.raw):
    ...

segments = build_segments_from_stream(open("dump.json", "rb"), segment_size=256)

What's unchanged

  • iter_flat_ticker_tuples, flatten_telemetry_frames, build_telemetry_segments — untouched
  • src/ingestion/parsers.py compat re-exports updated to include new symbols

Tests

13 new tests in StreamingParserTests; all 17 tests pass.

@drips-wave

drips-wave Bot commented Jun 28, 2026

Copy link
Copy Markdown

@vaishnavidesai09 Great news! 🎉 Based on an automated assessment of this PR, the linked Wave issue(s) no longer count against your application limits.

You can now already apply to more issues while waiting for a review of this PR. Keep up the great work! 🚀

Learn more about application limits

@vaishnavidesai09

Copy link
Copy Markdown
Author

Hi @Sadeequ! Here's my PR for issue #500.
Refactored src/ingestion/parser.py to add iter_price_events_from_stream and build_segments_from_stream which use ijson's SAX-style tokenizer to parse JSON incrementally — memory usage stays bounded to a single frame dict regardless of document size. Updated parsers.py compat exports and added 13 new tests (17 total, all passing).
Please let me know if any changes are needed!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

📥 Ingestion-Throughput | Streaming JSON Tokenizers for High-Volume Ingestions

1 participant