Skip to content

Conversation

Alex-PLACET
Copy link
Member

No description provided.

@Alex-PLACET Alex-PLACET requested a review from Copilot October 3, 2025 14:00
Copy link

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

Introduces a new streaming-based serialization architecture replacing vector-returning functions with pluggable output_stream abstractions (memory, file, chunked) and adds higher-level serializer / chunk_serializer utilities. Refactors flatbuffer construction logic into flatbuffer_utils, moves type mapping out of utils, and adds size‑estimation helpers for preallocation.

  • Added output_stream interface with memory, file, and chunked implementations plus new (chunk_)serializer classes
  • Refactored flatbuffer and body serialization into modular helpers; functions now write directly to streams
  • Added size estimation utilities (calculate_*_message_size / calculate_total_serialized_size) and extensive new tests

Reviewed Changes

Copilot reviewed 29 out of 29 changed files in this pull request and generated 20 comments.

Show a summary per file
File Description
src/utils.cpp Removes flatbuffer type logic; retains parsing & alignment helpers
src/flatbuffer_utils.cpp / include/sparrow_ipc/flatbuffer_utils.hpp New central flatbuffer construction and buffer/node utilities
src/serialize_utils.cpp / include/sparrow_ipc/serialize_utils.hpp Stream-oriented serialization helpers and size calculators
src/serialize.cpp / include/sparrow_ipc/serialize.hpp Stream-based schema & record batch serialization entry points
src/serializer.cpp / include/sparrow_ipc/serializer.hpp Adds serializer class for continuous IPC stream writing
src/chunk_memory_serializer.cpp / include/sparrow_ipc/chunk_memory_serializer.hpp Adds chunked (per-message vector) serialization
include/sparrow_ipc/output_stream.hpp Defines abstract output_stream interface
include/sparrow_ipc/memory_output_stream.hpp In-memory implementation
include/sparrow_ipc/file_output_stream.hpp / src/file_output_stream.cpp File-backed implementation
include/sparrow_ipc/chunk_memory_output_stream.hpp Chunked multi-vector output stream (name misspelled)
tests/* Updated & expanded tests for new streaming APIs and utilities
CMakeLists.txt / tests/CMakeLists.txt Adds new sources & headers to build system
include/sparrow_ipc/utils.hpp API changes: align_to_8 signature and new parse_format exposure
Other headers Minor adjustments (e.g., added )

Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.

Comment on lines 38 to 47
template <std::ranges::input_range R>
requires std::same_as<std::ranges::range_value_t<R>, sparrow::record_batch>
serializer(const R& record_batches, output_stream& stream)
: m_pstream(&stream)
, m_dtypes(get_column_dtypes(record_batches[0]))
{
if (record_batches.empty())
{
throw std::invalid_argument("Record batches collection is empty");
}
Copy link

Copilot AI Oct 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The multi-batch constructor writes the first record batch twice: once via append(record_batches) (which iterates all elements) after already serializing record_batches[0]. This duplicates the first batch in the output. Either start the append loop from the second element or exclude record_batches[0] from append.

Copilot uses AI. Check for mistakes.

};
m_pstream->reserve(reserve_function);
serialize_schema_message(record_batches[0], *m_pstream);
append(record_batches);
Copy link

Copilot AI Oct 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The multi-batch constructor writes the first record batch twice: once via append(record_batches) (which iterates all elements) after already serializing record_batches[0]. This duplicates the first batch in the output. Either start the append loop from the second element or exclude record_batches[0] from append.

Suggested change
append(record_batches);
if (std::ranges::distance(record_batches) > 1) {
append(std::ranges::subrange(std::next(record_batches.begin()), record_batches.end()));
}

Copilot uses AI. Check for mistakes.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No we first serialize the schema, then we serialize the arrays

Comment on lines 21 to 33
template <std::ranges::input_range R>
requires std::same_as<std::ranges::range_value_t<R>, sparrow::record_batch>
chunk_serializer(
const R& record_batches,
chuncked_memory_output_stream<std::vector<std::vector<uint8_t>>>& stream
)
: m_pstream(&stream)
{
if (record_batches.empty())
{
throw std::invalid_argument("Record batches collection is empty");
}
m_dtypes = get_column_dtypes(record_batches[0]);
Copy link

Copilot AI Oct 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same duplication issue: the first record batch is serialized twice (once explicitly and once through append(record_batches)). Adjust append to skip the first element or iterate from the second batch here.

Copilot uses AI. Check for mistakes.

@codecov-commenter
Copy link

codecov-commenter commented Oct 3, 2025

⚠️ Please install the 'codecov app svg image' to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

❌ Patch coverage is 89.48413% with 53 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (main@25a9ef6). Learn more about missing BASE report.

Files with missing lines Patch % Lines
src/serializer.cpp 40.00% 21 Missing ⚠️
src/flatbuffer_utils.cpp 91.74% 18 Missing ⚠️
include/sparrow_ipc/chunk_memory_output_stream.hpp 90.90% 3 Missing ⚠️
include/sparrow_ipc/output_stream.hpp 70.00% 3 Missing ⚠️
include/sparrow_ipc/serializer.hpp 90.62% 3 Missing ⚠️
src/chunk_memory_serializer.cpp 93.54% 2 Missing ⚠️
src/file_output_stream.cpp 93.54% 2 Missing ⚠️
src/serialize_utils.cpp 96.00% 1 Missing ⚠️
❗ Your organization needs to install the Codecov GitHub app to enable full functionality.
Additional details and impacted files
@@           Coverage Diff           @@
##             main      #29   +/-   ##
=======================================
  Coverage        ?   78.46%           
=======================================
  Files           ?       32           
  Lines           ?     1291           
  Branches        ?        0           
=======================================
  Hits            ?     1013           
  Misses          ?      278           
  Partials        ?        0           
Flag Coverage Δ
unittests 78.46% <89.48%> (?)

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@Alex-PLACET Alex-PLACET marked this pull request as ready for review October 7, 2025 11:05
@Alex-PLACET Alex-PLACET force-pushed the add_output_stream_and_serializers branch from 14952d0 to 75c9827 Compare October 7, 2025 12:41
CHECK_EQ(utils::align_to_8(15), 16);
CHECK_EQ(utils::align_to_8(16), 16);
}

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved to another test file

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved from another test file

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants