Handle lz4 compression #30
base: main
Conversation
Codecov Report
@@ Coverage Diff @@
## main #30 +/- ##
=======================================
Coverage ? 77.52%
=======================================
Files ? 32
Lines ? 1477
Branches ? 0
=======================================
Hits ? 1145
Misses ? 332
Partials ? 0
There are still a few things missing and some room for improvement, but I suggest merging this ASAP to avoid further conflicts (I just resolved the ones that appeared after merging #29 and had to rework the compression in the serialization part). For now, this PR is just to get something working.
Can you add tests which only test the compression/decompression of a buffer?
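A minimal round-trip sketch of such a test, using the liblz4 frame API directly (the doctest framework and the buffer contents are assumptions, not code from this PR):

#include <cstdint>
#include <vector>

#include <lz4frame.h>

#include <doctest/doctest.h>  // assumption: the project's existing test framework

TEST_CASE("lz4 compression round trip on a single buffer")
{
    const std::vector<uint8_t> original(4096, 0x42);

    // Compress the buffer into a single LZ4 frame.
    std::vector<uint8_t> compressed(LZ4F_compressFrameBound(original.size(), nullptr));
    const size_t compressed_size = LZ4F_compressFrame(
        compressed.data(), compressed.size(),
        original.data(), original.size(), nullptr);
    REQUIRE_FALSE(LZ4F_isError(compressed_size));
    compressed.resize(compressed_size);

    // Decompress it back and compare with the original bytes.
    LZ4F_dctx* dctx = nullptr;
    REQUIRE_FALSE(LZ4F_isError(LZ4F_createDecompressionContext(&dctx, LZ4F_VERSION)));
    std::vector<uint8_t> decompressed(original.size());
    size_t dst_size = decompressed.size();
    size_t src_size = compressed.size();
    const size_t status = LZ4F_decompress(dctx, decompressed.data(), &dst_size, compressed.data(), &src_size, nullptr);
    LZ4F_freeDecompressionContext(dctx);
    REQUIRE_FALSE(LZ4F_isError(status));
    CHECK(decompressed == original);
}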
 * @param compression Optional: The compression type to use for record batch bodies.
 */
- chunk_serializer(chunked_memory_output_stream<std::vector<std::vector<uint8_t>>>& stream);
+ chunk_serializer(chunked_memory_output_stream<std::vector<std::vector<uint8_t>>>& stream, std::optional<org::apache::arrow::flatbuf::CompressionType> compression = std::nullopt);
Create a sparrow-ipc enum to keep public signatures free from flatbuffers
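A possible shape for that (hypothetical names; the exact spelling of the generated flatbuffers enumerators depends on the flatc options used):

// Public header: no flatbuffers include needed.
namespace sparrow_ipc
{
    enum class compression_type
    {
        lz4_frame,
        zstd
    };
}

// Internal .cpp only, where Message_generated.h may be included.
org::apache::arrow::flatbuf::CompressionType to_flatbuf(sparrow_ipc::compression_type type)
{
    switch (type)
    {
        case sparrow_ipc::compression_type::lz4_frame:
            return org::apache::arrow::flatbuf::CompressionType::LZ4_FRAME;
        case sparrow_ipc::compression_type::zstd:
            return org::apache::arrow::flatbuf::CompressionType::ZSTD;
    }
    throw std::invalid_argument("unknown compression type");
}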
auto validity_buffer_span = utils::get_and_decompress_buffer(record_batch, body, buffer_index, compression, decompressed_buffers);

const auto [bitmap_ptr, null_count] = utils::get_bitmap_pointer_and_null_count(validity_buffer_span, record_batch.length());

auto offset_buffer_span = utils::get_and_decompress_buffer(record_batch, body, buffer_index, compression, decompressed_buffers);
auto data_buffer_span = utils::get_and_decompress_buffer(record_batch, body, buffer_index, compression, decompressed_buffers);
I think that trying not to have one branch for compression and another for no compression leads to code complexity and a lack of visibility.
IMO, we should have two functions: one which handles compressed data and another for when there is no compression.
Thanks to this separation, you will not have get_and_decompress_buffer with this strange behavior.
get_and_decompress_buffer should be transformed into a get_uncompressed_data which returns a std::vector<uint8_t>, and the caller moves the result buffers into a vector of buffers.
And you keep the original code for uncompressed data.
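A sketch of the proposed split (hypothetical signatures, just to illustrate the suggestion):

// Compressed path: decompression has to materialize bytes, so return an
// owning buffer; the caller moves it into its vector of kept-alive buffers.
std::vector<uint8_t> get_uncompressed_data(
    const org::apache::arrow::flatbuf::RecordBatch& record_batch,
    std::span<const uint8_t> body,
    size_t buffer_index,
    org::apache::arrow::flatbuf::CompressionType compression);

// Uncompressed path: keeps the original zero-copy code and returns a view into the body.
std::span<const uint8_t> get_buffer_view(
    const org::apache::arrow::flatbuf::RecordBatch& record_batch,
    std::span<const uint8_t> body,
    size_t buffer_index);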
(same for deserialize_primitive_array.hpp, btw)
if (compression.has_value())
{
    // If compressed, the body size is the sum of compressed buffer sizes + original size prefixes + padding
    auto [compressed_body, compressed_buffers] = generate_compressed_body_and_buffers(record_batch, compression.value());
We don't want to compress the data just to calculate the size of the message.
I saw that LZ4F_compressFrameBound can give the maximum size of the compressed buffer; I think this should be used instead.
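For reference, that function does exist in liblz4's frame API; a reservation helper based on it could look like this (sketch):

#include <lz4frame.h>

// Worst-case size of the LZ4 frame produced from `uncompressed_size` input
// bytes, usable for reserving memory up front. Note it is an upper bound,
// not the exact compressed size.
size_t lz4_frame_size_bound(size_t uncompressed_size)
{
    return LZ4F_compressFrameBound(uncompressed_size, nullptr);
}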
No, you want the exact size, not the maximum size (which is going to be some trivial calculation such as uncompressed size + K).
This function is used for the memory reservation, not for the message header.
Can you know the compressed size without compressing the data first?
No, you can't.
We will have an issue with the fill_buffers function in flatbuffer_utils.cpp.
In this function we create the flatbuffer::Buffer entries, which hold the offset and size of each buffer in the body.
As the sizes of the buffers are unknown before the data is compressed, you can't create the record_batch message.
It means that we have to compress the buffers before creating the message and keep the compressed buffers in memory.
Once all the buffers are compressed, we can finally create and send the record_batch message.
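Roughly, the flow would have to become something like this (a sketch with hypothetical helper names, not code from this PR):

// Phase 1: compress every buffer first, recording final offsets and sizes.
std::vector<std::vector<uint8_t>> compressed_buffers;
std::vector<org::apache::arrow::flatbuf::Buffer> buffer_metadata;
int64_t offset = 0;
for (const auto& buffer : buffers_of(record_batch))  // hypothetical accessor
{
    std::vector<uint8_t> compressed = compress_buffer(buffer, compression);  // hypothetical helper
    const auto size = static_cast<int64_t>(compressed.size());
    buffer_metadata.emplace_back(offset, size);  // flatbuf::Buffer{offset, length}
    offset += align_to_8(size);  // hypothetical 8-byte padding helper
    compressed_buffers.push_back(std::move(compressed));
}

// Phase 2: only now can fill_buffers build the RecordBatch message, since the
// flatbuffer::Buffer entries need the compressed offsets and sizes; afterwards
// the kept-alive compressed_buffers are written out as the body.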
BTW, a test where we try to deserialize our serialized data with compression is missing. It should not work, because of what I said in the previous message.
I'm wondering whether we should split the code execution into two different branches for compressed vs uncompressed buffers when we create record batch messages.
Trying to keep the same code path seems to lead to code complexity without much benefit.
> I'm wondering whether we should split the code execution into two different branches for compressed vs uncompressed buffers when we create record batch messages.

No, this is really a bad idea, because you don't want to write buffers as compressed when they are not compressible (see my other comments about this).
#include <sparrow/record_batch.hpp>

#include "Message_generated.h"
For the record, in Arrow C++ we ensure that flatbuffers headers (and any other dependency) are not exposed through public Arrow headers.
memory_output_stream stream(buffer);
any_output_stream astream(stream);
- serialize_record_batch(rb, astream);
+ serialize_record_batch(rb, astream, m_compression);
Side note: this concatenates all output buffers as a single chunk even though we have chunked_memory_output_stream, which would avoid such copies. It is a bit of a waste.
if (data.empty())
{
    return {};
}
Hmm, this should never happen according to the Flatbuffers spec. Did you encounter this situation somewhere?
if (compression.has_value())
{
    // If compressed, the body size is the sum of compressed buffer sizes + original size prefixes + padding
    auto [compressed_body, compressed_buffers] = generate_compressed_body_and_buffers(record_batch, compression.value());
    actual_body_size = compressed_body.size();
}
else
{
    // If not compressed, the body size is the sum of uncompressed buffer sizes with padding
    actual_body_size = static_cast<std::size_t>(calculate_body_size(record_batch));
}
The same method should be able to handle both the compressed and uncompressed case, especially as you want to transmit non-compressible data uncompressed anyway.
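This is also what the Arrow IPC format itself supports: when body compression is declared, each buffer is prefixed with its uncompressed length as a little-endian int64, and a prefix of -1 marks a buffer whose bytes were left uncompressed because compression did not pay off. A per-buffer sketch (hypothetical helper names; assumes a little-endian host):

#include <cstdint>
#include <cstring>
#include <span>
#include <vector>

std::vector<uint8_t> encode_body_buffer(std::span<const uint8_t> raw,
                                        org::apache::arrow::flatbuf::CompressionType codec)
{
    std::vector<uint8_t> compressed = compress_buffer(raw, codec);  // hypothetical helper
    const bool keep_compressed = compressed.size() < raw.size();

    // Arrow IPC: 8-byte little-endian uncompressed length prefix; -1 means
    // the payload that follows is stored uncompressed.
    const int64_t prefix = keep_compressed ? static_cast<int64_t>(raw.size()) : int64_t{-1};
    std::vector<uint8_t> out(sizeof(prefix));
    std::memcpy(out.data(), &prefix, sizeof(prefix));

    if (keep_compressed)
    {
        out.insert(out.end(), compressed.begin(), compressed.end());
    }
    else
    {
        out.insert(out.end(), raw.begin(), raw.end());
    }
    return out;
}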
src/serialize_utils.cpp (outdated)
// Write original size (8 bytes) followed by compressed data
compressed_body.insert(compressed_body.end(), reinterpret_cast<const uint8_t*>(&original_size), reinterpret_cast<const uint8_t*>(&original_size) + sizeof(int64_t));
compressed_body.insert(compressed_body.end(), compressed_buffer_data.begin(), compressed_buffer_data.end());

// Add padding to the compressed data
compressed_body.insert(compressed_body.end(), padding_needed, 0);
This is going to resize and copy the data multiple times. Why not reuse the chunked memory output stream instead?
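A sketch of streaming the pieces directly, assuming the stream exposes a write(const uint8_t*, size_t)-style method (hypothetical API):

// Write the prefix, payload, and padding straight to the stream instead of
// growing a temporary vector.
const int64_t original_size = static_cast<int64_t>(uncompressed.size());
astream.write(reinterpret_cast<const uint8_t*>(&original_size), sizeof(original_size));
astream.write(compressed_buffer_data.data(), compressed_buffer_data.size());

static constexpr uint8_t zeros[8] = {};
astream.write(zeros, padding_needed);  // padding_needed < 8 for 8-byte alignment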
Only handling the lz4 codec for now (zstd will be handled in a follow-up PR).