Implement batched I/O submission optimization for snapshot dump#1969

Open

maxkozlovsky wants to merge 1 commit into main from max/batched-io-submission

Conversation

@maxkozlovsky
Contributor

Add deferred I/O submission with batching to reduce syscall overhead during snapshot dumps. Instead of calling io_uring_submit() for every read operation, accumulate up to 64 SQEs before submitting them in a single syscall.

Key changes:

  • Add pending_sqes_since_submit_ counter and BATCH_SIZE (64) constant
  • Implement submit_request_deferred_() for deferred SQE preparation
  • Add flush_pending_reads() to manually flush pending operations
  • Update traverse layer to use deferred submission (defer_submit=true)
  • Add batching to dequeue_concurrent_read_ios_pending()
  • Smart flush on blocking wait in poll_uring_()

The optimization reduces io_uring_submit() syscalls by ~64x, significantly improving snapshot dump performance through reduced kernel context switching and better batched I/O processing.
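The accumulate-then-flush flow described above can be sketched with the io_uring calls replaced by a counter, so the batching arithmetic is visible in isolation. The names BATCH_SIZE, pending_sqes_since_submit_, submit_request_deferred_(), and flush_pending_reads() follow the PR; the submit_calls field is a stand-in for io_uring_submit(), not the real liburing API.

```cpp
#include <cassert>
#include <cstddef>

// Sketch of the deferred-submission scheme: prepare SQEs one at a time,
// but only pay for a submit syscall once a full batch has accumulated
// (or when a flush is forced, e.g. before a blocking wait).
struct batched_submitter {
    static constexpr std::size_t BATCH_SIZE = 64;
    std::size_t pending_sqes_since_submit_ = 0;
    std::size_t submit_calls = 0; // stand-in for io_uring_submit() invocations

    // Deferred path: count the prepared SQE, auto-flush when the batch fills.
    void submit_request_deferred_() {
        ++pending_sqes_since_submit_;
        if (pending_sqes_since_submit_ >= BATCH_SIZE) {
            flush_pending_reads();
        }
    }

    // Manual flush, also invoked before any blocking wait in poll_uring_().
    void flush_pending_reads() {
        if (pending_sqes_since_submit_ > 0) {
            ++submit_calls; // one syscall covers the whole batch
            pending_sqes_since_submit_ = 0;
        }
    }
};
```

With this shape, 128 deferred reads cost two submit calls instead of 128, which is where the ~64x reduction comes from.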

This code was generated using Claude Sonnet 4.5

@maxkozlovsky maxkozlovsky force-pushed the max/batched-io-submission branch from b263727 to 3fdea31 on December 10, 2025 at 14:56
Base automatically changed from max/queue-sizes to main December 10, 2025 19:27
@maxkozlovsky maxkozlovsky force-pushed the max/batched-io-submission branch from 3fdea31 to 76da738 on December 10, 2025 at 19:29
@maxkozlovsky maxkozlovsky marked this pull request as ready for review December 10, 2025 19:29
Copilot AI review requested due to automatic review settings December 10, 2025 19:29

Copilot AI left a comment


Pull request overview

This PR implements a batched I/O submission optimization for snapshot dumps to reduce syscall overhead. Instead of calling io_uring_submit() for every read operation, the changes accumulate up to 64 SQEs (Submission Queue Entries) before submitting them in a single syscall, aiming to reduce the number of syscalls by ~64x.

Key changes:

  • Added deferred submission mechanism with defer_submit parameter throughout the I/O stack
  • Introduced pending_sqes_since_submit_ counter and BATCH_SIZE constant (64) for batching
  • Implemented auto-flush when batch is full and smart flush before blocking waits
  • Updated traverse layer to use deferred submission for better batching during tree traversal

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 10 comments.

Summary per file:

  • category/async/io.hpp: Added deferred submission API with defer_submit parameter, flush_pending_reads() method, and pending_sqes_since_submit_ counter tracking
  • category/async/io.cpp: Implemented batching logic with submit_request_deferred_(), auto-flush at BATCH_SIZE, and smart flush in poll_uring_() before blocking
  • category/async/io_senders.hpp: Added defer_submit_ member to read_single_buffer_sender to propagate the deferred submission flag through the sender/receiver chain
  • category/mpt/trie.hpp: Updated read_short_update_sender and async_read() to accept and forward the defer_submit parameter
  • category/mpt/traverse.hpp: Modified traverse operations to use defer_submit=true so reads are batched during tree traversal
  • category/core/io/ring.hpp: Removed the unused must_call_uring_submit() method, previously used by now-deleted SQPOLL/IOPOLL checking code

Critical Issues Found:

  1. Counter reset on write operations (lines 423, 481 in io.cpp): The pending_sqes_since_submit_ counter tracks deferred read submissions but is reset to 0 after write operations. This can leave deferred read SQEs unsubmitted indefinitely if writes and reads are interleaved.

  2. Missing flush in submission queue full handler (line 355 in io.hpp): poll_uring_while_submission_queue_full_() is called before deferred submission but doesn't flush pending SQEs, potentially causing a deadlock where the code waits for completions but hasn't submitted all pending operations.

  3. Concurrent read limit bypasses defer_submit (lines 347-352 in io.hpp): When the concurrent read limit is reached, operations are queued in concurrent_read_ios_pending_ and the defer_submit flag is lost, causing inconsistent batching behavior.

  4. Counter mixing across contexts (lines 542-549 in io.cpp): The pending_sqes_since_submit_ counter is shared between different submission paths without considering that operations might target different rings.



Comment on lines +573 to +577
    // Submit remaining SQEs after exiting loop if any were processed
    if (any_sqes_added && pending_sqes_since_submit_ > 0) {
        MONAD_ASYNC_IO_URING_RETRYABLE(io_uring_submit(other_ring));
        pending_sqes_since_submit_ = 0;
    }

Copilot AI Dec 10, 2025


After the while loop exits in dequeue_concurrent_read_ios_pending, there's a check to submit remaining SQEs if any were processed (lines 573-577). However, this flush only happens if any_sqes_added is true. If the loop doesn't process any SQEs because the conditions aren't met, but there are pending SQEs from previous calls to submit_request_deferred_(), those pending SQEs won't be flushed here. This is correct behavior, but it means the pending SQEs might remain unsubmitted for an extended period. Consider adding a comment explaining that pending SQEs from other sources are intentionally not flushed here.

};

-static_assert(sizeof(read_single_buffer_sender) == 40);
+static_assert(sizeof(read_single_buffer_sender) == 48);

Copilot AI Dec 10, 2025


The sizeof assertion has been updated from 40 to 48 bytes to account for the new defer_submit_ boolean member. However, adding a single bool (1 byte) to a 40-byte structure shouldn't increase the size to 48 bytes unless there's significant padding. With 8-byte alignment, the actual increase should be 8 bytes (40 -> 48), which suggests the bool is being padded to 8 bytes due to its placement after the buffer_ member. Consider reordering members to minimize padding, such as placing defer_submit_ before buffer_ or grouping it with other small members if any exist, to potentially reduce the structure size.
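The padding effect the review describes can be demonstrated with hypothetical layouts (these are not the real read_single_buffer_sender members, whose exact fields the PR does not show here; sizes assume a typical LP64 platform with 8-byte pointers). Appending a bool to a struct whose 40 bytes are fully occupied pushes sizeof to 48, while grouping the bool next to an existing sub-8-byte member can keep it at 40.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical 40-byte sender: five 8-byte members, no tail padding.
struct sender_before {
    void *a, *b, *c, *d, *e;
};
static_assert(sizeof(sender_before) == 40, "assumes LP64 layout");

// Appending a 1-byte bool forces 7 bytes of tail padding to keep
// 8-byte alignment: 41 rounds up to 48.
struct sender_appended {
    void *a, *b, *c, *d, *e;
    bool defer_submit_;
};
static_assert(sizeof(sender_appended) == 48, "assumes LP64 layout");

// If the struct instead had a 4-byte member, placing the bool right
// after it reuses existing padding and sizeof stays at 40.
struct sender_grouped {
    void *a, *b, *c, *d;     // 32 bytes
    std::uint32_t n;         // offset 32..35
    bool defer_submit_;      // offset 36, tail-padded to 40
};
static_assert(sizeof(sender_grouped) == 40, "assumes LP64 layout");
```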


prepare_read_sqe_(sqe, buffers, chunk_and_offset, uring_data, prio);
MONAD_ASYNC_IO_URING_RETRYABLE(io_uring_submit(&uring_.get_ring()));
pending_sqes_since_submit_ = 0;

Copilot AI Dec 10, 2025


Similar to the issue with write operations at line 481, this write operation also resets pending_sqes_since_submit_ to 0 (line 423). The counter is meant to track deferred read submissions, but resetting it after a write submission could leave deferred read SQEs unsubmitted. This is especially problematic if writes and reads are interleaved. Consider only resetting the counter after read operations that actually submit to the same ring where pending reads are queued.

Suggested change: delete the line pending_sqes_since_submit_ = 0;
Comment on lines +360 to +361
// Defer submission to batch reads together
async_read(sender.aux, std::move(receiver), true);

Copilot AI Dec 10, 2025


Same issue as lines 256-258: the comment explains deferred submission for batching, but there's no explicit flush mechanism documented or implemented nearby. The code relies on implicit flushing during polling operations.

return size_t(-1); // we never complete immediately
}

poll_uring_while_submission_queue_full_();

Copilot AI Dec 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The poll_uring_while_submission_queue_full_() function is called before submit_request_deferred_() to ensure submission queue space is available. However, if there are pending deferred SQEs from previous calls, and the submission queue becomes full, this function may poll for completions but won't flush the pending SQEs. This could lead to a deadlock-like situation where we're waiting for completions but haven't submitted all the pending operations. Consider flushing pending SQEs in poll_uring_while_submission_queue_full_() before polling.
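The hazard and the suggested fix can be captured in a minimal simulation: a blocking wait can only make progress on work that was actually submitted, so deferred SQEs must be flushed before the wait. The state variables and helper functions below are illustrative stand-ins for the PR's io_uring plumbing, not its real API.

```cpp
#include <cassert>

int pending_sqes_since_submit_ = 3; // deferred reads, not yet submitted
int in_flight = 0;                  // SQEs the simulated kernel is processing
int free_queue_slots = 0;           // submission queue is currently full

void flush_pending_reads() {
    in_flight += pending_sqes_since_submit_; // one batched submit
    pending_sqes_since_submit_ = 0;
}

bool submission_queue_full() { return free_queue_slots == 0; }

void wait_for_one_completion() {
    // Blocking here with in_flight == 0 would hang forever: nothing has
    // been submitted, so no completion can ever arrive. That is the
    // deadlock the review warns about.
    assert(in_flight > 0);
    --in_flight;
    ++free_queue_slots;
}

// The suggested fix: submit deferred SQEs before the blocking wait.
void poll_uring_while_submission_queue_full_() {
    flush_pending_reads();
    while (submission_queue_full()) {
        wait_for_one_completion();
    }
}
```

Without the flush_pending_reads() call at the top, the first wait_for_one_completion() would block with nothing in flight.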

Comment on lines +125 to 132
// Helper to prepare SQE without submitting
void prepare_read_sqe_(
std::span<std::byte> buffer, chunk_offset_t chunk_and_offset,
void *uring_data, enum erased_connected_operation::io_priority prio);

// Prepare SQE for read operation (without submitting)
void prepare_read_sqe_(
struct io_uring_sqe *sqe, std::span<std::byte> buffer,

Copilot AI Dec 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are two overloaded declarations of prepare_read_sqe_() with very similar but different purposes. The first one (lines 126-128) takes a buffer and internally gets an SQE, while the second one (lines 131-134) takes an already-allocated SQE. The comment on line 125 says "Helper to prepare SQE without submitting" and the comment on line 130 says "Prepare SQE for read operation (without submitting)". These comments and similar signatures make it unclear which function to use in different contexts. Consider renaming one of them or improving the documentation to clearly distinguish their purposes.

Comment on lines +542 to +549
    any_sqes_added = true;
    ++pending_sqes_since_submit_;

    // Submit if batch is full
    if (pending_sqes_since_submit_ >= BATCH_SIZE) {
        MONAD_ASYNC_IO_URING_RETRYABLE(io_uring_submit(other_ring));
        pending_sqes_since_submit_ = 0;
    }

Copilot AI Dec 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the dequeue_concurrent_read_ios_pending lambda, the pending_sqes_since_submit_ counter is incremented and checked for batching. However, this counter tracks SQEs across different code paths (deferred submission via submit_request_deferred_() and concurrent read queueing). If both code paths are active simultaneously, the counter could reach BATCH_SIZE and trigger a flush even though the SQEs might be on different rings (other_ring vs wr_ring) or different submission contexts. This could lead to premature flushing or incorrect counter state. Consider having separate counters for different submission contexts or ensuring the counter is ring-specific.


io_uring_sqe_set_data(sqe, uring_data);
MONAD_ASYNC_IO_URING_RETRYABLE(io_uring_submit(wr_ring));
pending_sqes_since_submit_ = 0;

Copilot AI Dec 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The pending_sqes_since_submit_ counter is reset to 0 after write operations (line 481), but this counter is intended to track read operations that use deferred submission. Write operations don't use the deferred submission mechanism, so resetting this counter after a write could mask pending read SQEs that haven't been submitted yet. If a deferred read is prepared, then a write occurs, the counter resets to 0, and the read SQE remains unsubmitted indefinitely until another operation flushes it. Consider using separate counters for read and write operations, or only resetting the counter after read submissions.

Suggested change: delete the line pending_sqes_since_submit_ = 0;

Comment on lines +256 to +258
// Defer submission to batch reads together
async_read(
aux, std::move(reads_to_initiate[idx].front()), true);

Copilot AI Dec 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment says "Defer submission to batch reads together" which correctly explains the intent. However, there's no corresponding call to flush_pending_reads() after the batch of reads is initiated. In traverse operations, after all reads are queued with defer_submit=true, the pending SQEs should be flushed before waiting for completions. The current code relies on the implicit flush that happens in poll_uring_() when blocking, but this isn't explicitly documented here and could be missed if the control flow changes.
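The review's point is that the traverse path should flush explicitly after queueing the batch rather than relying on the implicit flush in poll_uring_(). A small simulation makes the contrast concrete; the io_context type and its members are illustrative stand-ins for the PR's AsyncIO object, not its real interface.

```cpp
#include <cassert>

// Toy I/O context: deferred reads accumulate, eager reads submit one at
// a time, and a flush turns the accumulated batch into a single submit.
struct io_context {
    int deferred = 0;
    int submit_calls = 0;

    void async_read(bool defer_submit) {
        if (defer_submit) {
            ++deferred;        // prepared, but no syscall yet
        } else {
            ++submit_calls;    // eager path: one syscall per read
        }
    }

    void flush_pending_reads() {
        if (deferred > 0) {
            ++submit_calls;    // one syscall for the whole batch
            deferred = 0;
        }
    }
};

// Suggested traverse shape: queue every read deferred, then flush
// explicitly so the batch is submitted before waiting on completions.
void traverse_initiate_reads(io_context &io, int n_reads) {
    for (int i = 0; i < n_reads; ++i) {
        io.async_read(/*defer_submit=*/true);
    }
    io.flush_pending_reads();
}
```

The explicit flush at the end makes the control flow robust even if the subsequent polling code changes.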

@jhunsaker jhunsaker force-pushed the max/batched-io-submission branch from 76da738 to 6837c4f on December 10, 2025 at 19:42

Copilot AI left a comment


Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

@jhunsaker jhunsaker force-pushed the max/batched-io-submission branch from 6837c4f to fd69612 on December 11, 2025 at 23:01
