Implement batched I/O submission optimization for snapshot dump#1969

Open

maxkozlovsky wants to merge 1 commit into main from max/batched-io-submission

Conversation

@maxkozlovsky
Contributor

Add deferred I/O submission with batching to reduce syscall overhead during snapshot dumps. Instead of calling io_uring_submit() for every read operation, accumulate up to 64 SQEs before submitting them in a single syscall.

Key changes:

  • Add pending_sqes_since_submit_ counter and BATCH_SIZE (64) constant
  • Implement submit_request_deferred_() for deferred SQE preparation
  • Add flush_pending_reads() to manually flush pending operations
  • Update traverse layer to use deferred submission (defer_submit=true)
  • Add batching to dequeue_concurrent_read_ios_pending()
  • Smart flush on blocking wait in poll_uring_()

The optimization reduces io_uring_submit() syscalls by ~64x, significantly improving snapshot dump performance through reduced kernel context switching and better batched I/O processing.
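The accumulate-then-flush flow described above can be sketched with the io_uring calls replaced by a counter, so the batching arithmetic is visible in isolation. The names BATCH_SIZE, pending_sqes_since_submit_, submit_request_deferred_(), and flush_pending_reads() follow the PR; the submit_calls field is a stand-in for io_uring_submit(), not the real liburing API.

```cpp
#include <cassert>
#include <cstddef>

// Sketch of the deferred-submission scheme: prepare SQEs one at a time,
// but only pay for a submit syscall once a full batch has accumulated
// (or when a flush is forced, e.g. before a blocking wait).
struct batched_submitter {
    static constexpr std::size_t BATCH_SIZE = 64;
    std::size_t pending_sqes_since_submit_ = 0;
    std::size_t submit_calls = 0; // stand-in for io_uring_submit() invocations

    // Deferred path: count the prepared SQE, auto-flush when the batch fills.
    void submit_request_deferred_() {
        ++pending_sqes_since_submit_;
        if (pending_sqes_since_submit_ >= BATCH_SIZE) {
            flush_pending_reads();
        }
    }

    // Manual flush, also invoked before any blocking wait in poll_uring_().
    void flush_pending_reads() {
        if (pending_sqes_since_submit_ > 0) {
            ++submit_calls; // one syscall covers the whole batch
            pending_sqes_since_submit_ = 0;
        }
    }
};
```

With this shape, 128 deferred reads cost two submit calls instead of 128, which is where the ~64x reduction comes from.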

This code was generated using Claude Sonnet 4.5

@maxkozlovsky maxkozlovsky force-pushed the max/batched-io-submission branch from b263727 to 3fdea31 on December 10, 2025 at 14:56
Base automatically changed from max/queue-sizes to main December 10, 2025 19:27
@maxkozlovsky maxkozlovsky force-pushed the max/batched-io-submission branch from 3fdea31 to 76da738 on December 10, 2025 at 19:29
@maxkozlovsky maxkozlovsky marked this pull request as ready for review December 10, 2025 19:29
Copilot AI review requested due to automatic review settings December 10, 2025 19:29

Copilot AI left a comment


Pull request overview

This PR implements a batched I/O submission optimization for snapshot dumps to reduce syscall overhead. Instead of calling io_uring_submit() for every read operation, the changes accumulate up to 64 SQEs (Submission Queue Entries) before submitting them in a single syscall, aiming to reduce the number of syscalls by ~64x.

Key changes:

  • Added deferred submission mechanism with defer_submit parameter throughout the I/O stack
  • Introduced pending_sqes_since_submit_ counter and BATCH_SIZE constant (64) for batching
  • Implemented auto-flush when batch is full and smart flush before blocking waits
  • Updated traverse layer to use deferred submission for better batching during tree traversal

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 10 comments.

Summary per file:

  • category/async/io.hpp: Added deferred submission API with defer_submit parameter, flush_pending_reads() method, and pending_sqes_since_submit_ counter tracking
  • category/async/io.cpp: Implemented batching logic with submit_request_deferred_(), auto-flush at BATCH_SIZE, and smart flush in poll_uring_() before blocking
  • category/async/io_senders.hpp: Added defer_submit_ member to read_single_buffer_sender to propagate the deferred submission flag through the sender/receiver chain
  • category/mpt/trie.hpp: Updated read_short_update_sender and async_read() to accept and forward the defer_submit parameter
  • category/mpt/traverse.hpp: Modified traverse operations to use defer_submit=true so reads are batched during tree traversal
  • category/core/io/ring.hpp: Removed the unused must_call_uring_submit() method, previously used by now-deleted SQPOLL/IOPOLL checking code

Critical Issues Found:

  1. Counter reset on write operations (lines 423, 481 in io.cpp): The pending_sqes_since_submit_ counter tracks deferred read submissions but is reset to 0 after write operations. This can leave deferred read SQEs unsubmitted indefinitely if writes and reads are interleaved.

  2. Missing flush in submission queue full handler (line 355 in io.hpp): poll_uring_while_submission_queue_full_() is called before deferred submission but doesn't flush pending SQEs, potentially causing a deadlock where the code waits for completions but hasn't submitted all pending operations.

  3. Concurrent read limit bypasses defer_submit (lines 347-352 in io.hpp): When the concurrent read limit is reached, operations are queued in concurrent_read_ios_pending_ and the defer_submit flag is lost, causing inconsistent batching behavior.

  4. Counter mixing across contexts (lines 542-549 in io.cpp): The pending_sqes_since_submit_ counter is shared between different submission paths without considering that operations might target different rings.



Comment on lines +573 to +577
    // Submit remaining SQEs after exiting loop if any were processed
    if (any_sqes_added && pending_sqes_since_submit_ > 0) {
        MONAD_ASYNC_IO_URING_RETRYABLE(io_uring_submit(other_ring));
        pending_sqes_since_submit_ = 0;
    }

Copilot AI Dec 10, 2025


After the while loop exits in dequeue_concurrent_read_ios_pending, there's a check to submit remaining SQEs if any were processed (lines 573-577). However, this flush only happens if any_sqes_added is true. If the loop doesn't process any SQEs because the conditions aren't met, but there are pending SQEs from previous calls to submit_request_deferred_(), those pending SQEs won't be flushed here. This is correct behavior, but it means the pending SQEs might remain unsubmitted for an extended period. Consider adding a comment explaining that pending SQEs from other sources are intentionally not flushed here.

};

-static_assert(sizeof(read_single_buffer_sender) == 40);
+static_assert(sizeof(read_single_buffer_sender) == 48);

Copilot AI Dec 10, 2025


The sizeof assertion has been updated from 40 to 48 bytes to account for the new defer_submit_ boolean member. However, adding a single bool (1 byte) to a 40-byte structure shouldn't increase the size to 48 bytes unless there's significant padding. With 8-byte alignment, the actual increase should be 8 bytes (40 -> 48), which suggests the bool is being padded to 8 bytes due to its placement after the buffer_ member. Consider reordering members to minimize padding, such as placing defer_submit_ before buffer_ or grouping it with other small members if any exist, to potentially reduce the structure size.
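The padding effect the review describes can be demonstrated with hypothetical layouts (these are not the real read_single_buffer_sender members, whose exact fields the PR does not show here; sizes assume a typical LP64 platform with 8-byte pointers). Appending a bool to a struct whose 40 bytes are fully occupied pushes sizeof to 48, while grouping the bool next to an existing sub-8-byte member can keep it at 40.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical 40-byte sender: five 8-byte members, no tail padding.
struct sender_before {
    void *a, *b, *c, *d, *e;
};
static_assert(sizeof(sender_before) == 40, "assumes LP64 layout");

// Appending a 1-byte bool forces 7 bytes of tail padding to keep
// 8-byte alignment: 41 rounds up to 48.
struct sender_appended {
    void *a, *b, *c, *d, *e;
    bool defer_submit_;
};
static_assert(sizeof(sender_appended) == 48, "assumes LP64 layout");

// If the struct instead had a 4-byte member, placing the bool right
// after it reuses existing padding and sizeof stays at 40.
struct sender_grouped {
    void *a, *b, *c, *d;     // 32 bytes
    std::uint32_t n;         // offset 32..35
    bool defer_submit_;      // offset 36, tail-padded to 40
};
static_assert(sizeof(sender_grouped) == 40, "assumes LP64 layout");
```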


prepare_read_sqe_(sqe, buffers, chunk_and_offset, uring_data, prio);
MONAD_ASYNC_IO_URING_RETRYABLE(io_uring_submit(&uring_.get_ring()));
pending_sqes_since_submit_ = 0;

Copilot AI Dec 10, 2025


Similar to the issue with write operations at line 481, this write operation also resets pending_sqes_since_submit_ to 0 (line 423). The counter is meant to track deferred read submissions, but resetting it after a write submission could leave deferred read SQEs unsubmitted. This is especially problematic if writes and reads are interleaved. Consider only resetting the counter after read operations that actually submit to the same ring where pending reads are queued.

Suggested change: delete the line pending_sqes_since_submit_ = 0;
Comment on lines +360 to +361
// Defer submission to batch reads together
async_read(sender.aux, std::move(receiver), true);

Copilot AI Dec 10, 2025


Same issue as lines 256-258: the comment explains deferred submission for batching, but there's no explicit flush mechanism documented or implemented nearby. The code relies on implicit flushing during polling operations.

return size_t(-1); // we never complete immediately
}

poll_uring_while_submission_queue_full_();

Copilot AI Dec 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The poll_uring_while_submission_queue_full_() function is called before submit_request_deferred_() to ensure submission queue space is available. However, if there are pending deferred SQEs from previous calls, and the submission queue becomes full, this function may poll for completions but won't flush the pending SQEs. This could lead to a deadlock-like situation where we're waiting for completions but haven't submitted all the pending operations. Consider flushing pending SQEs in poll_uring_while_submission_queue_full_() before polling.
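The hazard and the suggested fix can be captured in a minimal simulation: a blocking wait can only make progress on work that was actually submitted, so deferred SQEs must be flushed before the wait. The state variables and helper functions below are illustrative stand-ins for the PR's io_uring plumbing, not its real API.

```cpp
#include <cassert>

int pending_sqes_since_submit_ = 3; // deferred reads, not yet submitted
int in_flight = 0;                  // SQEs the simulated kernel is processing
int free_queue_slots = 0;           // submission queue is currently full

void flush_pending_reads() {
    in_flight += pending_sqes_since_submit_; // one batched submit
    pending_sqes_since_submit_ = 0;
}

bool submission_queue_full() { return free_queue_slots == 0; }

void wait_for_one_completion() {
    // Blocking here with in_flight == 0 would hang forever: nothing has
    // been submitted, so no completion can ever arrive. That is the
    // deadlock the review warns about.
    assert(in_flight > 0);
    --in_flight;
    ++free_queue_slots;
}

// The suggested fix: submit deferred SQEs before the blocking wait.
void poll_uring_while_submission_queue_full_() {
    flush_pending_reads();
    while (submission_queue_full()) {
        wait_for_one_completion();
    }
}
```

Without the flush_pending_reads() call at the top, the first wait_for_one_completion() would block with nothing in flight.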

Comment on lines +125 to 132
// Helper to prepare SQE without submitting
void prepare_read_sqe_(
std::span<std::byte> buffer, chunk_offset_t chunk_and_offset,
void *uring_data, enum erased_connected_operation::io_priority prio);

// Prepare SQE for read operation (without submitting)
void prepare_read_sqe_(
struct io_uring_sqe *sqe, std::span<std::byte> buffer,

Copilot AI Dec 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are two overloaded declarations of prepare_read_sqe_() with very similar but different purposes. The first one (lines 126-128) takes a buffer and internally gets an SQE, while the second one (lines 131-134) takes an already-allocated SQE. The comment on line 125 says "Helper to prepare SQE without submitting" and the comment on line 130 says "Prepare SQE for read operation (without submitting)". These comments and similar signatures make it unclear which function to use in different contexts. Consider renaming one of them or improving the documentation to clearly distinguish their purposes.

Comment on lines +542 to +549
    any_sqes_added = true;
    ++pending_sqes_since_submit_;

    // Submit if batch is full
    if (pending_sqes_since_submit_ >= BATCH_SIZE) {
        MONAD_ASYNC_IO_URING_RETRYABLE(io_uring_submit(other_ring));
        pending_sqes_since_submit_ = 0;
    }

Copilot AI Dec 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the dequeue_concurrent_read_ios_pending lambda, the pending_sqes_since_submit_ counter is incremented and checked for batching. However, this counter tracks SQEs across different code paths (deferred submission via submit_request_deferred_() and concurrent read queueing). If both code paths are active simultaneously, the counter could reach BATCH_SIZE and trigger a flush even though the SQEs might be on different rings (other_ring vs wr_ring) or different submission contexts. This could lead to premature flushing or incorrect counter state. Consider having separate counters for different submission contexts or ensuring the counter is ring-specific.


io_uring_sqe_set_data(sqe, uring_data);
MONAD_ASYNC_IO_URING_RETRYABLE(io_uring_submit(wr_ring));
pending_sqes_since_submit_ = 0;

Copilot AI Dec 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The pending_sqes_since_submit_ counter is reset to 0 after write operations (line 481), but this counter is intended to track read operations that use deferred submission. Write operations don't use the deferred submission mechanism, so resetting this counter after a write could mask pending read SQEs that haven't been submitted yet. If a deferred read is prepared, then a write occurs, the counter resets to 0, and the read SQE remains unsubmitted indefinitely until another operation flushes it. Consider using separate counters for read and write operations, or only resetting the counter after read submissions.

Suggested change: delete the line pending_sqes_since_submit_ = 0;

Comment on lines +256 to +258
// Defer submission to batch reads together
async_read(
aux, std::move(reads_to_initiate[idx].front()), true);

Copilot AI Dec 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment says "Defer submission to batch reads together" which correctly explains the intent. However, there's no corresponding call to flush_pending_reads() after the batch of reads is initiated. In traverse operations, after all reads are queued with defer_submit=true, the pending SQEs should be flushed before waiting for completions. The current code relies on the implicit flush that happens in poll_uring_() when blocking, but this isn't explicitly documented here and could be missed if the control flow changes.
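The review's point is that the traverse path should flush explicitly after queueing the batch rather than relying on the implicit flush in poll_uring_(). A small simulation makes the contrast concrete; the io_context type and its members are illustrative stand-ins for the PR's AsyncIO object, not its real interface.

```cpp
#include <cassert>

// Toy I/O context: deferred reads accumulate, eager reads submit one at
// a time, and a flush turns the accumulated batch into a single submit.
struct io_context {
    int deferred = 0;
    int submit_calls = 0;

    void async_read(bool defer_submit) {
        if (defer_submit) {
            ++deferred;        // prepared, but no syscall yet
        } else {
            ++submit_calls;    // eager path: one syscall per read
        }
    }

    void flush_pending_reads() {
        if (deferred > 0) {
            ++submit_calls;    // one syscall for the whole batch
            deferred = 0;
        }
    }
};

// Suggested traverse shape: queue every read deferred, then flush
// explicitly so the batch is submitted before waiting on completions.
void traverse_initiate_reads(io_context &io, int n_reads) {
    for (int i = 0; i < n_reads; ++i) {
        io.async_read(/*defer_submit=*/true);
    }
    io.flush_pending_reads();
}
```

The explicit flush at the end makes the control flow robust even if the subsequent polling code changes.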

@jhunsaker jhunsaker force-pushed the max/batched-io-submission branch from 76da738 to 6837c4f on December 10, 2025 at 19:42

Copilot AI left a comment


Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

@jhunsaker jhunsaker force-pushed the max/batched-io-submission branch from 6837c4f to fd69612 on December 11, 2025 at 23:01
