MDEV-32570: Fragment ROW replication events larger than slave_max_allowed_packet #4047


Draft: wants to merge 10 commits into base: main

Conversation

bnestere
Contributor

@bnestere bnestere commented May 16, 2025

This PR is currently a draft.

This PR solves two problems:

  1. Rows log events cannot be transmitted to the slave if their
    size exceeds slave_max_allowed_packet (max 1GB at the time of
    writing this patch, i.e. MariaDB 12.1)
  2. Rows log events cannot be binlogged if they are larger than
    4GB because the common binlog event header field event_len is
    only 32 bits wide.

This PR adds support for fragmenting large Rows_log_events
through a new event type, Partial_rows_log_event. When any given
instantiation of a Rows_log_event (e.g. Write_rows_log_event) is
too large to be sent to a replica (i.e. larger than
slave_max_allowed_packet, as configured on the replica), the rows
event is fragmented into sub-events (Partial_rows_log_events),
each no larger than slave_max_allowed_packet, so the event can be
transmitted to the replica. The replica then takes the content of
each of these Partial_rows_log_events and joins them back together
into the large Rows_log_event, which is executed as normal.
Partial_rows_log_events are written to the binary log sequentially,
and the replica assembles the fragments in the order they were
binlogged.
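
As a rough illustration of the splitting described above, here is a
hedged sketch; write_fragment() and the fragment metadata layout are
hypothetical stand-ins, not this PR's actual API:

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdio>

// Hypothetical stand-in for binlogging one Partial_rows_log_event that
// carries fragment sequencing metadata plus a slice of the row data.
static void write_fragment(size_t frag_no, size_t total_fragments,
                           const unsigned char *data, size_t len)
{
  printf("fragment %zu/%zu: %zu bytes\n", frag_no + 1, total_fragments, len);
}

// Split a large Rows_log_event payload into Partial_rows_log_events,
// each no larger than slave_max_allowed_packet.
static size_t binlog_fragmented(const unsigned char *rows_data,
                                size_t total_len,
                                size_t slave_max_allowed_packet)
{
  size_t n_fragments=
    (total_len + slave_max_allowed_packet - 1) / slave_max_allowed_packet;
  size_t offset= 0;
  for (size_t i= 0; i < n_fragments; i++)
  {
    size_t chunk= std::min(slave_max_allowed_packet, total_len - offset);
    write_fragment(i, n_fragments, rows_data + offset, chunk);
    offset+= chunk;
  }
  return n_fragments;
}
```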

Remaining things to be done to remove draft status of the PR:

  1. Implement error handling
  2. Implement Partial_rows_log_event::pack_info()
  3. Extend MTR testing

This PR is organized as follows:

  • Commits 1-3 add code preparations to make the actual
    feature commits easier to read
  • Commit 4 adds the Partial_rows_log_event type and the server logic
    to support fragmenting and re-assembling/applying large
    Rows_log_events.
  • Commit 5 adds client (mysqlbinlog) logic to support output and
    replay of Partial_rows_log_events through piping to the
    mariadb client
  • Commit 6 adds the MTR test (still working on this) for the feature
  • Commit 7 adjusts existing MTR tests to pass

Note that the git commit messages provide much more specific
details on the implementation.

bnestere added 3 commits May 16, 2025 15:48
The functions that read row log events from buffers used 32-bit
numeric types to hold the length of the buffer. MDEV-32570 adds
support for row events larger than 32 bits can represent, so this
patch changes the type of the length variable to size_t, allowing
larger row events to be created from raw memory.
Preparation for MDEV-32570. When fragmenting a large row event into
multiple smaller fragment events, each fragment event will have its own
checksum attached, thereby negating the need to also store the checksum
of the overall large event.

The existing code assumes all events will always have checksums, but
this won't be true for the rows events that are re-assembled on the
replicas. This patch prepares for that by splitting the logic that
reads in and creates Log_event objects into two pieces: one that
handles the checksum validation, and one that reads the raw event
data (without the checksum) and creates the object.

All existing code that uses the checksum-assuming version of the
event reader is unchanged. MDEV-32570 will be the only case that
bypasses the checksum logic and creates its rows log events directly
from memory without validating checksums (as the checksums will have
already been validated on each individual fragment event).
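
To illustrate the split, here is a hedged sketch; the function names
(validate_and_strip_checksum(), create_event_from_buf()) are
illustrative stand-ins for the server's actual reader code:

```cpp
#include <cstddef>

struct Log_event_sketch { /* stand-in for the real Log_event */ };

// Step 1: verify and strip the trailing checksum. All existing callers
// go through this path.
static bool validate_and_strip_checksum(const unsigned char *buf,
                                        size_t *len,
                                        unsigned int checksum_alg)
{
  (void) buf; (void) checksum_alg;
  if (*len < 4)
    return false;
  *len-= 4;          // drop the 4-byte CRC32 suffix after verifying it
  return true;
}

// Step 2: build the event object from the already-verified,
// checksum-free payload. The MDEV-32570 re-assembly path calls only
// this step, because each fragment's checksum was validated when the
// fragment itself was read.
static Log_event_sketch *create_event_from_buf(const unsigned char *buf,
                                               size_t len)
{
  (void) buf; (void) len;
  return new Log_event_sketch();
}

static Log_event_sketch *read_log_event_sketch(const unsigned char *buf,
                                               size_t len,
                                               unsigned int checksum_alg)
{
  if (!validate_and_strip_checksum(buf, &len, checksum_alg))
    return nullptr;                        // checksum mismatch or truncated
  return create_event_from_buf(buf, len);
}
```
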
To prepare for MDEV-32570, Rows_log_event::write_data_body() is
split into two functions:
 1. write_data_body_metadata(), which writes the context of the rows
    data (i.e. width, cols, and cols_ai); this is only written with
    the first event fragment.
 2. write_data_body_rows(), which allows the writing of the rows data
    to be fragmented by parameterizing it to start at a given offset
    and write only a given length. This lets each row fragment (for
    MDEV-32570) contain a chunk of the rows data.
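
A minimal sketch of the shape of this split, with simplified stand-in
types (Writer_sketch is not the server's actual writer class):

```cpp
#include <cstddef>
#include <cstdio>

// Stand-in for the server's binlog event writer.
struct Writer_sketch
{
  bool write(const unsigned char *data, size_t len)
  {
    return fwrite(data, 1, len, stdout) == len;
  }
};

struct Rows_event_sketch
{
  const unsigned char *rows_buf;   // the full row data
  size_t rows_len;

  // Written once, with the first fragment only: width, cols, cols_ai.
  bool write_data_body_metadata(Writer_sketch *w);

  // Written per fragment: the [offset, offset + len) slice of the row
  // data, so each fragment carries its own chunk.
  bool write_data_body_rows(Writer_sketch *w, size_t offset, size_t len)
  {
    return w->write(rows_buf + offset, len);
  }
};
```
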
@bnestere bnestere requested a review from knielsen May 16, 2025 21:59
@bnestere bnestere added MariaDB Corporation Replication Patches involved in replication labels May 16, 2025
@bnestere bnestere force-pushed the main-MDEV-32570 branch 2 times, most recently from 3ba40b0 to d4c9828 on May 19, 2025 21:37
bnestere added 5 commits May 21, 2025 14:34
…e_max_allowed_packet

This patch solves two problems:
  1. Rows log events cannot be transmitted to the slave if their
     size exceeds slave_max_allowed_packet (max 1GB at the time of
     writing this patch, i.e. MariaDB 12.1)
  2. Rows log events cannot be binlogged if they are larger than
     4GB because the common binlog event header field event_len is
     only 32 bits wide.

This patch adds support for fragmenting large Rows_log_events
through a new event type, Partial_rows_log_event. When any given
instantiation of a Rows_log_event (e.g. Write_rows_log_event) is
too large to be sent to a replica (i.e. larger than
slave_max_allowed_packet, as configured on the replica), the rows
event is fragmented into sub-events (Partial_rows_log_events),
each no larger than slave_max_allowed_packet, so the event can be
transmitted to the replica. The replica then takes the content of
each of these Partial_rows_log_events and joins them back together
into the large Rows_log_event, which is executed as normal.
Partial_rows_log_events are written to the binary log sequentially,
and the replica assembles the fragments in the order they were
binlogged.

The re-assembly and execution of the original Rows_log_event on the
replica happens in Partial_rows_log_event::do_apply_event(). The rgi
is extended with a memory buffer that will hold all data for the
original Rows_log_event. This buffer is allocated dynamically upon
seeing the first Partial_rows_log_event of a group, where its size
is calculated by multiplying the total number of fragments by the
size of the first fragment (note this will likely overestimate the
overall amount of memory needed). As each Partial_rows_log_event is
ingested, its Rows_log_event content is appended to this memory
buffer. Once the last fragment has added its content, a new
Rows_log_event is created using that buffer, and executed.
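
A hedged sketch of that re-assembly flow; the Fragment_assembly type
and its members are illustrative (the real buffer lives on the rgi):

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

struct Fragment_assembly
{
  std::vector<unsigned char> buf;   // held on the rgi across fragments
  size_t used= 0;

  void ingest(const unsigned char *frag_data, size_t frag_len,
              size_t frag_no, size_t total_fragments)
  {
    if (frag_no == 0)
      // First fragment of the group: reserve total_fragments * the
      // first fragment's size. Assuming later fragments are no larger
      // than the first, this can overestimate by up to one fragment
      // length (see the TODO below).
      buf.resize(total_fragments * frag_len);

    // Append this fragment's Rows_log_event content.
    std::memcpy(buf.data() + used, frag_data, frag_len);
    used+= frag_len;

    if (frag_no == total_fragments - 1)
    {
      // Last fragment: construct the original Rows_log_event from
      // buf[0 .. used) and apply it as a normal row event, e.g.
      //   Rows_log_event *ev= create_event_from_buf(buf.data(), used);
      //   ev->apply_event(rgi);
    }
  }
};
```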

Note this commit only adds the server logic for fragmenting and
assembling events; the client logic (mysqlbinlog) comes in the next
commit.

Remaining TODO:
  1. Extend MTR testing
  2. Refactor mysqlbinlog's existing usage of Table_map_log_event
     persistence
  3. Possibly add a new field to Partial_rows_log_event providing the
     real full length of the underlying Rows_log_event, so the
     pre-allocated buffer on the slave side is exactly as long as
     needed. Today the buffer size is estimated as the length of the
     first Partial_rows_log_event of a group multiplied by the total
     number of fragments in the group, which can overestimate by up
     to one fragment length.

Alternative designs considered were:
  1. Alternative 1: Change the master-slave communication protocol
     such that the master would send events in chunks of size
     slave_max_allowed_packet. Though this is still a valid idea,
     and would solve the first problem described in this commit
     message, this would still leave the limitation that
     Rows_log_events could not exceed 4GB.
  2. Alternative 2: Create a generic “Container_log_event” with the
     intention to embed various other types of event data for
     various purposes, with flags that describe the purpose of a
     given container. This seemed overboard, as there is already a
     generic Log_event framework that provides the necessary
     abstractions to fragment/reassemble events without adding in
     extra abstractions.
  3. Alternative 3: Add a flag to Rows_log_event with semantics to
     overwrite/correct the event_len field of the common event
     header to use a 64-bit field stored in the data_header of the
     Rows_log_event; and also do alternative 1, so the master would
     send the large (> 4GB) rows event in chunks. This approach
     would add too much complexity (changing both the binlogging
     and the transport layer) and would introduce inconsistency into
     the event definition (event_len and next_event_position would
     no longer have consistent meanings).

Reviewed By:
============
<TODO>
…e_max_allowed_packet

This patch extends mysqlbinlog with logic to support the output and
replay of the new Partial_rows_log_events added in the previous
commit. Since the assembly and execution of the Rows_log_event
happens in Partial_rows_log_event::do_apply_event(), there isn't
much logic required beyond outputting each Partial_rows_log_event in
base64, with two exceptions.

In the original mysqlbinlog code, all row events fit within a single
BINLOG base64 statement: the Table_map_log_event sets up the tables
to use, the Row events open the tables, and after the BINLOG
statement is run, the tables are closed and the rgi is destroyed.
No matter how many Row events a transaction contains, they are all
put into the same BINLOG base64 statement. For the new
Partial_rows_log_event, however, each fragment is split into its own
BINLOG base64 statement (to respect the server's configured
max_allowed_packet). The existing logic would close the tables and
destroy the replay context after each BINLOG statement (i.e. each
fragment). This means that 1) Partial_rows_log_events would be
unable to assemble Rows_log_events because the rgi is destroyed
between events, and 2) multiple re-assembled Rows_log_events could
not be executed because the context set up by the
Table_map_log_event is cleared after the first Rows_log_event
executes.

To fix the first problem, where we couldn’t re-assemble
Rows_log_events because the rgi would disappear between
Partial_rows_log_events, the server will not destroy the rgi when
ingesting BINLOG statements containing Partial_rows_log_events that
have not yet assembled their Rows_log_event.
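
A hedged sketch of that server-side change; the structure and the
has_pending_fragments() predicate are illustrative, not the actual
implementation:

```cpp
#include <cstddef>

// Stand-in for the replay context kept across BINLOG statements.
struct Relay_group_info_sketch
{
  size_t pending_fragments= 0;
  bool has_pending_fragments() const { return pending_fragments > 0; }
};

static void finish_binlog_statement(Relay_group_info_sketch *&rgi)
{
  if (rgi && !rgi->has_pending_fragments())
  {
    delete rgi;        // original behaviour: tear down the replay context
    rgi= nullptr;
  }
  // Otherwise leave the rgi intact so the next fragment's BINLOG
  // statement can keep appending to the partially assembled
  // Rows_log_event.
}
```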

To fix the second problem, where the context set up by the
Table_map_log_event is cleared after the first assembled
Rows_log_event executes, mysqlbinlog caches the Table_map_log_event
and re-writes it, for each fragmented Rows_log_event, at the start
of the last fragment's BINLOG statement. In effect, this re-executes
the Table_map_log_event for each assembled Rows_log_event.
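
A hedged sketch of the mysqlbinlog side; the type and member names
(cached_table_map_base64, emit_binlog_stmt()) are illustrative:

```cpp
#include <cstdio>
#include <string>

struct Binlog_print_ctx_sketch
{
  // Remembered when the Table_map_log_event is printed.
  std::string cached_table_map_base64;

  void emit_binlog_stmt(const std::string &b64)
  {
    printf("BINLOG '\n%s\n'/*!*/;\n", b64.c_str());
  }

  // Each fragment goes into its own BINLOG statement so the replaying
  // server's max_allowed_packet is respected. The cached Table_map is
  // re-written at the start of the last fragment's statement so the
  // re-assembled Rows_log_event finds its table context, which the
  // server clears after each executed row event.
  void print_partial_rows_fragment(const std::string &frag_base64,
                                   bool is_last_fragment)
  {
    if (is_last_fragment && !cached_table_map_base64.empty())
      emit_binlog_stmt(cached_table_map_base64 + "\n" + frag_base64);
    else
      emit_binlog_stmt(frag_base64);
  }
};
```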

Reviewed By:
============
<TODO>
TODO Finish test cases (see rpl_fragment_row_event.test for specifics
on what remains)
Labels
MariaDB Corporation Replication Patches involved in replication