MDEV-32570: Fragment ROW replication events larger than slave_max_allowed_packet #4047
Draft: bnestere wants to merge 10 commits into main from main-MDEV-32570
Conversation
The functions that read row log events from buffers used 32-bit numeric types to hold the buffer length. MDEV-32570 will add support for row events larger than 32 bits can represent, so this patch changes the type of the length variable to size_t, allowing larger row events to be created from raw memory.
Preparation for MDEV-32570. When fragmenting a large row event into multiple smaller fragment events, each fragment event will have its own checksum attached, negating the need to also store the checksum of the overall large event. The existing code assumes all events will always have checksums, but this won't be true for the rows events that are re-assembled on the replicas.

This patch prepares for that by splitting the logic which reads in and creates Log_event objects into two pieces: one which handles the checksum validation, and one which reads the raw event data (without the checksum) and creates the object. All existing code that uses the checksum-assuming version of the event reader is unchanged. MDEV-32570 will be the only case which bypasses the checksum logic, creating its rows log events directly from memory without validating checksums (as the checksums will have already been validated by each individual fragment event).
To prepare for MDEV-32570, Rows_log_event::write_data_body() is split into two functions:
1. write_data_body_metadata(), which writes the context of the rows data (i.e. width, cols, and cols_ai) and will only be written for the first event fragment.
2. write_data_body_rows(), which allows the writing of the rows data to be fragmented by parameterizing it to start at a given offset and write only a given length. This lets each row fragment (for MDEV-32570) contain a chunk of the rows data.
…e_max_allowed_packet

This patch solves two problems:
1. Rows log events cannot be transmitted to the slave if their size exceeds slave_max_packet_size (max 1GB at the time of writing this patch, i.e. MariaDB 12.1).
2. Rows log events cannot be binlogged if they are larger than 4GB because the common binlog event header field event_len is 32 bits.

This patch adds support for fragmenting large Rows_log_events through a new event type, Partial_rows_log_event. When any given instantiation of a Rows_log_event (e.g. Write_rows_log_event, etc.) is too large to be sent to a replica (i.e. larger than slave_max_allowed_packet, as configured on the replica), the rows event must be fragmented into sub-events (Partial_rows_log_events), each of size slave_max_allowed_packet, so the event can be transmitted to the replica. The replica will then take the content of each of these Partial_rows_log_events and join them together into a large Rows_log_event to be executed as normal. Partial_rows_log_events are written to the binary log sequentially, and the replica assembles the events in the order they are binlogged.

The re-assembly and execution of the original Rows_log_event on the replica happens in Partial_rows_log_event::do_apply_event(). The rgi is extended with a memory buffer that will hold all data for the original Rows_log_event. This buffer is allocated dynamically upon seeing the first Partial_rows_log_event of a group; its size is calculated by multiplying the total number of fragments by the size of the first fragment (note this will likely overestimate the amount of memory needed). As each Partial_rows_log_event is ingested, its Rows_log_event content is appended to this buffer. Once the last fragment has added its content, a new Rows_log_event is created from that buffer and executed.

Note this commit only adds the server logic for fragmenting and assembling events; the client logic (mysqlbinlog) is in the next commit.
Remaining TODO:
1. Extend MTR testing.
2. Refactor mysqlbinlog's existing usage of Table_map_log_event persistence.
3. Possibly add a new field to Partial_rows_log_event to provide the real full length of the underlying Rows_log_event, so the pre-allocated buffer on the slave side is exactly as long as needed (as opposed to now, which overestimates the buffer size by up to one fragment length, calculated via the length of the first Partial_rows_log_event of a group multiplied by the total number of fragments in the group).

Alternative designs considered were:
1. Change the master-slave communication protocol such that the master would send events in chunks of size slave_max_allowed_packet. Though this is still a valid idea, and would solve the first problem described in this commit message, it would still leave the limitation that Rows_log_events could not exceed 4GB.
2. Create a generic "Container_log_event" intended to embed various other types of event data for various purposes, with flags that describe the purpose of a given container. This seemed overboard, as the generic Log_event framework already provides the abstractions necessary to fragment/reassemble events without adding extra ones.
3. Add a flag to Rows_log_event with semantics to overwrite/correct the event_len field of the common event header with a 64-bit field stored in the data_header of the Rows_log_event, combined with alternative 1 so the master would send the large (> 4GB) rows event in chunks. This approach would add too much complexity (changing both the binlogging and transport layers) and would introduce inconsistency into the event definition (event_len and next_event_position would no longer have consistent meanings).

Reviewed By:
============
<TODO>
…e_max_allowed_packet

This patch extends mysqlbinlog with logic to support the output and replay of the new Partial_rows_log_events added in the previous commit. Generally speaking, as the assembly and execution of the Rows_log_event happens in Partial_rows_log_event::do_apply_event(), there isn't much logic required other than outputting each Partial_rows_log_event in base64, with two exceptions.

In the original mysqlbinlog code, all row events fit within a single BINLOG base64 statement: the Table_map_log_event sets up the tables to use, the Row Events open the tables, and after the BINLOG statement is run, the tables are closed and the rgi is destroyed. No matter how many Row Events there are within a transaction, they are all put into the same BINLOG base64 statement. However, for the new Partial_rows_log_event, each fragment is split into its own BINLOG base64 statement (to respect the server's configured max_packet_size). The existing logic would close the tables and destroy the replay context after each BINLOG statement (i.e. each fragment). This means that 1) Partial_rows_log_events would be unable to assemble Rows_log_events because the rgi is destroyed between events, and 2) multiple re-assembled Rows_log_events could not be executed because the context set up by the Table_map_log_event is cleared after the first Rows_log_event executes.

To fix the first problem, where Rows_log_events could not be re-assembled because the rgi would disappear between Partial_rows_log_events, the server will not destroy the rgi when ingesting BINLOG statements containing Partial_rows_log_events that have not yet assembled their Rows_log_event. To fix the second problem, where the context set up by the Table_map_log_event is cleared after the first assembled Rows_log_event executes, mysqlbinlog caches the Table_map_log_event and re-writes it at the start of the last fragment's BINLOG statement for each fragmented Rows_log_event. In effect, this re-executes the Table_map_log_event for each assembled Rows_log_event.

Reviewed By:
============
<TODO>
TODO: Finish test cases (see rpl_fragment_row_event.test for specifics on what remains).
This PR is currently a draft.
This PR solves two problems:
1. Rows log events cannot be transmitted to the slave if their size exceeds slave_max_packet_size (max 1GB at the time of writing this patch, i.e. MariaDB 12.1).
2. Rows log events cannot be binlogged if they are larger than 4GB because the common binlog event header field event_len is 32 bits.
This PR adds support for fragmenting large Rows_log_events
through a new event type, Partial_rows_log_event. When any given
instantiation of a Rows_log_event (e.g. Write_rows_log_event, etc)
is too large to be sent to a replica (i.e. larger than the value
slave_max_allowed_packet, as configured on a replica), then the rows
event must be fragmented into sub-events (i.e.
Partial_rows_log_events), each of size slave_max_allowed_packet, so
the event can be transmitted to the replica. The replica will then
take the content of each of these Partial_rows_log_events, and join
them together into a large Rows_log_event to be executed as normal.
Partial_rows_log_events are written to the binary log sequentially,
and the replica assembles the events in the order they are
binlogged.
Remaining things to be done to remove draft status of the PR:
This PR is organized as follows:
1. Preparatory refactoring commits, which make the
feature commits easier to read.
2. The main feature commit, which extends the server
to support fragmenting and re-assembling/applying large
Rows_log_events.
3. A mysqlbinlog commit, which supports the output and
replay of Partial_rows_log_events through piping to the
mariadb client.
Note that the git commit messages provide much more specific
details on the implementation.