Skip to content

Commit aff77ab

Browse files
author
Sven Sandberg
committed
BUG#36507020: relay_log_space_limit GTID and large txns may lead to no channel progress
Problem ------- When running GTID-based replication with the 'relay_log_space_limit' enabled, restarting the auto positioning protocol may cause a deadlock (infinite loop) in the replication. Analysis -------- When receiver is waiting for available relay log space, it may happen that the connected dump thread times out while sending data due to reaching the 'net_write_timeout'. After receiver wakes up, it writes a part of a trasaction into the relay log, notices that connection is down and reconnects to the source. Under a tight limit of the 'relay_log_space_limit', when only one large transaction fits into the relay log, recevier and coordinator enter a sequential mode of execution: - receiver waits for available storage - applier either releases the storage or, when reaching the state to waiting for a new event, tells receiver to schedule one more. If part of the next transaction entered the relay log after a large transaction, and applying this unfinished transaction takes more time than 'net_write_timeout', the receiver (waiting for available relay log space), applier (which cannot purge whilst being inside of transaction boundary) and dump thread (having restrictions on 'net_write_timeout'), enter an infinite loop of: - receiver writes a part of transaction to the relay log - receiver waits for available storage or notification to schedule one more event - applier starts applying an unfinished transaction - dump thread reaches the 'net_write_timeout' - applier finishes applying a part of a transaction and reaches the state of waiting for new events, notifies the reader to queue one more - receiver wakes up, auto positioning protocol restarts at beginning of the transaction - receiver writes a part of transaction to the relay log, but afterwards hits the limit of 'relay_log_space_limit' - ... Before this patch, the 'relay_log_space_limit' was not respected not only for transactions which size exceed the 'relay_log_space_limit', but also when replica could not purge previous logs. Solution -------- As a solution to the bug fix, we introduce the following changes: Receiver will respect the 'relay_log_space_limit' set by the user, unless a transaction received by the receiver cannot fit into the purged relay log. Before queueing the received transaction, receiver checks whether scheduling a full transaction is possible. If not, the receiver: - sets the flag indicating that recevier is waiting - rotates the relay log - waits until notified that relay log purge was executed and applier has purged all available relay logs Afterwards, receiver may queue a transaction, without checking the limit again. Before moving to the next file, the coordinator will check if receiver is waiting for available relay log space. If yes, the coordinator will forcibly purge applied logs including the current relay log file. To safely purge the current relay log file the coordinator must - synchronize all of its workers before moving to the next file - forcibly update group positions, which is necessary to allow current purging of the relay log - update the variable read by the receiver which contains relay log filename to which applier was moved Those operations are allowed because we know that receiver waits at a transaction boundary and rotates the relay log before waiting. Therefore, in this patch, we remove the existing logic coordinating the coordinator and the recevier which ignore the relay log space limit for one event at a time, until we are able to purge the relay log and reclaim space for the next event. We respect the available relay log space, unless this is no longer possible because a large transaction was received and even though all relay logs was removed, it still cannot fit into available limit. Note about behavior change -------------------------- Before this patch, the 'relay_log_space_limit' was not respected not only for transactions which size exceed the 'relay_log_space_limit', but also when replica could not purge previous logs. After this patch, the relay log can grow only up to max(relay_log_space_limit, max_transaction_size + size required to write relay log header, FDE, RLE, PGLE and 1 GTID for an unfinished transaction in case dump thread disconnects while replica is waiting for available space). Note about change compared to previous push ------------------------------------------- This patch addresses two problems in the previous, reverted push: 1. The previous patch made the applier segfault when the receiver waits for relay log space, in case replica_parallel_workers=0. This made e.g. rpl.rpl_stm_relay_ign_space fail when executed with --replica-parallel-workers=0. The problem was that, when the applier reached EOF while the receiver was waiting for relay log space, the applier would call m_rli->current_mts_mode->wait_for_workers_to_finish. But when replica_parallel_workers==0, current_mts_mode==nullptr. Fixed by calling m_rli->current_mts_mode->wait_for_workers_to_finish only when replica_parallel_workers>0. 2. The previous patch would time out in a number of test cases having partial transactions, in 8.0 but not in 8.4 and trunk. The problem was that the new member Relay_log_info::is_receiver_waiting_for_rl_space, of type std::atomic<bool>, was not explicitly initialized. In 8.0, a default-constructed std::atomic<bool> leaves the value uninitialized, so therefore the initial value could, by chance, evaluate to true, which broke the logic of this variable. This did not occur in 8.4 and trunk, because they use C++20, where default-constructed std::atomic objects are initialized to the default value of the underlying type. Fixed by explicitly initialzing the member to false. Change-Id: I6bcaa312e1159d9b16bb19cf292272e82440dc49
1 parent 2670019 commit aff77ab

23 files changed

+622
-422
lines changed

libbinlogevents/include/binlog_event.h

Lines changed: 14 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -365,6 +365,20 @@ enum Log_event_type {
365365
ENUM_END_EVENT /* end marker */
366366
};
367367

368+
/// @brief Event type helpers, enclosed in the structure
369+
struct Log_event_type_helper {
370+
/// @brief Helps to identify known GTID event - returns true
371+
/// for GTID_LOG_EVENT and GTID_TAGGED_LOG_EVENT
372+
inline static bool is_assigned_gtid_event(const Log_event_type &type) {
373+
return type == GTID_LOG_EVENT;
374+
}
375+
/// @brief Helps to identify any GTID event - returns true
376+
/// for GTID_LOG_EVENT, GTID_TAGGED_LOG_EVENT and ANONYMOUS_GTID_LOG_EVENT
377+
inline static bool is_any_gtid_event(const Log_event_type &type) {
378+
return is_assigned_gtid_event(type) || type == ANONYMOUS_GTID_LOG_EVENT;
379+
}
380+
};
381+
368382
/**
369383
Struct to pass basic information about a event: type, query, is it ignorable
370384
*/

libbinlogevents/include/control_events.h

Lines changed: 3 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -1172,6 +1172,9 @@ class Gtid_event : public Binary_log_event {
11721172
transaction_length = transaction_length_arg;
11731173
}
11741174

1175+
/// @return The value of the `transaction_length` field.
1176+
unsigned long long get_trx_length() const { return transaction_length; }
1177+
11751178
/** The version of the server where the transaction was originally executed */
11761179
uint32_t original_server_version;
11771180
/** The version of the immediate server */

mysql-test/include/assert_grep.inc

Lines changed: 44 additions & 12 deletions
Original file line numberDiff line numberDiff line change
@@ -9,7 +9,7 @@
99
# --let $assert_text= TEXT
1010
# --let $assert_file= FILE
1111
# --let $assert_select= REGEX
12-
# [--let $assert_match= REGEX | --let $assert_count= NUMBER]
12+
# [--let $assert_match= REGEX | --let $assert_count= NUMBER | --let $assert_count_condition= CONDITION]
1313
# [--let $assert_only_after= REGEX]
1414
# --source include/assert_grep.inc
1515
#
@@ -35,6 +35,14 @@
3535
# $assert_match, assert that there were exactly $assert_count
3636
# matching lines.
3737
#
38+
# $assert_count_condition
39+
# Instead of asserting that the selected lines match
40+
# $assert_match, assert that $assert_count_condition is met for
41+
# the number of matching lines. The 'assert_count_condition'
42+
# must be a valid Perl expression, which consists of an operator and
43+
# right side expression - an integer.
44+
# For example, 'assert_count_condition' may be <= 1.
45+
#
3846
# $assert_only_after
3947
# Reset all the lines matched and the counter when finding this pattern.
4048
# It is useful for searching things in the mysqld.err log file just
@@ -58,18 +66,27 @@ if ($assert_match == '')
5866
{
5967
if ($assert_count == '')
6068
{
61-
--die !!!ERROR IN TEST: you must set either $assert_match or $assert_count
62-
}
63-
}
64-
if ($assert_match != '')
65-
{
66-
if ($assert_count != '')
67-
{
68-
--echo assert_text='$assert_text' assert_count='$assert_count'
69-
--die !!!ERROR IN TEST: you must set only one of $assert_match or $assert_count
69+
if ($assert_count_condition == '')
70+
{
71+
--die !!!ERROR IN TEST: you must set either $assert_match or $assert_count or $assert_count_condition
72+
}
7073
}
7174
}
7275

76+
--let $_ag_assert_count = 0
77+
if ($assert_match != '') {
78+
--inc $_ag_assert_count
79+
}
80+
if ($assert_count != '') {
81+
--inc $_ag_assert_count
82+
}
83+
if ($assert_count_condition != '') {
84+
--inc $_ag_assert_count
85+
}
86+
if ($_ag_assert_count != 1) {
87+
--echo assert_text='$assert_text' assert_match='$assert_match' assert_count='$assert_count' assert_count_condition='$assert_count_condition'
88+
--die !!!ERROR IN TEST: you must set exactly one of $assert_match, $assert_count or $assert_count_condition
89+
}
7390

7491
--let $include_filename= assert_grep.inc [$assert_text]
7592
--source include/begin_include_file.inc
@@ -80,6 +97,7 @@ if ($assert_match != '')
8097
--let _AG_ASSERT_SELECT= $assert_select
8198
--let _AG_ASSERT_MATCH= $assert_match
8299
--let _AG_ASSERT_COUNT= $assert_count
100+
--let _AG_ASSERT_COUNT_CONDITION= $assert_count_condition
83101
--let _AG_OUT= `SELECT CONCAT('$MYSQLTEST_VARDIR/tmp/_ag_', UUID())`
84102
--let _AG_ASSERT_ONLY_AFTER= $assert_only_after
85103

@@ -91,12 +109,16 @@ if ($assert_match != '')
91109
my $assert_select= $ENV{'_AG_ASSERT_SELECT'};
92110
my $assert_match= $ENV{'_AG_ASSERT_MATCH'};
93111
my $assert_count= $ENV{'_AG_ASSERT_COUNT'};
112+
my $assert_count_condition= $ENV{'_AG_ASSERT_COUNT_CONDITION'};
94113
my $assert_only_after= $ENV{'_AG_ASSERT_ONLY_AFTER'};
95114
my $out= $ENV{'_AG_OUT'};
96115

97116
if (!defined($assert_count)) {
98117
$assert_count = '';
99118
}
119+
if (!defined($assert_count_condition)) {
120+
$assert_count_condition = '';
121+
}
100122

101123
my $result= '';
102124
my $count= 0;
@@ -109,7 +131,7 @@ if ($assert_match != '')
109131
$count = 0;
110132
}
111133
if ($line =~ /$assert_select/) {
112-
if ($assert_count ne '') {
134+
if ($assert_count ne '' || $assert_count_condition ne '') {
113135
$count++;
114136
}
115137
else {
@@ -122,7 +144,12 @@ if ($assert_match != '')
122144
if ($assert_count ne '' && ($count != $assert_count)) {
123145
print OUT ($count) or die("Error $? writing $out: $!");
124146
}
125-
elsif ($assert_count eq '' && $result !~ /$assert_match/) {
147+
elsif ($assert_count_condition ne '' &&
148+
eval "$count $assert_count_condition" eq '') {
149+
print OUT ($count) or die("Error $? writing $out: $!");
150+
}
151+
elsif ($assert_count eq '' && $assert_count_condition eq '' &&
152+
$result !~ /$assert_match/) {
126153
print OUT ($result) or die("Error $? writing $out: $!");
127154
}
128155
else {
@@ -142,6 +169,7 @@ if ($_ag_outcome != 'assert_grep.inc ok')
142169
--echo assert_select: '$assert_select'
143170
--echo assert_match: '$assert_match'
144171
--echo assert_count: '$assert_count'
172+
--echo assert_count_condition: '$assert_count_condition'
145173
--echo assert_only_after: '$assert_only_after'
146174
if ($assert_match != '')
147175
{
@@ -151,6 +179,10 @@ if ($_ag_outcome != 'assert_grep.inc ok')
151179
{
152180
--echo number of matching lines: $_ag_outcome
153181
}
182+
if ($assert_count_condition != '')
183+
{
184+
--echo number of matching lines: $_ag_outcome
185+
}
154186
--die assert_grep.inc failed.
155187
}
156188

mysql-test/include/have_nodebug.inc

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1,5 +1,5 @@
11
let $have_debug = `SELECT (VERSION() LIKE '%debug%')`;
22
if ($have_debug)
33
{
4-
--skip Test requires 'have_nodebug'
4+
--skip Test requires binaries to be compiled in release mode
55
}
Lines changed: 56 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,56 @@
1+
include/master-slave.inc
2+
Warnings:
3+
Note #### Sending passwords in plain text without SSL/TLS is extremely insecure.
4+
Note #### Storing MySQL user name or password information in the connection metadata repository is not secure and is therefore not recommended. Please consider using the USER and PASSWORD connection options for START REPLICA; see the 'START REPLICA Syntax' in the MySQL Manual for more information.
5+
[connection master]
6+
7+
# 1. Start the receiver
8+
9+
[connection slave]
10+
include/start_slave_io.inc
11+
12+
# 2. Create a table
13+
# 3. Generate 2 small transactions, rotate relay log after each
14+
# transaction
15+
16+
[connection master]
17+
CREATE TABLE t (a LONGBLOB) ENGINE = InnoDB;
18+
INSERT INTO t VALUES (REPEAT('a', 1000));
19+
include/sync_slave_io_with_master.inc
20+
FLUSH RELAY LOGS;
21+
[connection master]
22+
INSERT INTO t VALUES (REPEAT('a', 1000));
23+
include/sync_slave_io_with_master.inc
24+
FLUSH RELAY LOGS;
25+
26+
# 4. Generate a transaction that exceeds the relay log space limit, which
27+
# will block the receiver
28+
29+
[connection master]
30+
INSERT INTO t VALUES (REPEAT('a', 10000));
31+
32+
# 5. Run replication applier
33+
34+
[connection slave]
35+
include/start_slave_sql.inc
36+
37+
# 6. Synchronize source and replica servers
38+
39+
[connection master]
40+
include/sync_slave_sql_with_master.inc
41+
42+
# 7. Execute a transaction that cannot fit into the relay log space
43+
# limit. After purging relay log, the limit is ignored.
44+
45+
[connection master]
46+
INSERT INTO t VALUES (REPEAT('a', 15000));
47+
48+
# 8. Synchronize source and replica servers
49+
50+
include/sync_slave_sql_with_master.inc
51+
52+
# Clean up
53+
54+
[connection master]
55+
DROP TABLE t;
56+
include/rpl_end.inc

mysql-test/suite/rpl/r/rpl_relay_log_space_limit_deadlock.result

Lines changed: 0 additions & 43 deletions
This file was deleted.
Lines changed: 49 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,49 @@
1+
include/master-slave.inc
2+
Warnings:
3+
Note #### Sending passwords in plain text without SSL/TLS is extremely insecure.
4+
Note #### Storing MySQL user name or password information in the connection metadata repository is not secure and is therefore not recommended. Please consider using the USER and PASSWORD connection options for START REPLICA; see the 'START REPLICA Syntax' in the MySQL Manual for more information.
5+
[connection master]
6+
7+
# 1. Create a table
8+
# 2. Generate several large transactions, containing large events
9+
10+
CREATE TABLE t (a LONGBLOB) ENGINE = InnoDB;
11+
CREATE PROCEDURE gen_transactions(trx_count INT, ev_count INT, insert_byte_count INT)
12+
BEGIN
13+
SET @x = 0;
14+
REPEAT
15+
START TRANSACTION;
16+
SET @ev_cnt = 0;
17+
REPEAT
18+
INSERT INTO t VALUES (REPEAT('a', insert_byte_count));
19+
SET @ev_cnt = @ev_cnt + 1;
20+
UNTIL @ev_cnt >= ev_count
21+
END REPEAT;
22+
COMMIT;
23+
SET @x = @x + 1;
24+
UNTIL @x >= trx_count
25+
END REPEAT;
26+
END %
27+
CALL gen_transactions(5, 3, 104857600);
28+
29+
# 3. Run replication
30+
31+
[connection slave]
32+
include/start_slave.inc
33+
34+
# 4. Synchronize source and replica servers
35+
36+
[connection master]
37+
include/sync_slave_sql_with_master.inc
38+
39+
# 5. Check Pass condition 1: the number of connects to the source
40+
# is less than or equal to the number of transactions
41+
42+
include/assert_grep.inc [Check Pass condition 1]
43+
44+
# Clean up
45+
46+
[connection master]
47+
DROP TABLE t;
48+
DROP PROCEDURE gen_transactions;
49+
include/rpl_end.inc

mysql-test/suite/rpl/r/rpl_relayspace.result

Lines changed: 0 additions & 25 deletions
This file was deleted.
Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1 @@
1+
--relay-log-space-limit=11000

0 commit comments

Comments
 (0)