Add Robustness test to reproduce issue 18089 #19169
Conversation
Hi @jmao-dd. Thanks for your PR. I'm waiting for an etcd-io member to verify that this patch is reasonable to test. If it is, they should reply with `/ok-to-test`. Once the patch is verified, the new status will be reflected by the `ok-to-test` label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
tests/robustness/client/watch.go (Outdated)
@@ -71,7 +71,7 @@ func watchUntilRevision(ctx context.Context, t *testing.T, c *RecordingClient, m
 	defer cancel()
 resetWatch:
 	for {
-		watch := c.Watch(ctx, "", lastRevision+1, true, true, false)
+		watch := c.Watch(ctx, "", lastRevision, true, true, false)
This is changed to simulate a slow watcher.
This +1 is intended to simulate watching the whole history of revisions while any single watch request can break partway. I don't think this is a good place to simulate a slow watcher; let's discuss other ways to do that.
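For readers skimming the thread, here is a minimal sketch of the resume pattern the +1 implements, written against plain clientv3 rather than the suite's RecordingClient; the watchAllHistory helper and the key/option choices are illustrative assumptions, not the test's actual code.

```go
package main

import (
	"context"
	"log"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// watchAllHistory keeps re-opening the watch from lastRevision+1 whenever the
// current stream breaks, so every revision is still observed exactly once
// across restarts even though any individual watch request may fail.
func watchAllHistory(ctx context.Context, c *clientv3.Client, startRev int64) {
	lastRevision := startRev
	for ctx.Err() == nil {
		wch := c.Watch(ctx, "", clientv3.WithPrefix(), clientv3.WithRev(lastRevision+1))
		for resp := range wch {
			if err := resp.Err(); err != nil {
				log.Printf("watch broke, resuming from %d: %v", lastRevision+1, err)
				break // re-open the watch from the next unseen revision
			}
			for _, ev := range resp.Events {
				lastRevision = ev.Kv.ModRevision
			}
		}
	}
}

func main() {
	c, err := clientv3.New(clientv3.Config{Endpoints: []string{"localhost:2379"}})
	if err != nil {
		log.Fatal(err)
	}
	defer c.Close()
	watchAllHistory(context.Background(), c, 0)
}
```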
My quick idea would be to add a sleep when processing a watch event.
Cool, will try adding sleep to watcher.
There is already a failpoint for that: SleepBeforeSendWatchResponse.
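For context on what that failpoint does in practice, below is a hedged sketch of activating a gofail sleep on beforeSendWatchResponse over gofail's HTTP endpoint; the address (127.0.0.1:22381), the enableSlowWatch helper, and the integer-millisecond sleep(100) term are assumptions for illustration, not the robustness suite's actual failpoint plumbing.

```go
package main

import (
	"fmt"
	"net/http"
	"strings"
)

// enableSlowWatch activates the beforeSendWatchResponse gofail failpoint with
// a sleep action over gofail's HTTP endpoint, pausing each watch response
// delivery and thus simulating a slow watcher. The endpoint address is
// whatever GOFAIL_HTTP was set to when the member started; 127.0.0.1:22381 is
// only an example, and sleep(100) assumes gofail's millisecond sleep term.
func enableSlowWatch(gofailAddr string, sleepMs int) error {
	url := fmt.Sprintf("http://%s/beforeSendWatchResponse", gofailAddr)
	req, err := http.NewRequest(http.MethodPut, url, strings.NewReader(fmt.Sprintf("sleep(%d)", sleepMs)))
	if err != nil {
		return err
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode/100 != 2 {
		return fmt.Errorf("setting failpoint failed: %s", resp.Status)
	}
	return nil
}

func main() {
	if err := enableSlowWatch("127.0.0.1:22381", 100); err != nil {
		panic(err)
	}
}
```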
With the latest commit, which has the following changes:
- SleepBeforeSendWatchResponse failpoint
- 100ms compact period (down from the original 200ms)
- 50/50 Put and Delete traffic
I executed the command 3 times and reproduced the issue in 51, 71, and 69 seconds. Each test round takes about 4.5 seconds in my environment, so that was about 10 to 16 rounds.
If I change the compaction period back to 200ms, it sometimes fails to reproduce the issue.
{Choice: KubernetesCreate, Weight: 5},
{Choice: KubernetesUpdate, Weight: 50},
{Choice: KubernetesDelete, Weight: 25},
{Choice: KubernetesCreate, Weight: 25},
This change is to simulate high Delete traffic.
Any reason you picked Kubernetes traffic to reproduce the issue? I haven't finished the Kubernetes compaction implementation, but long term, Kubernetes traffic will not compact on Delete, to align with real Kubernetes behavior.
No specific reason. I can definitely switch to EtcdPut traffic.
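Since the request mix is just weighted random choice, the sketch below shows how a 50/50 Put/Delete split (the mix the final commit settles on) falls out of the weights; the ChoiceWeight type and pick helper are illustrative stand-ins, not the suite's actual random helpers.

```go
package main

import (
	"fmt"
	"math/rand"
)

// ChoiceWeight mirrors the weighted-choice shape used in the traffic config;
// the names here are illustrative, not the suite's actual types.
type ChoiceWeight struct {
	Choice string
	Weight int
}

// pick returns one choice with probability proportional to its weight.
func pick(r *rand.Rand, choices []ChoiceWeight) string {
	total := 0
	for _, c := range choices {
		total += c.Weight
	}
	n := r.Intn(total)
	for _, c := range choices {
		if n < c.Weight {
			return c.Choice
		}
		n -= c.Weight
	}
	return choices[len(choices)-1].Choice
}

func main() {
	// 50/50 Put and Delete, the mix the repro ends up using.
	mix := []ChoiceWeight{
		{Choice: "Put", Weight: 50},
		{Choice: "Delete", Weight: 50},
	}
	r := rand.New(rand.NewSource(0))
	counts := map[string]int{}
	for i := 0; i < 1000; i++ {
		counts[pick(r, mix)]++
	}
	fmt.Println(counts) // roughly equal Put and Delete counts
}
```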
SG, thanks.
tests/robustness/traffic/traffic.go (Outdated)
@@ -37,7 +37,7 @@ var (
 	RequestTimeout = 200 * time.Millisecond
 	WatchTimeout = time.Second
 	MultiOpTxnOpCount = 4
-	CompactionPeriod = 200 * time.Millisecond
+	CompactionPeriod = 50 * time.Millisecond
This change is to simulate a higher compaction frequency.
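To make the reasoning concrete, the sketch below is only an illustration (plain clientv3 and a hypothetical periodicCompact helper, not the traffic package's actual compaction code) of a compaction loop ticking every CompactionPeriod; the shorter the tick, the more likely a compaction revision lands right on a just-issued Delete, which is the situation this repro needs.

```go
package main

import (
	"context"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// periodicCompact compacts up to the latest observed revision every `period`.
// A shorter period means compactions land more often right behind the most
// recent writes, raising the chance that one of them compacts the revision of
// a just-deleted key.
func periodicCompact(ctx context.Context, c *clientv3.Client, period time.Duration) {
	ticker := time.NewTicker(period)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			// Any read works; we only need the current store revision.
			resp, err := c.Get(ctx, "compaction-probe")
			if err != nil {
				log.Printf("get failed: %v", err)
				continue
			}
			if _, err := c.Compact(ctx, resp.Header.Revision); err != nil {
				log.Printf("compact failed: %v", err)
			}
		}
	}
}

func main() {
	c, err := clientv3.New(clientv3.Config{Endpoints: []string{"localhost:2379"}, DialTimeout: 2 * time.Second})
	if err != nil {
		log.Fatal(err)
	}
	defer c.Close()
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()
	periodicCompact(ctx, c, 100*time.Millisecond)
}
```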
Running compaction creates additional load, which in CI can cause the test to miss our required QPS numbers. We should double-check that increasing the frequency is OK before merging.
I see. With a 200ms compaction period it was difficult for me to reproduce the issue, since the chance of a compaction happening on a Delete request was low. I can try adding a sleep to the watcher while keeping the compaction period at 200ms.
This is a really great result; we just need to integrate it better with the test suite to avoid introducing flakiness.
Thanks for the reminder. Will sign my commit in a real PR. This PR is just a POC and I will close it later.
No need to create a separate PR, you can just push a new commit to the same branch using a force push.
Codecov Report: All modified and coverable lines are covered by tests ✅
Additional details and impacted files: see 33 files with indirect coverage changes.

@@            Coverage Diff             @@
##             main   #19169      +/-   ##
==========================================
+ Coverage   68.74%   68.80%   +0.06%
==========================================
  Files         420      420
  Lines       35647    35629      -18
==========================================
+ Hits        24504    24514      +10
+ Misses       9713     9697      -16
+ Partials     1430     1418      -12

Continue to review full report in Codecov by Sentry.
Force-pushed from 3543858 to 43bfd19 (Compare).
@@ -49,6 +49,11 @@ test-robustness-issue17780: /tmp/etcd-v3.5.13-compactBeforeSetFinishedCompact/bi
 	GO_TEST_FLAGS='-v --run=TestRobustnessRegression/Issue17780 --count 200 --failfast --bin-dir=/tmp/etcd-v3.5.13-compactBeforeSetFinishedCompact/bin' make test-robustness && \
 	echo "Failed to reproduce" || echo "Successful reproduction"

+.PHONY: test-robustness-issue18089
+test-robustness-issue18089: /tmp/etcd-v3.5.12-beforeSendWatchResponse/bin
I chose etcd-v3.5.12-beforeSendWatchResponse since only this version has the beforeSendWatchResponse failpoint patched. Please let me know if I should patch 3.5.15, which is the latest version affected by issue 18089.
It's ok.
tests/robustness/traffic/traffic.go (Outdated)
@@ -37,7 +37,7 @@ var (
 	RequestTimeout = 200 * time.Millisecond
 	WatchTimeout = time.Second
 	MultiOpTxnOpCount = 4
-	CompactionPeriod = 200 * time.Millisecond
+	CompactionPeriod = 100 * time.Millisecond
Should I make this a configurable value in Profile? For example,
type Profile struct {
	MinimalQPS                     float64
	MaximalQPS                     float64
	MaxNonUniqueRequestConcurrency int
	ClientCount                    int
	ForbidCompaction               bool
	CompactPeriod                  time.Duration
}
Maybe; how much does it impact reproducibility? How many runs out of 20?
I ran it and saw that with 100-200ms the reproduction rate is similar: 3-4 failures out of 100 runs.
GO_TEST_FLAGS='-run=TestRobustnessRegression/Issue18089 -count 100 --bin-dir=/tmp/etcd-v3.5.12-beforeSendWatchResponse/bin' make test-robustness 2>&1 | grep 'FAIL: TestRobustnessRegression/Issue18089'
Ok, setting it to 50ms increases reproducibility 2-3x. Yes, it would make sense to make the compaction period a parameter.
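If it does become a parameter, one minimal way to keep existing profiles on today's default is sketched below; the compactPeriod accessor and the zero-value fallback are assumptions layered on the Profile fields proposed above, not existing code.

```go
package main

import (
	"fmt"
	"time"
)

// Assumed declarations, mirroring the snippets quoted earlier in the thread.
var CompactionPeriod = 200 * time.Millisecond

type Profile struct {
	MinimalQPS                     float64
	MaximalQPS                     float64
	MaxNonUniqueRequestConcurrency int
	ClientCount                    int
	ForbidCompaction               bool
	CompactPeriod                  time.Duration
}

// compactPeriod is a hypothetical accessor: profiles that set CompactPeriod
// use it, everything else keeps the current package-level default.
func (p Profile) compactPeriod() time.Duration {
	if p.CompactPeriod == 0 {
		return CompactionPeriod
	}
	return p.CompactPeriod
}

func main() {
	defaultProfile := Profile{ClientCount: 8}
	reproProfile := Profile{ClientCount: 8, CompactPeriod: 100 * time.Millisecond}
	fmt.Println(defaultProfile.compactPeriod(), reproProfile.compactPeriod()) // 200ms, 100ms
}
```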
e2e.WithCompactionBatchLimit(300),
e2e.WithSnapshotCount(1000),
Are those 2 parameters critical for the repro? Can you check how these parameters impact reproducibility? I'm looking for the minimal set of parameters required for the repro.
These 2 don't seem to affect reproducibility much. Will get rid of them.
Can you add the command for the repro to https://github.com/etcd-io/etcd/tree/main/tests/robustness#robustness-track-record? cc @fuweid as the person that knows most about #18089. Do you think the proposed traffic and failpoint make sense to reproduce #18089?
Added the make command.
Can you squash the commits?
1) Use SleepBeforeSendWatchResponse failpoint to simulate slow watch
2) Decrease compact period from 200ms to 100ms to increase the probability of compacting on Delete
3) Introduce a new traffic pattern of 50/50 Put and Delete

With these three changes the `make test-robustness-issue18089` command can reproduce issue 18089.

Signed-off-by: Jiayin Mao <[email protected]>
Force-pushed from 1827451 to 7590a7e (Compare).
Done
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: jmao-dd, serathius. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
Please read https://github.com/etcd-io/etcd/blob/main/CONTRIBUTING.md#contribution-flow.
This is just an idea about how to reproduce issue 18089 using robustness tests.
With a few tweaks I was able to reproduce the issue by using the SleepBeforeSendWatchResponse failpoint, shortening the compaction period, and running a 50/50 Put and Delete traffic pattern.
In my local environment it usually takes 2 to 10 runs (< 1min) to reproduce the issue.