Add Robustness test to reproduce issue 18089 #19169

Merged: merged 1 commit into etcd-io:main from jmao/robusttest-18089 on Jan 16, 2025

Conversation

@jmao-dd (Contributor) commented Jan 11, 2025:

Please read https://github.com/etcd-io/etcd/blob/main/CONTRIBUTING.md#contribution-flow.

This is just an idea for how to reproduce issue 18089 using the robustness tests.

With a few tweaks I was able to reproduce the issue by:

  • issuing more Delete requests
  • running frequent compact
  • simulating a slow watch

In my local environment it usually takes 2 to 10 runs (< 1min) to reproduce the issue.

make test-robustness-issue18089
...
logger.go:146: 2025-01-11T00:36:07.292Z     ERROR   Broke watch guarantee   {"guarantee": "resumable", "client": 9, "request": {"Key":"/registry/pods/","Revision":539,"WithPrefix":true,"WithProgressNotify":true,"WithPrevKV":true}, "got-event": {"Type":"put-operation","Key":"/registry/pods/default/jzIW7","Value":{"Value":"563","Hash":0},"Revision":540,"IsCreate":false,"PrevValue":{"Value":{"Value":"560","Hash":0},"ModRevision":537}}, "want-event": {"Type":"delete-operation","Key":"/registry/pods/default/DOj8f","Value":{"Value":"","Hash":0},"Revision":539,"IsCreate":false}}
    validate.go:48: Failed validating watch history, err: broke Resumable - A broken watch can be resumed by establishing a new watch starting after the last revision received in a watch event before the break, so long as the revision is in the history window
    logger.go:146: 2025-01-11T00:36:07.292Z     INFO    Validating serializable operations
    logger.go:146: 2025-01-11T00:36:07.292Z     INFO    Saving robustness test report   {"path": "/tmp/TestRobustnessRegression_Issue18089/1736555767292791721"}
    logger.go:146: 2025-01-11T00:36:07.292Z     INFO    Saving member data dir  {"member": "TestRobustnessRegressionIssue18089-test-0", "path": "/tmp/TestRobustnessRegression_Issue18089/1736555767292791721/server-TestRobustnessRegressionIssue18089-test-0"}
    logger.go:146: 2025-01-11T00:36:07.292Z     INFO    Saving watch operations {"path": "/tmp/TestRobustnessRegression_Issue18089/1736555767292791721/client-1/watch.json"}
    logger.go:146: 2025-01-11T00:36:07.293Z     INFO    no KV operations for client, skip persisting    {"client-id": 1}
    logger.go:146: 2025-01-11T00:36:07.294Z     INFO    no watch operations for client, skip persisting {"client-id": 2}
    logger.go:146: 2025-01-11T00:36:07.294Z     INFO    Saving operation history        {"path": "/tmp/TestRobustnessRegression_Issue18089/1736555767292791721/client-2/operations.json"}
    logger.go:146: 2025-01-11T00:36:07.294Z     INFO    Saving watch operations {"path": "/tmp/TestRobustnessRegression_Issue18089/1736555767292791721/client-3/watch.json"}
    logger.go:146: 2025-01-11T00:36:07.294Z     INFO    Saving operation history        {"path": "/tmp/TestRobustnessRegression_Issue18089/1736555767292791721/client-3/operations.json"}
    logger.go:146: 2025-01-11T00:36:07.294Z     INFO    Saving watch operations {"path": "/tmp/TestRobustnessRegression_Issue18089/1736555767292791721/client-4/watch.json"}
    logger.go:146: 2025-01-11T00:36:07.295Z     INFO    Saving operation history        {"path": "/tmp/TestRobustnessRegression_Issue18089/1736555767292791721/client-4/operations.json"}
    logger.go:146: 2025-01-11T00:36:07.295Z     INFO    Saving watch operations {"path": "/tmp/TestRobustnessRegression_Issue18089/1736555767292791721/client-5/watch.json"}
    logger.go:146: 2025-01-11T00:36:07.296Z     INFO    Saving operation history        {"path": "/tmp/TestRobustnessRegression_Issue18089/1736555767292791721/client-5/operations.json"}
    logger.go:146: 2025-01-11T00:36:07.296Z     INFO    Saving watch operations {"path": "/tmp/TestRobustnessRegression_Issue18089/1736555767292791721/client-6/watch.json"}
    logger.go:146: 2025-01-11T00:36:07.297Z     INFO    Saving operation history        {"path": "/tmp/TestRobustnessRegression_Issue18089/1736555767292791721/client-6/operations.json"}
    logger.go:146: 2025-01-11T00:36:07.297Z     INFO    Saving watch operations {"path": "/tmp/TestRobustnessRegression_Issue18089/1736555767292791721/client-7/watch.json"}
    logger.go:146: 2025-01-11T00:36:07.297Z     INFO    Saving operation history        {"path": "/tmp/TestRobustnessRegression_Issue18089/1736555767292791721/client-7/operations.json"}
    logger.go:146: 2025-01-11T00:36:07.298Z     INFO    Saving watch operations {"path": "/tmp/TestRobustnessRegression_Issue18089/1736555767292791721/client-8/watch.json"}
    logger.go:146: 2025-01-11T00:36:07.298Z     INFO    Saving operation history        {"path": "/tmp/TestRobustnessRegression_Issue18089/1736555767292791721/client-8/operations.json"}
    logger.go:146: 2025-01-11T00:36:07.299Z     INFO    Saving watch operations {"path": "/tmp/TestRobustnessRegression_Issue18089/1736555767292791721/client-9/watch.json"}
    logger.go:146: 2025-01-11T00:36:07.299Z     INFO    Saving operation history        {"path": "/tmp/TestRobustnessRegression_Issue18089/1736555767292791721/client-9/operations.json"}
    logger.go:146: 2025-01-11T00:36:07.299Z     INFO    Saving watch operations {"path": "/tmp/TestRobustnessRegression_Issue18089/1736555767292791721/client-10/watch.json"}
    logger.go:146: 2025-01-11T00:36:07.300Z     INFO    Saving operation history        {"path": "/tmp/TestRobustnessRegression_Issue18089/1736555767292791721/client-10/operations.json"}
    logger.go:146: 2025-01-11T00:36:07.301Z     INFO    no watch operations for client, skip persisting {"client-id": 11}
    logger.go:146: 2025-01-11T00:36:07.301Z     INFO    Saving operation history        {"path": "/tmp/TestRobustnessRegression_Issue18089/1736555767292791721/client-11/operations.json"}
    logger.go:146: 2025-01-11T00:36:07.301Z     INFO    Saving visualization    {"path": "/tmp/TestRobustnessRegression_Issue18089/1736555767292791721/history.html"}
    logger.go:146: 2025-01-11T00:36:08.015Z     INFO    killing server...       {"name": "TestRobustnessRegressionIssue18089-test-0"}
    logger.go:146: 2025-01-11T00:36:08.015Z     INFO    stopping server...      {"name": "TestRobustnessRegressionIssue18089-test-0"}
    logger.go:146: 2025-01-11T00:36:08.018Z     INFO    stopped server. {"name": "TestRobustnessRegressionIssue18089-test-0"}
--- FAIL: TestRobustnessRegression (6.24s)
    --- FAIL: TestRobustnessRegression/Issue18089 (6.21s)
FAIL
FAIL    go.etcd.io/etcd/tests/v3/robustness     54.537s
FAIL
FAIL: (code:1):
  % (cd tests && 'env' 'ETCD_VERIFY=all' 'go' 'test' 'go.etcd.io/etcd/tests/v3/robustness' '-timeout=30m' '-v' '--run=TestRobustnessRegression/Issue18089' '--count' '100' '--failfast' '--bin-dir=/tmp/etcd-v3.5.15-failpoints/bin')
ERROR: Tests for following packages failed:
   go.etcd.io/etcd/tests/v3/robustness
FAIL: 'robustness' FAILED at Sat 11 Jan 2025 12:36:08 AM UTC
make[1]: *** [Makefile:59: test-robustness] Error 255
make[1]: Leaving directory '/home/---/jmao/etcd'
Successful reproduction

@k8s-ci-robot commented:

Hi @jmao-dd. Thanks for your PR.

I'm waiting for an etcd-io member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Resolved review threads (outdated): two on tests/robustness/traffic/traffic.go and one on tests/robustness/traffic/etcd.go.
@@ -71,7 +71,7 @@ func watchUntilRevision(ctx context.Context, t *testing.T, c *RecordingClient, m
defer cancel()
resetWatch:
for {
-watch := c.Watch(ctx, "", lastRevision+1, true, true, false)
+watch := c.Watch(ctx, "", lastRevision, true, true, false)
@jmao-dd (Contributor, Author) commented Jan 12, 2025:

This is changed to simulate a slow watcher.

Member:

The +1 is intended to simulate watching the whole history of revisions, even though a single watch request can break. I don't think this is a good place to simulate a slow watcher; let's discuss other ways to do that.

My quick idea would be to add a sleep when processing a watch event.
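For context, a minimal self-contained sketch (using the public clientv3 API rather than the robustness framework's RecordingClient; the endpoint, key prefix, and sleep placement are illustrative assumptions) of the resume pattern the +1 implements: every re-established watch starts just after the last revision already delivered, so the client still observes the whole history across breaks.

package main

import (
	"context"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"localhost:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	// Start from the store's current revision.
	getResp, err := cli.Get(context.Background(), "/registry/pods/",
		clientv3.WithPrefix(), clientv3.WithKeysOnly())
	if err != nil {
		log.Fatal(err)
	}
	lastRevision := getResp.Header.Revision

	for {
		// Resume strictly after the last revision already received (the "+1").
		wch := cli.Watch(context.Background(), "/registry/pods/",
			clientv3.WithPrefix(), clientv3.WithRev(lastRevision+1), clientv3.WithPrevKV())
		for wresp := range wch {
			if err := wresp.Err(); err != nil {
				// e.g. the requested revision was already compacted.
				log.Printf("watch broke: %v (compact revision %d)", err, wresp.CompactRevision)
				break
			}
			for _, ev := range wresp.Events {
				lastRevision = ev.Kv.ModRevision
				// A client-side slow watcher could be simulated here with a
				// time.Sleep; the PR ultimately uses the server-side
				// SleepBeforeSendWatchResponse failpoint instead.
			}
		}
		// Channel closed or watch canceled: loop and re-establish the watch.
	}
}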

@jmao-dd (Contributor, Author):

Cool, will try adding a sleep to the watcher.

Member:

There is already a failpoint for that: SleepBeforeSendWatchResponse.
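For readers unfamiliar with gofail: once a patched binary is running with GOFAIL_HTTP set, a failpoint like this is typically switched on over that endpoint with a sleep term. A hedged sketch follows; the address, failpoint path, and sleep argument are assumptions for illustration, not the robustness framework's actual wiring.

package main

import (
	"log"
	"net/http"
	"strings"
)

func main() {
	// Ask the patched etcd server to pause before sending each watch
	// response, simulating a slow watcher without touching client code.
	req, err := http.NewRequest(http.MethodPut,
		"http://127.0.0.1:22381/beforeSendWatchResponse",
		strings.NewReader("sleep(100)"))
	if err != nil {
		log.Fatal(err)
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
	log.Println("failpoint enabled:", resp.Status)
}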

@jmao-dd (Contributor, Author):

With the latest commit containing the following changes:

  • SleepBeforeSendWatchResponse failpoint
  • 100ms compact period (from originally 200ms)
  • 50/50 Put and Delete traffic

I executed the command 3 times and reproduced the issue in 51, 71, and 69 seconds. Each round of the test takes about 4.5 seconds in my environment, so that was about 10 to 16 rounds.

If I change the compaction period back to 200ms, it sometimes fails to reproduce the issue.

{Choice: KubernetesCreate, Weight: 5},
{Choice: KubernetesUpdate, Weight: 50},
{Choice: KubernetesDelete, Weight: 25},
{Choice: KubernetesCreate, Weight: 25},
@jmao-dd (Contributor, Author):

This change is to simulate high Delete traffic.
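For illustration, a self-contained sketch (with stand-in types, not the robustness framework's real ones) of the weighted-choice idea behind the traffic change, using the 50/50 Put/Delete split the final commit settles on: roughly half of the written revisions become tombstones, which raises the chance that a compaction lands on a Delete.

package main

import (
	"fmt"
	"math/rand"
)

type choiceWeight struct {
	choice string
	weight int
}

// pick returns one choice with probability proportional to its weight.
func pick(r *rand.Rand, choices []choiceWeight) string {
	total := 0
	for _, c := range choices {
		total += c.weight
	}
	n := r.Intn(total)
	for _, c := range choices {
		if n < c.weight {
			return c.choice
		}
		n -= c.weight
	}
	return choices[len(choices)-1].choice
}

func main() {
	r := rand.New(rand.NewSource(1))
	requests := []choiceWeight{
		{choice: "put", weight: 50},
		{choice: "delete", weight: 50},
	}
	counts := map[string]int{}
	for i := 0; i < 1000; i++ {
		counts[pick(r, requests)]++
	}
	fmt.Println(counts) // roughly 500 puts and 500 deletes
}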

Member:

Any reason you picked Kubernetes traffic to reproduce the issue? I haven't finished the Kubernetes compaction implementation, but long-term, Kubernetes traffic will not compact on Delete, to align with real Kubernetes behavior.

@jmao-dd (Contributor, Author):

No specific reason. I can definitely switch to EtcdPut traffic.

Member:

SG, thanks.

@@ -37,7 +37,7 @@ var (
RequestTimeout = 200 * time.Millisecond
WatchTimeout = time.Second
MultiOpTxnOpCount = 4
-CompactionPeriod = 200 * time.Millisecond
+CompactionPeriod = 50 * time.Millisecond
@jmao-dd (Contributor, Author):

This change is to simulate a high compaction frequency.

Member:

Running compaction creates additional load, which in CI can cause the test to not hit our required QPS numbers. We should double-check that increasing the frequency is OK before merging.

@jmao-dd (Contributor, Author):

I see. With a 200ms compaction period it was difficult for me to reproduce the issue, since the chance of a compaction happening on a Delete request was low. I can try adding a sleep to the watcher while keeping the compaction period at 200ms.

@serathius (Member):

In my local environment it usually takes 2 to 10 runs (< 1min) to reproduce the issue.

This is a really great result; we just need to integrate it better with the test suite to avoid introducing flakiness.

@jmao-dd (Contributor, Author) commented Jan 13, 2025:

Please fix DCO, https://github.com/etcd-io/etcd/pull/19169/checks?check_run_id=35481554576

Thanks for the reminder. Will sign my commit in a real PR. This PR is just a POC and I will close it later.

@serathius (Member):

Thanks for the reminder. Will sign my commit in a real PR. This PR is just a POC and I will close it later.

No need to create a separate PR; you can just force-push a new commit to the same branch, e.g. git push origin HEAD:robusttest-18089 -f.

codecov bot commented Jan 14, 2025:

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 68.80%. Comparing base (75f2ae1) to head (7590a7e).
Report is 52 commits behind head on main.

Additional details and impacted files

see 33 files with indirect coverage changes

@@            Coverage Diff             @@
##             main   #19169      +/-   ##
==========================================
+ Coverage   68.74%   68.80%   +0.06%     
==========================================
  Files         420      420              
  Lines       35647    35629      -18     
==========================================
+ Hits        24504    24514      +10     
+ Misses       9713     9697      -16     
+ Partials     1430     1418      -12     


@jmao-dd jmao-dd force-pushed the jmao/robusttest-18089 branch from 3543858 to 43bfd19 Compare January 14, 2025 03:33
@jmao-dd jmao-dd changed the title from "[POC] Add Robustness test to reproduce issue 18089" to "Add Robustness test to reproduce issue 18089" on January 14, 2025
@jmao-dd jmao-dd marked this pull request as ready for review January 14, 2025 03:34
@@ -49,6 +49,11 @@ test-robustness-issue17780: /tmp/etcd-v3.5.13-compactBeforeSetFinishedCompact/bi
GO_TEST_FLAGS='-v --run=TestRobustnessRegression/Issue17780 --count 200 --failfast --bin-dir=/tmp/etcd-v3.5.13-compactBeforeSetFinishedCompact/bin' make test-robustness && \
echo "Failed to reproduce" || echo "Successful reproduction"

+.PHONY: test-robustness-issue18089
+test-robustness-issue18089: /tmp/etcd-v3.5.12-beforeSendWatchResponse/bin
@jmao-dd (Contributor, Author):

I chose etcd-v3.5.12-beforeSendWatchResponse since only this version has the beforeSendWatchResponse failpoint patched in. Please let me know if I should patch 3.5.15, which is the latest version affected by issue 18089.

@serathius (Member) commented Jan 14, 2025:

It's ok.

@@ -37,7 +37,7 @@ var (
RequestTimeout = 200 * time.Millisecond
WatchTimeout = time.Second
MultiOpTxnOpCount = 4
-CompactionPeriod = 200 * time.Millisecond
+CompactionPeriod = 100 * time.Millisecond
@jmao-dd (Contributor, Author):

Should I make this a configurable value in Profile? For example,

type Profile struct {
	MinimalQPS                     float64
	MaximalQPS                     float64
	MaxNonUniqueRequestConcurrency int
	ClientCount                    int
	ForbidCompaction               bool
	CompactPeriod                  time.Duration
}
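If the field were added, a hedged usage sketch (field names follow the struct above; the concrete values are assumptions, not the test's real profile) of how a dedicated regression profile could tighten compaction without touching the shared default:

// Hypothetical profile for the #18089 regression test only.
profile := Profile{
	MinimalQPS:                     100,
	MaximalQPS:                     200,
	MaxNonUniqueRequestConcurrency: 10,
	ClientCount:                    8,
	ForbidCompaction:               false,
	CompactPeriod:                  100 * time.Millisecond,
}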

Member:

Maybe. How much does it impact reproducibility? How many runs out of 20?

Member:

I ran it and saw that with 100-200ms reproduction is similar: 3-4 out of 100 runs.

GO_TEST_FLAGS='-run=TestRobustnessRegression/Issue18089 -count 100 --bin-dir=/tmp/etcd-v3.5.12-beforeSendWatchResponse/bin' make test-robustness 2>&1 | grep 'FAIL: TestRobustnessRegression/Issue18089'

Member:

OK, setting it to 50ms increases reproducibility 2-3x. Yes, it would make sense to make the compaction period a parameter.

Comment on lines 233 to 234
e2e.WithCompactionBatchLimit(300),
e2e.WithSnapshotCount(1000),
@serathius (Member) commented Jan 14, 2025:

Are those 2 parameters critical for the repro? Can you check how these parameters impact reproducibility? I'm looking for the minimal set of parameters required for the repro.
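For reference, a hedged sketch of where these two options sit; the surrounding call and the other option are assumptions based on the e2e framework, not this test's exact setup:

// Hypothetical cluster setup; the two options under discussion are marked.
clus, err := e2e.NewEtcdProcessCluster(ctx, t,
	e2e.WithClusterSize(1),
	e2e.WithCompactionBatchLimit(300), // under discussion: needed for the repro?
	e2e.WithSnapshotCount(1000),       // under discussion: needed for the repro?
)
if err != nil {
	t.Fatal(err)
}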

@jmao-dd (Contributor, Author):

These 2 don't seem to affect reproducibility much. Will get rid of them.

@jmao-dd jmao-dd requested a review from serathius January 15, 2025 03:05
@serathius (Member) commented Jan 15, 2025:

Can you add the command for repro to https://github.com/etcd-io/etcd/tree/main/tests/robustness#robustness-track-record?

cc @fuweid as the person who knows the most about #18089.

Do you think the proposed traffic and failpoint make sense to reproduce #18089?

@jmao-dd (Contributor, Author) commented Jan 15, 2025:

Can you add the command for repro to https://github.com/etcd-io/etcd/tree/main/tests/robustness#robustness-track-record?

cc @fuweid as the person who knows the most about #18089.

Do you think the proposed traffic and failpoint make sense to reproduce #18089?

Added the make command.

@serathius (Member):

Can you squash the commits?

1) Use SleepBeforeSendWatchResponse failpoint to simulate slow watch
2) Decrease compact period from 200ms to 100ms to increase the probability of compacting on Delete
3) Introduce a new traffic pattern of 50/50 Put and Delete

With these three changes the `make test-robustness-issue18089` command can reproduce issue 18089.

Signed-off-by: Jiayin Mao <[email protected]>
@jmao-dd jmao-dd force-pushed the jmao/robusttest-18089 branch from 1827451 to 7590a7e Compare January 16, 2025 12:51
@jmao-dd (Contributor, Author) commented Jan 16, 2025:

Can you squash the commits?

Done

@k8s-ci-robot commented:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jmao-dd, serathius

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@serathius serathius merged commit 76aadef into etcd-io:main Jan 16, 2025
17 checks passed