Add Robustness test to reproduce issue 18089 #19169

Merged: merged 1 commit into etcd-io:main from jmao/robusttest-18089 on Jan 16, 2025

Conversation

@jmao-dd (Contributor) commented Jan 11, 2025:

Please read https://github.com/etcd-io/etcd/blob/main/CONTRIBUTING.md#contribution-flow.

This is just an idea for how to reproduce issue 18089 using the robustness tests.

With a few tweaks I was able to reproduce the issue by:

  • issuing more Delete requests
  • running frequent compact
  • simulating a slow watch

In my local environment it usually takes 2 to 10 runs (< 1min) to reproduce the issue.

make test-robustness-issue18089
...
logger.go:146: 2025-01-11T00:36:07.292Z     ERROR   Broke watch guarantee   {"guarantee": "resumable", "client": 9, "request": {"Key":"/registry/pods/","Revision":539,"WithPrefix":true,"WithProgressNotify":true,"WithPrevKV":true}, "got-event": {"Type":"put-operation","Key":"/registry/pods/default/jzIW7","Value":{"Value":"563","Hash":0},"Revision":540,"IsCreate":false,"PrevValue":{"Value":{"Value":"560","Hash":0},"ModRevision":537}}, "want-event": {"Type":"delete-operation","Key":"/registry/pods/default/DOj8f","Value":{"Value":"","Hash":0},"Revision":539,"IsCreate":false}}
    validate.go:48: Failed validating watch history, err: broke Resumable - A broken watch can be resumed by establishing a new watch starting after the last revision received in a watch event before the break, so long as the revision is in the history window
    logger.go:146: 2025-01-11T00:36:07.292Z     INFO    Validating serializable operations
    logger.go:146: 2025-01-11T00:36:07.292Z     INFO    Saving robustness test report   {"path": "/tmp/TestRobustnessRegression_Issue18089/1736555767292791721"}
    logger.go:146: 2025-01-11T00:36:07.292Z     INFO    Saving member data dir  {"member": "TestRobustnessRegressionIssue18089-test-0", "path": "/tmp/TestRobustnessRegression_Issue18089/1736555767292791721/server-TestRobustnessRegressionIssue18089-test-0"}
    logger.go:146: 2025-01-11T00:36:07.292Z     INFO    Saving watch operations {"path": "/tmp/TestRobustnessRegression_Issue18089/1736555767292791721/client-1/watch.json"}
    logger.go:146: 2025-01-11T00:36:07.293Z     INFO    no KV operations for client, skip persisting    {"client-id": 1}
    logger.go:146: 2025-01-11T00:36:07.294Z     INFO    no watch operations for client, skip persisting {"client-id": 2}
    logger.go:146: 2025-01-11T00:36:07.294Z     INFO    Saving operation history        {"path": "/tmp/TestRobustnessRegression_Issue18089/1736555767292791721/client-2/operations.json"}
    logger.go:146: 2025-01-11T00:36:07.294Z     INFO    Saving watch operations {"path": "/tmp/TestRobustnessRegression_Issue18089/1736555767292791721/client-3/watch.json"}
    logger.go:146: 2025-01-11T00:36:07.294Z     INFO    Saving operation history        {"path": "/tmp/TestRobustnessRegression_Issue18089/1736555767292791721/client-3/operations.json"}
    logger.go:146: 2025-01-11T00:36:07.294Z     INFO    Saving watch operations {"path": "/tmp/TestRobustnessRegression_Issue18089/1736555767292791721/client-4/watch.json"}
    logger.go:146: 2025-01-11T00:36:07.295Z     INFO    Saving operation history        {"path": "/tmp/TestRobustnessRegression_Issue18089/1736555767292791721/client-4/operations.json"}
    logger.go:146: 2025-01-11T00:36:07.295Z     INFO    Saving watch operations {"path": "/tmp/TestRobustnessRegression_Issue18089/1736555767292791721/client-5/watch.json"}
    logger.go:146: 2025-01-11T00:36:07.296Z     INFO    Saving operation history        {"path": "/tmp/TestRobustnessRegression_Issue18089/1736555767292791721/client-5/operations.json"}
    logger.go:146: 2025-01-11T00:36:07.296Z     INFO    Saving watch operations {"path": "/tmp/TestRobustnessRegression_Issue18089/1736555767292791721/client-6/watch.json"}
    logger.go:146: 2025-01-11T00:36:07.297Z     INFO    Saving operation history        {"path": "/tmp/TestRobustnessRegression_Issue18089/1736555767292791721/client-6/operations.json"}
    logger.go:146: 2025-01-11T00:36:07.297Z     INFO    Saving watch operations {"path": "/tmp/TestRobustnessRegression_Issue18089/1736555767292791721/client-7/watch.json"}
    logger.go:146: 2025-01-11T00:36:07.297Z     INFO    Saving operation history        {"path": "/tmp/TestRobustnessRegression_Issue18089/1736555767292791721/client-7/operations.json"}
    logger.go:146: 2025-01-11T00:36:07.298Z     INFO    Saving watch operations {"path": "/tmp/TestRobustnessRegression_Issue18089/1736555767292791721/client-8/watch.json"}
    logger.go:146: 2025-01-11T00:36:07.298Z     INFO    Saving operation history        {"path": "/tmp/TestRobustnessRegression_Issue18089/1736555767292791721/client-8/operations.json"}
    logger.go:146: 2025-01-11T00:36:07.299Z     INFO    Saving watch operations {"path": "/tmp/TestRobustnessRegression_Issue18089/1736555767292791721/client-9/watch.json"}
    logger.go:146: 2025-01-11T00:36:07.299Z     INFO    Saving operation history        {"path": "/tmp/TestRobustnessRegression_Issue18089/1736555767292791721/client-9/operations.json"}
    logger.go:146: 2025-01-11T00:36:07.299Z     INFO    Saving watch operations {"path": "/tmp/TestRobustnessRegression_Issue18089/1736555767292791721/client-10/watch.json"}
    logger.go:146: 2025-01-11T00:36:07.300Z     INFO    Saving operation history        {"path": "/tmp/TestRobustnessRegression_Issue18089/1736555767292791721/client-10/operations.json"}
    logger.go:146: 2025-01-11T00:36:07.301Z     INFO    no watch operations for client, skip persisting {"client-id": 11}
    logger.go:146: 2025-01-11T00:36:07.301Z     INFO    Saving operation history        {"path": "/tmp/TestRobustnessRegression_Issue18089/1736555767292791721/client-11/operations.json"}
    logger.go:146: 2025-01-11T00:36:07.301Z     INFO    Saving visualization    {"path": "/tmp/TestRobustnessRegression_Issue18089/1736555767292791721/history.html"}
    logger.go:146: 2025-01-11T00:36:08.015Z     INFO    killing server...       {"name": "TestRobustnessRegressionIssue18089-test-0"}
    logger.go:146: 2025-01-11T00:36:08.015Z     INFO    stopping server...      {"name": "TestRobustnessRegressionIssue18089-test-0"}
    logger.go:146: 2025-01-11T00:36:08.018Z     INFO    stopped server. {"name": "TestRobustnessRegressionIssue18089-test-0"}
--- FAIL: TestRobustnessRegression (6.24s)
    --- FAIL: TestRobustnessRegression/Issue18089 (6.21s)
FAIL
FAIL    go.etcd.io/etcd/tests/v3/robustness     54.537s
FAIL
FAIL: (code:1):
  % (cd tests && 'env' 'ETCD_VERIFY=all' 'go' 'test' 'go.etcd.io/etcd/tests/v3/robustness' '-timeout=30m' '-v' '--run=TestRobustnessRegression/Issue18089' '--count' '100' '--failfast' '--bin-dir=/tmp/etcd-v3.5.15-failpoints/bin')
ERROR: Tests for following packages failed:
   go.etcd.io/etcd/tests/v3/robustness
FAIL: 'robustness' FAILED at Sat 11 Jan 2025 12:36:08 AM UTC
make[1]: *** [Makefile:59: test-robustness] Error 255
make[1]: Leaving directory '/home/---/jmao/etcd'
Successful reproduction

@k8s-ci-robot commented:

Hi @jmao-dd. Thanks for your PR.

I'm waiting for an etcd-io member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Resolved review threads (outdated): two on tests/robustness/traffic/traffic.go and one on tests/robustness/traffic/etcd.go.
@@ -71,7 +71,7 @@ func watchUntilRevision(ctx context.Context, t *testing.T, c *RecordingClient, m
defer cancel()
resetWatch:
for {
-watch := c.Watch(ctx, "", lastRevision+1, true, true, false)
+watch := c.Watch(ctx, "", lastRevision, true, true, false)
@jmao-dd (Contributor, Author) commented Jan 12, 2025:

This is changed to simulate a slow watcher.

Member:

The +1 is intended to simulate watching the whole history of revisions, even though a single watch request can break. I don't think this is a good place to simulate a slow watcher; let's discuss other ways to do that.

My quick idea would be to add a sleep when processing a watch event.
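For context, a minimal self-contained sketch (using the public clientv3 API rather than the robustness framework's RecordingClient; the endpoint, key prefix, and sleep placement are illustrative assumptions) of the resume pattern the +1 implements: every re-established watch starts just after the last revision already delivered, so the client still observes the whole history across breaks.

package main

import (
	"context"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"localhost:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	// Start from the store's current revision.
	getResp, err := cli.Get(context.Background(), "/registry/pods/",
		clientv3.WithPrefix(), clientv3.WithKeysOnly())
	if err != nil {
		log.Fatal(err)
	}
	lastRevision := getResp.Header.Revision

	for {
		// Resume strictly after the last revision already received (the "+1").
		wch := cli.Watch(context.Background(), "/registry/pods/",
			clientv3.WithPrefix(), clientv3.WithRev(lastRevision+1), clientv3.WithPrevKV())
		for wresp := range wch {
			if err := wresp.Err(); err != nil {
				// e.g. the requested revision was already compacted.
				log.Printf("watch broke: %v (compact revision %d)", err, wresp.CompactRevision)
				break
			}
			for _, ev := range wresp.Events {
				lastRevision = ev.Kv.ModRevision
				// A client-side slow watcher could be simulated here with a
				// time.Sleep; the PR ultimately uses the server-side
				// SleepBeforeSendWatchResponse failpoint instead.
			}
		}
		// Channel closed or watch canceled: loop and re-establish the watch.
	}
}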

@jmao-dd (Contributor, Author):

Cool, will try adding a sleep to the watcher.

Member:

There is already a failpoint for that: SleepBeforeSendWatchResponse.
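For readers unfamiliar with gofail: once a patched binary is running with GOFAIL_HTTP set, a failpoint like this is typically switched on over that endpoint with a sleep term. A hedged sketch follows; the address, failpoint path, and sleep argument are assumptions for illustration, not the robustness framework's actual wiring.

package main

import (
	"log"
	"net/http"
	"strings"
)

func main() {
	// Ask the patched etcd server to pause before sending each watch
	// response, simulating a slow watcher without touching client code.
	req, err := http.NewRequest(http.MethodPut,
		"http://127.0.0.1:22381/beforeSendWatchResponse",
		strings.NewReader("sleep(100)"))
	if err != nil {
		log.Fatal(err)
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
	log.Println("failpoint enabled:", resp.Status)
}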

@jmao-dd (Contributor, Author):

With the latest commit containing the following changes:

  • SleepBeforeSendWatchResponse failpoint
  • 100ms compact period (from originally 200ms)
  • 50/50 Put and Delete traffic

I executed the command 3 times and reproduced the issue in 51, 71, and 69 seconds. Each round of the test takes about 4.5 seconds in my environment, so that was about 10 to 16 rounds.

If I change the compaction period back to 200ms, it sometimes fails to reproduce the issue.

{Choice: KubernetesCreate, Weight: 5},
{Choice: KubernetesUpdate, Weight: 50},
{Choice: KubernetesDelete, Weight: 25},
{Choice: KubernetesCreate, Weight: 25},
@jmao-dd (Contributor, Author):

This change is to simulate high Delete traffic.
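For illustration, a self-contained sketch (with stand-in types, not the robustness framework's real ones) of the weighted-choice idea behind the traffic change, using the 50/50 Put/Delete split the final commit settles on: roughly half of the written revisions become tombstones, which raises the chance that a compaction lands on a Delete.

package main

import (
	"fmt"
	"math/rand"
)

type choiceWeight struct {
	choice string
	weight int
}

// pick returns one choice with probability proportional to its weight.
func pick(r *rand.Rand, choices []choiceWeight) string {
	total := 0
	for _, c := range choices {
		total += c.weight
	}
	n := r.Intn(total)
	for _, c := range choices {
		if n < c.weight {
			return c.choice
		}
		n -= c.weight
	}
	return choices[len(choices)-1].choice
}

func main() {
	r := rand.New(rand.NewSource(1))
	requests := []choiceWeight{
		{choice: "put", weight: 50},
		{choice: "delete", weight: 50},
	}
	counts := map[string]int{}
	for i := 0; i < 1000; i++ {
		counts[pick(r, requests)]++
	}
	fmt.Println(counts) // roughly 500 puts and 500 deletes
}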

Member:

Any reason you picked Kubernetes traffic to reproduce the issue? I haven't finished the Kubernetes compaction implementation, but long-term, Kubernetes traffic will not compact on Delete, to align with real Kubernetes behavior.

@jmao-dd (Contributor, Author):

No specific reason. I can definitely switch to EtcdPut traffic.

Member:

SG, thanks.

@@ -37,7 +37,7 @@ var (
RequestTimeout = 200 * time.Millisecond
WatchTimeout = time.Second
MultiOpTxnOpCount = 4
-CompactionPeriod = 200 * time.Millisecond
+CompactionPeriod = 50 * time.Millisecond
@jmao-dd (Contributor, Author):

This change is to simulate a high compaction frequency.

Member:

Running compaction creates additional load, which in CI can cause the test to not hit our required QPS numbers. We should double-check that increasing the frequency is OK before merging.

@jmao-dd (Contributor, Author):

I see. With a 200ms compaction period it was difficult for me to reproduce the issue, since the chance of a compaction happening on a Delete request was low. I can try adding a sleep to the watcher while keeping the compaction period at 200ms.

@serathius (Member):

In my local environment it usually takes 2 to 10 runs (< 1min) to reproduce the issue.

This is a really great result; we just need to integrate it better with the test suite to avoid introducing flakiness.

@jmao-dd (Contributor, Author) commented Jan 13, 2025:

Please fix DCO, https://github.com/etcd-io/etcd/pull/19169/checks?check_run_id=35481554576

Thanks for the reminder. Will sign my commit in a real PR. This PR is just a POC and I will close it later.

@serathius (Member):

Thanks for the reminder. Will sign my commit in a real PR. This PR is just a POC and I will close it later.

No need to create a separate PR; you can just force-push a new commit to the same branch, e.g. git push origin HEAD:robusttest-18089 -f.

codecov bot commented Jan 14, 2025:

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 68.80%. Comparing base (75f2ae1) to head (7590a7e).
Report is 52 commits behind head on main.

Additional details and impacted files

see 33 files with indirect coverage changes

@@            Coverage Diff             @@
##             main   #19169      +/-   ##
==========================================
+ Coverage   68.74%   68.80%   +0.06%     
==========================================
  Files         420      420              
  Lines       35647    35629      -18     
==========================================
+ Hits        24504    24514      +10     
+ Misses       9713     9697      -16     
+ Partials     1430     1418      -12     


@jmao-dd jmao-dd force-pushed the jmao/robusttest-18089 branch from 3543858 to 43bfd19 Compare January 14, 2025 03:33
@jmao-dd jmao-dd changed the title from "[POC] Add Robustness test to reproduce issue 18089" to "Add Robustness test to reproduce issue 18089" on January 14, 2025
@jmao-dd jmao-dd marked this pull request as ready for review January 14, 2025 03:34
@@ -49,6 +49,11 @@ test-robustness-issue17780: /tmp/etcd-v3.5.13-compactBeforeSetFinishedCompact/bi
GO_TEST_FLAGS='-v --run=TestRobustnessRegression/Issue17780 --count 200 --failfast --bin-dir=/tmp/etcd-v3.5.13-compactBeforeSetFinishedCompact/bin' make test-robustness && \
echo "Failed to reproduce" || echo "Successful reproduction"

+.PHONY: test-robustness-issue18089
+test-robustness-issue18089: /tmp/etcd-v3.5.12-beforeSendWatchResponse/bin
@jmao-dd (Contributor, Author):

I chose etcd-v3.5.12-beforeSendWatchResponse since only this version has the beforeSendWatchResponse failpoint patched in. Please let me know if I should patch 3.5.15, which is the latest version affected by issue 18089.

@serathius (Member) commented Jan 14, 2025:

It's ok.

@@ -37,7 +37,7 @@ var (
RequestTimeout = 200 * time.Millisecond
WatchTimeout = time.Second
MultiOpTxnOpCount = 4
-CompactionPeriod = 200 * time.Millisecond
+CompactionPeriod = 100 * time.Millisecond
@jmao-dd (Contributor, Author):

Should I make this a configurable value in Profile? For example,

type Profile struct {
	MinimalQPS                     float64
	MaximalQPS                     float64
	MaxNonUniqueRequestConcurrency int
	ClientCount                    int
	ForbidCompaction               bool
	CompactPeriod                  time.Duration
}
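If the field were added, a hedged usage sketch (field names follow the struct above; the concrete values are assumptions, not the test's real profile) of how a dedicated regression profile could tighten compaction without touching the shared default:

// Hypothetical profile for the #18089 regression test only.
profile := Profile{
	MinimalQPS:                     100,
	MaximalQPS:                     200,
	MaxNonUniqueRequestConcurrency: 10,
	ClientCount:                    8,
	ForbidCompaction:               false,
	CompactPeriod:                  100 * time.Millisecond,
}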

Member:

Maybe. How much does it impact reproducibility? How many runs out of 20?

Member:

I ran it and saw that with 100-200ms reproduction is similar: 3-4 out of 100 runs.

GO_TEST_FLAGS='-run=TestRobustnessRegression/Issue18089 -count 100 --bin-dir=/tmp/etcd-v3.5.12-beforeSendWatchResponse/bin' make test-robustness 2>&1 | grep 'FAIL: TestRobustnessRegression/Issue18089'

Member:

OK, setting it to 50ms increases reproducibility 2-3x. Yes, it would make sense to make the compaction period a parameter.

Comment on lines 233 to 234
e2e.WithCompactionBatchLimit(300),
e2e.WithSnapshotCount(1000),
@serathius (Member) commented Jan 14, 2025:

Are those 2 parameters critical for the repro? Can you check how these parameters impact reproducibility? I'm looking for the minimal set of parameters required for the repro.
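For reference, a hedged sketch of where these two options sit; the surrounding call and the other option are assumptions based on the e2e framework, not this test's exact setup:

// Hypothetical cluster setup; the two options under discussion are marked.
clus, err := e2e.NewEtcdProcessCluster(ctx, t,
	e2e.WithClusterSize(1),
	e2e.WithCompactionBatchLimit(300), // under discussion: needed for the repro?
	e2e.WithSnapshotCount(1000),       // under discussion: needed for the repro?
)
if err != nil {
	t.Fatal(err)
}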

@jmao-dd (Contributor, Author):

These 2 don't seem to affect reproducibility much. Will get rid of them.

@jmao-dd jmao-dd requested a review from serathius January 15, 2025 03:05
@serathius (Member) commented Jan 15, 2025:

Can you add the command for repro to https://github.com/etcd-io/etcd/tree/main/tests/robustness#robustness-track-record?

cc @fuweid as the person who knows the most about #18089.

Do you think the proposed traffic and failpoint make sense to reproduce #18089?

@jmao-dd (Contributor, Author) commented Jan 15, 2025:

Can you add the command for repro to https://github.com/etcd-io/etcd/tree/main/tests/robustness#robustness-track-record?

cc @fuweid as the person who knows the most about #18089.

Do you think the proposed traffic and failpoint make sense to reproduce #18089?

Added the make command.

@serathius (Member):

Can you squash the commits?

1) Use SleepBeforeSendWatchResponse failpoint to simulate slow watch
2) Decrease compact period from 200ms to 100ms to increase the probability of compacting on Delete
3) Introduce a new traffic pattern of 50/50 Put and Delete

With these three changes the `make test-robustness-issue18089` command can reproduce issue 18089.

Signed-off-by: Jiayin Mao <[email protected]>
@jmao-dd jmao-dd force-pushed the jmao/robusttest-18089 branch from 1827451 to 7590a7e Compare January 16, 2025 12:51
@jmao-dd (Contributor, Author) commented Jan 16, 2025:

Can you squash the commits?

Done

@k8s-ci-robot commented:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jmao-dd, serathius

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@serathius serathius merged commit 76aadef into etcd-io:main Jan 16, 2025
17 checks passed