add MemberDowngradeUpgrade failpoint #19125
Skipping CI for Draft Pull Request.
Codecov Report
All modified and coverable lines are covered by tests ✅
Additional details and impacted files: see 27 files with indirect coverage changes.
@@            Coverage Diff             @@
##             main   #19125      +/-   ##
==========================================
- Coverage   68.83%   68.74%   -0.09%
==========================================
  Files         420      420
  Lines       35641    35627      -14
==========================================
- Hits        24532    24493      -39
- Misses       9687     9713      +26
+ Partials     1422     1421       -1
Continue to review full report in Codecov by Sentry.
c6e3a18 to 6f8d37f
/cc @ahrtr @serathius
A 5% flake rate on GitHub or Prow might be acceptable, because that test environment is out of our control, but seeing it in your local environment is still a little concerning. Overall the PR looks good. Thanks for the nice work.
if err != nil {
	return nil, err
}
lastVersion := &semver.Version{Major: currentVersion.Major, Minor: currentVersion.Minor - 1}
I wonder: is it worth making a method that calculates this? I know it's a simple line of code.
updated to get it from the last release binary.
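For reference, the helper the reviewer suggested could look roughly like the sketch below. It is not what the PR ended up doing (the author switched to reading the version from the last release binary). `Version` is a local stand-in for `semver.Version`, and `previousMinor` is a hypothetical name.

```go
package main

import "fmt"

// Version is a simplified, stdlib-only stand-in for semver.Version (assumption).
type Version struct {
	Major int64
	Minor int64
}

// previousMinor returns the version one minor release below v.
// It returns an error when Minor is 0, because the previous release
// cannot be derived by simple subtraction in that case.
func previousMinor(v Version) (Version, error) {
	if v.Minor == 0 {
		return Version{}, fmt.Errorf("no previous minor version for %d.%d", v.Major, v.Minor)
	}
	return Version{Major: v.Major, Minor: v.Minor - 1}, nil
}

func main() {
	last, err := previousMinor(Version{Major: 3, Minor: 6})
	if err != nil {
		panic(err)
	}
	fmt.Printf("%d.%d\n", last.Major, last.Minor)
}
```

The explicit error for `Minor == 0` is the main thing a helper adds over the inline expression, which would silently produce a negative minor version.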
DialKeepAliveTime: 10 * time.Second,
DialKeepAliveTimeout: 100 * time.Millisecond,
How did we work out these values? I might use (named) constants here, even in a test.
These are the default options we use in the robustness tests. I had a pending PR to unify them, but it's not high priority.
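A minimal sketch of the named-constant variant the reviewer has in mind. The constant names and the `clientOptions` type are hypothetical; only the two values come from the diff.

```go
package main

import (
	"fmt"
	"time"
)

// Hypothetical names for the keep-alive values shown in the diff.
const (
	dialKeepAliveTime    = 10 * time.Second
	dialKeepAliveTimeout = 100 * time.Millisecond
)

// clientOptions is a stand-in for the real client configuration struct (assumption).
type clientOptions struct {
	DialKeepAliveTime    time.Duration
	DialKeepAliveTimeout time.Duration
}

// defaultClientOptions builds the options from the named constants,
// so every caller shares a single definition of the defaults.
func defaultClientOptions() clientOptions {
	return clientOptions{
		DialKeepAliveTime:    dialKeepAliveTime,
		DialKeepAliveTimeout: dialKeepAliveTimeout,
	}
}

func main() {
	opts := defaultClientOptions()
	fmt.Println(opts.DialKeepAliveTime, opts.DialKeepAliveTimeout)
}
```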
v3_6 := semver.Version{Major: 3, Minor: 6}
// only current version cluster can be downgraded.
return v.Compare(v3_6) >= 0 && (config.Version == e2e.CurrentVersion && member.Config().ExecPath == e2e.BinPath.Etcd)
This seems like it might become fragile. Are we making an assumption about etcd major / minor version?
Yes, downgrade has only been available since 3.6.
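To make the version assumption explicit, the check can be read as the sketch below. `Version` and its `Compare` method are simplified stand-ins for the semver type, and `supportsDowngrade` and `runsCurrentBinary` are hypothetical names that collapse the config and exec-path conditions from the diff into one flag.

```go
package main

import "fmt"

// Version is a simplified, stdlib-only stand-in for semver.Version (assumption).
type Version struct {
	Major int64
	Minor int64
}

// Compare returns -1, 0, or 1 when v is lower than, equal to, or higher than o.
func (v Version) Compare(o Version) int {
	switch {
	case v.Major != o.Major:
		if v.Major < o.Major {
			return -1
		}
		return 1
	case v.Minor != o.Minor:
		if v.Minor < o.Minor {
			return -1
		}
		return 1
	}
	return 0
}

// minDowngradeVersion names the assumption stated in the review: downgrade exists since 3.6.
var minDowngradeVersion = Version{Major: 3, Minor: 6}

// supportsDowngrade reports whether a member can be downgraded: it must be
// at least 3.6 and must be running the current-version binary.
func supportsDowngrade(v Version, runsCurrentBinary bool) bool {
	return v.Compare(minDowngradeVersion) >= 0 && runsCurrentBinary
}

func main() {
	fmt.Println(supportsDowngrade(Version{Major: 3, Minor: 6}, true))
	fmt.Println(supportsDowngrade(Version{Major: 3, Minor: 5}, true))
}
```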
6f8d37f to bfafe8c
bfafe8c to f2b4513
Thanks for working on this @siyuanfoundation
f2b4513 to 4c9d908
/retest
})
} else {
// for upgrade, the cluster version will be changed to the higher version if all the members have finished upgrading.
if membersChanged == len(clus.Procs) {
Maybe validate it outside the for loop?
done
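A rough illustration of the suggestion for the upgrade path: count the changed members inside the loop and decide the expected cluster version once, after it. This is a simplified model, not the merged code. `Version`, `upgradeMembers`, and the in-memory member slice are assumptions, and the assignment stands in for restarting a member with the newer binary.

```go
package main

import "fmt"

// Version is a simplified, stdlib-only stand-in for semver.Version (assumption).
type Version struct {
	Major int64
	Minor int64
}

// upgradeMembers moves the first numToChange members to target and returns the
// cluster version expected afterwards. Per the code comment in the diff, the
// cluster version only rises once all members have finished upgrading.
func upgradeMembers(members []Version, numToChange int, current, target Version) Version {
	membersChanged := 0
	for i := 0; i < numToChange && i < len(members); i++ {
		members[i] = target // stands in for restarting the member on the newer binary
		membersChanged++
	}
	// Single validation point, outside the loop.
	if membersChanged == len(members) {
		return target
	}
	return current
}

func main() {
	low, high := Version{Major: 3, Minor: 5}, Version{Major: 3, Minor: 6}
	partial := upgradeMembers([]Version{low, low, low}, 2, low, high)
	full := upgradeMembers([]Version{low, low, low}, 3, low, high)
	fmt.Printf("partial: %d.%d, full: %d.%d\n", partial.Major, partial.Minor, full.Major, full.Minor)
}
```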
continue
}
break
func downgradeUpgradeMembers(ctx context.Context, t *testing.T, lg *zap.Logger, clus *e2e.EtcdProcessCluster, numberOfMembersToChange int, currentVersion, targetVersion *semver.Version) error {
How much does this code differ from the e2e test? Could we merge them? I would be worried that a small change in the procedure gets made in one place and we forget to update the other.
merged code into e2e.DowngradeUpgradeMembers
Signed-off-by: Siyuan Zhang <[email protected]>
4c9d908 to 8f51613
if lg == nil {
	lg = clus.lg
}
I think it's better to let the caller decide on the logger. This defaulting obscures which logger the caller actually ends up using.
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: ahrtr, serathius, siyuanfoundation
@siyuanfoundation with this PR merged, I assume it will be easy to finish #17976. Can you lead and drive #17976? Thanks.
Please read https://github.com/etcd-io/etcd/blob/main/CONTRIBUTING.md#contribution-flow.
#17118
The MemberDowngradeUpgrade failpoint first downgrades all members, then upgrades some of them. This tests that no robustness issue arises anywhere in the downgrade-upgrade process, which can be stopped at any point.
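The two phases described above can be modeled roughly as follows. This is a toy, in-memory sketch under stated assumptions, not the failpoint implementation: `Version` and `downgradeThenUpgrade` are hypothetical, and the number of members to upgrade is passed in as a parameter rather than chosen by the failpoint.

```go
package main

import "fmt"

// Version is a simplified, stdlib-only stand-in for semver.Version (assumption).
type Version struct {
	Major int64
	Minor int64
}

// downgradeThenUpgrade models the failpoint's two phases on a cluster of
// clusterSize members that all start at current:
//  1. downgrade every member to last;
//  2. upgrade the first numToUpgrade members back to current.
//
// It returns the resulting per-member versions.
func downgradeThenUpgrade(clusterSize, numToUpgrade int, current, last Version) []Version {
	members := make([]Version, clusterSize)
	for i := range members {
		members[i] = current
	}
	// Phase 1: downgrade all members.
	for i := range members {
		members[i] = last
	}
	// Phase 2: upgrade only some members, leaving a mixed-version cluster.
	for i := 0; i < numToUpgrade && i < len(members); i++ {
		members[i] = current
	}
	return members
}

func main() {
	current, last := Version{Major: 3, Minor: 6}, Version{Major: 3, Minor: 5}
	for i, v := range downgradeThenUpgrade(3, 1, current, last) {
		fmt.Printf("member %d: %d.%d\n", i, v.Major, v.Minor)
	}
}
```

Stopping phase 2 early is what leaves the cluster in a mixed-version state, which is the situation the description says the robustness test wants to exercise.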
Tested locally with
There is a little flakiness for reasons unrelated to what the test is exercising, such as a server failing to start. That is mainly because we stop and restart the servers many times. I think <5% is probably acceptable.