add MemberDowngradeUpgrade failpoint #19125

Merged
merged 1 commit into from
Jan 15, 2025

siyuanfoundation
Contributor

@siyuanfoundation siyuanfoundation commented Jan 3, 2025

Please read https://github.com/etcd-io/etcd/blob/main/CONTRIBUTING.md#contribution-flow.

#17118

The MemberDowngradeUpgrade failpoint first downgrades all members, then upgrades some of them. The goal is to verify that no robustness issues occur during the whole downgrade-upgrade process, which can be stopped at any point.

Tested locally with

go test -run TestRobustnessExploratory -v --count 100 --failfast --timeout 5h

There is a little flakiness due to reasons unrelated to what the test wants to do, like a server fails to start. But that's mainly because we are stopping and restarting the servers many times. I think a failure rate below 5% is probably acceptable.
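
The downgrade-then-upgrade sequence described above can be pictured with a small simulation. This is only a sketch: the `member` type, the `downgradeThenUpgrade` function, and the version strings are hypothetical and are not taken from the PR, which operates on real etcd processes rather than in-memory values.

```go
package main

import "fmt"

// member is a hypothetical stand-in for a cluster member; it only tracks
// which binary version the member is running.
type member struct {
	name    string
	version string
}

// downgradeThenUpgrade moves every member to lastVersion, then moves the
// first numToUpgrade members back to currentVersion. It returns an error
// if numToUpgrade is out of range.
func downgradeThenUpgrade(members []member, currentVersion, lastVersion string, numToUpgrade int) error {
	if numToUpgrade < 0 || numToUpgrade > len(members) {
		return fmt.Errorf("numToUpgrade %d out of range [0, %d]", numToUpgrade, len(members))
	}
	// Step 1: downgrade all members.
	for i := range members {
		members[i].version = lastVersion
	}
	// Step 2: upgrade only some of them, leaving a mixed-version cluster.
	for i := 0; i < numToUpgrade; i++ {
		members[i].version = currentVersion
	}
	return nil
}

func main() {
	cluster := []member{{"m0", "3.6"}, {"m1", "3.6"}, {"m2", "3.6"}}
	if err := downgradeThenUpgrade(cluster, "3.6", "3.5", 1); err != nil {
		panic(err)
	}
	for _, m := range cluster {
		fmt.Printf("%s runs %s\n", m.name, m.version)
	}
}
```

Stopping after either step, or partway through one, leaves the cluster in a state the robustness test is meant to tolerate.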

@k8s-ci-robot

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

codecov bot commented Jan 3, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 68.74%. Comparing base (ebb2b06) to head (8f51613).
Report is 13 commits behind head on main.

Additional details and impacted files

see 27 files with indirect coverage changes

@@            Coverage Diff             @@
##             main   #19125      +/-   ##
==========================================
- Coverage   68.83%   68.74%   -0.09%     
==========================================
  Files         420      420              
  Lines       35641    35627      -14     
==========================================
- Hits        24532    24493      -39     
- Misses       9687     9713      +26     
+ Partials     1422     1421       -1     


@siyuanfoundation
Contributor Author

/cc @ahrtr @serathius

@ahrtr
Member

ahrtr commented Jan 11, 2025

like a server fails to start. But that's mainly because we are stopping and restarting the servers many times.

A 5% flake rate on GitHub or Prow might be acceptable, because the test environment is out of our control, but in your local environment it is still a little concerning.

Overall the PR looks good. Thanks for the nice work.

if err != nil {
	return nil, err
}
lastVersion := &semver.Version{Major: currentVersion.Major, Minor: currentVersion.Minor - 1}

I wonder: is it worth making a method that calculates this? I know it's a simple line of code.

Contributor Author

updated to get it from the last release binary.
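
For reference, the kind of helper the reviewer asked about might look like the sketch below. It is not what was merged (the author switched to reading the version from the last release binary instead). The `version` type and `previousMinor` function are hypothetical; the guard for a zero minor version is an added assumption, since subtracting one from minor version 0 would not give a valid version.

```go
package main

import (
	"errors"
	"fmt"
)

// version is a hypothetical, minimal stand-in for a semantic version.
type version struct {
	Major int64
	Minor int64
}

// previousMinor returns the previous minor version within the same major
// version. It fails for minor version 0, where no such version exists.
func previousMinor(v version) (version, error) {
	if v.Minor == 0 {
		return version{}, errors.New("no previous minor version within the same major")
	}
	return version{Major: v.Major, Minor: v.Minor - 1}, nil
}

func main() {
	prev, err := previousMinor(version{Major: 3, Minor: 6})
	if err != nil {
		panic(err)
	}
	fmt.Printf("%d.%d\n", prev.Major, prev.Minor) // prints "3.5"
}
```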

Comment on lines +221 to +222
DialKeepAliveTime: 10 * time.Second,
DialKeepAliveTimeout: 100 * time.Millisecond,

How did we work out these values? I might use (named) constants here, even in a test.

Member

They are the default options we use in the robustness tests. I had a pending PR to unify them, but it's not high priority.

Comment on lines 265 to 263
v3_6 := semver.Version{Major: 3, Minor: 6}
// only current version cluster can be downgraded.
return v.Compare(v3_6) >= 0 && (config.Version == e2e.CurrentVersion && member.Config().ExecPath == e2e.BinPath.Etcd)

This seems like it might become fragile. Are we making an assumption about etcd major / minor version?

Contributor Author

yes, downgrade is only available since 3.6
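
One way to make the 3.6 assumption explicit is to name the minimum version and isolate the comparison, roughly as below. This is a sketch under assumptions: the `version` type, `atLeast`, `downgradeMinVersion`, and `canDowngrade` are hypothetical, and the boolean parameter stands in for the "current version cluster running the current binary" condition in the quoted code.

```go
package main

import "fmt"

// version is a hypothetical, minimal stand-in for a semantic version.
type version struct {
	Major int64
	Minor int64
}

// atLeast reports whether v is greater than or equal to other,
// comparing the major version first and then the minor version.
func (v version) atLeast(other version) bool {
	if v.Major != other.Major {
		return v.Major > other.Major
	}
	return v.Minor >= other.Minor
}

// downgradeMinVersion records the assumption from the review thread:
// downgrade is only available since 3.6.
var downgradeMinVersion = version{Major: 3, Minor: 6}

// canDowngrade reports whether a member is eligible for downgrade.
// runsCurrentBinary stands in for the current-version checks in the PR.
func canDowngrade(v version, runsCurrentBinary bool) bool {
	return v.atLeast(downgradeMinVersion) && runsCurrentBinary
}

func main() {
	fmt.Println(canDowngrade(version{Major: 3, Minor: 6}, true)) // prints "true"
	fmt.Println(canDowngrade(version{Major: 3, Minor: 5}, true)) // prints "false"
}
```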

@sftim

sftim commented Jan 13, 2025

Thanks for working on this @siyuanfoundation

@siyuanfoundation
Contributor Author

/retest

})
} else {
// for upgrade, the cluster version will be changed to the higher version if all the members have finished upgrading.
if membersChanged == len(clus.Procs) {
Member

Maybe validate it outside the for loop?

Contributor Author

done
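
A rough sketch of the shape this suggestion points at: the loop only changes members and counts them, and the cluster-version expectation is computed once afterwards. The function names and string versions are hypothetical; the rule that the cluster version rises only after all members have upgraded comes from the code comment quoted above.

```go
package main

import "fmt"

// upgradeMembers sets the first n entries of versions to newVersion and
// returns how many members were changed. It performs no validation itself.
func upgradeMembers(versions []string, n int, newVersion string) int {
	changed := 0
	for i := 0; i < n && i < len(versions); i++ {
		versions[i] = newVersion
		changed++
	}
	return changed
}

// expectedClusterVersion returns the version the cluster should report:
// the new version only if every member has finished upgrading.
func expectedClusterVersion(total, changed int, oldVersion, newVersion string) string {
	if changed == total {
		return newVersion
	}
	return oldVersion
}

func main() {
	versions := []string{"3.5", "3.5", "3.5"}
	changed := upgradeMembers(versions, 2, "3.6")
	// The expectation is evaluated once, after the loop has finished.
	fmt.Println(expectedClusterVersion(len(versions), changed, "3.5", "3.6")) // prints "3.5"
}
```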

continue
}
break
func downgradeUpgradeMembers(ctx context.Context, t *testing.T, lg *zap.Logger, clus *e2e.EtcdProcessCluster, numberOfMembersToChange int, currentVersion, targetVersion *semver.Version) error {
Member

How much does this code differ from the e2e test? Could we merge them? I would be worried that there is some small change in the procedure and we forget to update both places.

Contributor Author

merged code into e2e.DowngradeUpgradeMembers

tests/framework/e2e/downgrade.go
Comment on lines +57 to +59
if lg == nil {
	lg = clus.lg
}
Member

I think it's better to let the caller decide which logger to use. This defaulting hides from the caller which logger is actually used.

@k8s-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ahrtr, serathius, siyuanfoundation


@serathius serathius merged commit ce4b4e5 into etcd-io:main Jan 15, 2025
34 checks passed
@ahrtr
Member

ahrtr commented Jan 15, 2025

@siyuanfoundation with this PR merged, I assume it should be easy to finish #17976. Can you lead and drive #17976? Thanks.


5 participants