[Flaky Test] kind-ipv6-master-parallel #83903
Comments
cc @BenTheElder for kind
I was chasing flakes in this job and we have a fix ready to merge that solves some of them. My bad, my apologies: I just found the errors mentioned in this issue and they seem totally legit; I will try to narrow it down.
/assign
I think the flakes are on the
Indeed, the main flakes are on the e2e side; however, I realized that there are 3 failures in the last 200 runs in the provisioning phase. Let's first fix the main flakes on the e2e side, since those are the ones causing the instability.
We do indeed have occasional flakes on bring-up; however, I want to note that the tests possibly being more flaky may in fact be due to this being the only IPv6 coverage, and we should try to keep that coverage, otherwise things will certainly get worse.
1.5% is not enough to worry about much. That's a wonderful rate for Kubernetes e2e :-) The other flakes, which occur at a much higher rate, are the tests
I would also be happy to simply exclude the N flakiest tests if it means we continue to monitor IPv6 functioning overall while we keep working on driving that unrelated 1.5% to 0%.
I don't see a comparable test suite running "on cloud" in release-master-blocking... that seems somewhat alarming. Going to raise this with SIG Release.
Tracking the ~1.5%. Unrelated to that, some of these test cases are just flaky: https://testgrid.k8s.io/conformance-gce#GCE,%20master%20(dev)&width=5
I don't think the test cases are flaking due to KIND.
I've found another problem with the IPv6 tests. Now that we have fixed the kindnet issues, it seems there is a problem with iptables locks in kube-proxy.
I have to dig more into this; is it possible that the ip6tables-restore implementation is slower than the IPv4 one?
There's no reason ip6tables-restore should be any slower than the IPv4 version.
This is an example of one failing job, and these are the logs of kube-proxy on one of the worker nodes with the errors mentioned before.
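For context on the lock errors above: kube-proxy applies its rules by piping a full rule dump into iptables-restore / ip6tables-restore, and every iptables caller on the node contends for the same xtables lock. Below is a minimal Go sketch of that invocation pattern, using only the standard --wait and --noflush flags of ip6tables-restore; it is an illustration, not kube-proxy's actual code.

```go
package main

import (
	"bytes"
	"fmt"
	"os/exec"
)

// restoreIPv6Rules pipes a rule dump into ip6tables-restore, asking it to
// wait on the shared xtables lock instead of failing immediately when some
// other component on the node currently holds it.
func restoreIPv6Rules(rules []byte) error {
	// --wait: block until the xtables lock is free.
	// --noflush: leave chains that are not mentioned in the input untouched.
	cmd := exec.Command("ip6tables-restore", "--wait", "--noflush")
	cmd.Stdin = bytes.NewReader(rules)
	if out, err := cmd.CombinedOutput(); err != nil {
		return fmt.Errorf("ip6tables-restore failed: %v: %s", err, out)
	}
	return nil
}

func main() {
	// Hypothetical rule dump, for illustration only.
	rules := []byte("*filter\n-A FORWARD -j ACCEPT\nCOMMIT\n")
	if err := restoreIPv6Rules(rules); err != nil {
		fmt.Println(err)
	}
}
```

With --wait, a restore that hits a busy lock blocks until the other caller releases it rather than erroring out immediately.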
Another flake with iptables errors; this time it wasn't ip6tables-restore, it's the new kindnet that uses go-iptables.
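Since go-iptables came up: that library is a thin wrapper that shells out to the iptables / ip6tables binaries, so each call competes for the same xtables lock as kube-proxy's restores. A small hedged sketch of IPv6 usage with github.com/coreos/go-iptables, for illustration only (this is not kindnet's code):

```go
package main

import (
	"log"

	"github.com/coreos/go-iptables/iptables"
)

func main() {
	// Open a handle that shells out to the ip6tables binary.
	ipt, err := iptables.NewWithProtocol(iptables.ProtocolIPv6)
	if err != nil {
		log.Fatalf("creating ip6tables handle: %v", err)
	}
	// AppendUnique only adds the rule if it is not already present, keeping
	// a reconcile loop idempotent. The chain and rule here are illustrative.
	if err := ipt.AppendUnique("filter", "FORWARD", "-j", "ACCEPT"); err != nil {
		log.Fatalf("appending FORWARD rule: %v", err)
	}
}
```

Each method call is a separate ip6tables invocation, so a busy reconcile loop can keep the lock contended even when every individual call eventually succeeds.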
bc25814...6a5f0e6 seems to fix the problem.
@droslean I can't see the relationship between that diff and the problem of iptables holding the lock; can you expand a bit more?
We noticed that after those commits, the problem was resolved. For more details about the job, please check https://k8s-testgrid.appspot.com/sig-release-master-blocking#kind-ipv6-master-parallel
Ahh well, that's a good observation but not necessarily a definitive indication; there were longer periods before without any failure :-) As you can see in Ben's and previous comments in this thread (#83903 (comment)), this is something we have been monitoring closely for a long time, and we are aware that there are different things creating flakiness to a greater or lesser degree. We have identified some of them and merged patches in kind and test-infra that are clearly improving the situation. Anyway, I think that you are right about stability and we can close this issue.
/close
@aojea: Closing this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
TL;DR: pretty sure this is permanently fixed now. FWIW, we had fun tracking this down 😅
Which jobs are failing: kind-ipv6-master-parallel
Which test(s) are failing: The primary concern is Overall failing when starting up the kind cluster, but the following tests have also failed intermittently:
Since when has it been failing: seeing the Overall failure as early as 9/30
Testgrid link: https://k8s-testgrid.appspot.com/sig-release-master-blocking#kind-ipv6-master-parallel
Reason for failure:
Anything else we need to know: I did some light investigation and I can see that it appears the failure is occurring when initializing kubeadm here. This might just be a situation where we want to bump the number of failures until alert :) (see the config sketch below)
/priority important-soon
/milestone v1.17
/kind flake
/sig testing
/cc @alenkacz @droslean @Verolop @epk @aojea @BenTheElder
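Regarding the "bump the number of failures until alert" idea above: TestGrid alerting for a Prow job is typically tuned through annotations on the job's config in test-infra. A hedged, illustrative snippet follows; testgrid-num-failures-to-alert is the standard knob, but the value and surrounding fields here are placeholders rather than the job's real configuration.

```yaml
# Illustrative Prow job annotations; the threshold and field values are
# placeholders, not the actual test-infra configuration for this job.
annotations:
  testgrid-dashboards: sig-release-master-blocking
  testgrid-tab-name: kind-ipv6-master-parallel
  testgrid-num-failures-to-alert: "3"  # only alert after 3 consecutive failures
```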