
[Flaky Test] kind-ipv6-master-parallel #83903

Closed
hasheddan opened this issue Oct 14, 2019 · 19 comments

@hasheddan
Contributor

Which jobs are failing: kind-ipv6-master-parallel

Which test(s) are failing:
The primary concern is the Overall result failing when starting up the kind cluster, but the following tests have also failed intermittently:

  • [k8s.io] [sig-node] Pods Extended [k8s.io] Delete Grace Period should be submitted and removed [Conformance]
  • [sig-apps] StatefulSet [k8s.io] Basic StatefulSet functionality [StatefulSetBasic] Should recreate evicted statefulset [Conformance]

Since when has it been failing: seeing the Overall failure as early as 9/30

Testgrid link: https://k8s-testgrid.appspot.com/sig-release-master-blocking#kind-ipv6-master-parallel

Reason for failure:

W1001 01:12:24.390] [control-plane] Crea
W1001 01:12:24.390]  ✗ Starting control-plane 🕹️
W1001 01:12:24.391] ERROR: failed to create cluster: failed to init node with kubeadm: command "docker exec --privileged kind-control-plane kubeadm init --ignore-preflight-errors=all --config=/kind/kubeadm.conf --skip-token-print --v=6" failed with error: signal: broken pipe
W1001 01:12:24.393] 
W1001 01:12:24.393] Output:
W1001 01:12:24.393] I1001 01:12:18.884072     101 initconfiguration.go:190] loading configuration from "/kind/kubeadm.conf"
W1001 01:12:24.393] [config] WARNING: Ignored YAML document with GroupVersionKind kubeadm.k8s.io/v1beta2, Kind=JoinConfiguration
W1001 01:12:24.393] [init] Using Kubernetes version: v1.17.0-alpha.0.1939+c9f8b8b52526a7
[... 145 lines skipped ...]
W1001 01:12:32.637]   File "/workspace/./test-infra/jenkins/../scenarios/execute.py", line 30, in check
W1001 01:12:32.637]     subprocess.check_call(cmd)
W1001 01:12:32.637]   File "/usr/lib/python2.7/subprocess.py", line 186, in check_call
W1001 01:12:32.638]     raise CalledProcessError(retcode, cmd)
W1001 01:12:32.638] subprocess.CalledProcessError: Command '('./../../sigs.k8s.io/kind/hack/ci/e2e.sh',)' returned non-zero exit status 1
E1001 01:12:32.649] Command failed
I1001 01:12:32.650] process 553 exited with code 1 after 7.2m
E1001 01:12:32.650] FAIL: ci-kubernetes-kind-ipv6-e2e-parallel

Anything else we need to know:
I did some light investigation, and it appears the failure is occurring when initializing kubeadm here. This might just be a situation where we want to bump the number of failures before alerting :)

/priority important-soon
/milestone v1.17
/kind flake
/sig testing

/cc @alenkacz @droslean @Verolop @epk @aojea @BenTheElder

@hasheddan hasheddan added the kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. label Oct 14, 2019
@k8s-ci-robot k8s-ci-robot added the priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. label Oct 14, 2019
@k8s-ci-robot k8s-ci-robot added this to the v1.17 milestone Oct 14, 2019
@k8s-ci-robot k8s-ci-robot added kind/flake Categorizes issue or PR as related to a flaky test. sig/testing Categorizes an issue or PR as relevant to SIG Testing. labels Oct 14, 2019
@liggitt
Member

liggitt commented Oct 14, 2019

cc @BenTheElder for kind
cc @neolit123 for kubeadm

@aojea
Member

aojea commented Oct 14, 2019

I was chasing flakes in this job and we have a fix ready to merge that solves some of them. The problem is not related to the kubeadm init process; more info in the following issue: kubernetes-sigs/kind#944

My bad, my apologies: I just found the errors mentioned in this issue and they seem totally legitimate; I will try to narrow it down.

@aojea
Member

aojea commented Oct 14, 2019

/assign

@neolit123
Member

I think the flakes are on the /test/e2e side and not on the deployer side (kind / kubeadm).

@aojea
Member

aojea commented Oct 14, 2019

I think the flakes are on the /test/e2e side and not on the deployer side (kind / kubeadm).

Indeed, the main flakes are on the e2e side; however, I realized that there are 3 failures in the last 200 runs in the provisioning phase.

Let's first fix the main flakes on the e2e side; those are the ones causing the instability.

@BenTheElder
Member

We do indeed have occasional flakes on bringup. However, I want to note that the tests possibly being more flaky may in fact be due to this being the only IPv6 coverage, and we should try to keep that coverage; otherwise things will certainly be worse.

Indeed, the main flakes are on the e2e side; however, I realized that there are 3 failures in the last 200 runs in the provisioning phase.

1.5% is not enough to worry about much. That's a wonderful rate for kubernetes e2e :-)
We should fix this, but that's not worth a kubernetes/kubernetes bug, and we are already tracking this in the kind repo.

The other flakes, which occur at a much higher rate, are the tests themselves.

@BenTheElder
Member

I would also be happy to simply exclude the N flakiest tests if it means we continue to monitor IPv6 functioning overall, while we continue working on getting that unrelated 1.5% down to 0%.

@BenTheElder
Member

I don't see a comparable test-suite running "on cloud" in release-master-blocking... that seems somewhat alarming. Going to raise this with SIG-release.

@BenTheElder
Member

Tracking the ~1.5% kubeadm init (or docker exec? TBD) flakes here kubernetes-sigs/kind#949

Unrelated to that, some of these test cases are just flaky: [sig-apps] StatefulSet [k8s.io] Basic StatefulSet functionality [StatefulSetBasic] Should recreate evicted statefulset [Conformance] is also flaking on GCE in addition to kind.

https://testgrid.k8s.io/conformance-gce#GCE,%20master%20(dev)&width=5
https://prow.k8s.io/view/gcs/kubernetes-jenkins/logs/ci-kubernetes-gce-conformance-latest/1183014193268264960

I don't think the test cases are flaking due to KIND.

@aojea
Member

aojea commented Oct 16, 2019

I've found another problem with the IPv6 tests. Now that we fixed the kindnet issues, it seems that there is a problem with iptables locks in kube-proxy:

creating chain "KUBE-FORWARD": exit status 4: Another app is currently holding the xtables lock. Stopped waiting after 5s.
2019-10-16T07:31:12.454753624Z stderr F E1016 07:31:12.431789       1 proxier.go:728] Failed to ensure that nat chain KUBE-SERVICES exists: error creating chain "KUBE-SERVICES": exit status 4: Another app is currently holding the xtables lock. Stopped waiting after 5s.
2019-10-16T07:31:20.948920599Z stderr F E1016 07:31:20.946122       1 proxier.go:736] Failed to ensure that filter chain FORWARD jumps to KUBE-FORWARD: error checking rule: exit status 4: Another app is currently holding the xtables lock. Stopped waiting after 5s.
2019-10-16T07:31:26.149562861Z stderr F E1016 07:31:26.136238       1 proxier.go:728] Failed to ensure that filter chain KUBE-SERVICES exists: error creating chain "KUBE-SERVICES": exit status 4: Another app is currently holding the xtables lock. Stopped waiting after 5s.
2019-10-16T07:31:31.630348819Z stderr F E1016 07:31:31.629192       1 proxier.go:728] Failed to ensure that filter chain KUBE-EXTERNAL-SERVICES exists: error creating chain "KUBE-EXTERNAL-SERVICES": exit status 4: Another app is currently holding the xtables lock. Stopped waiting after 5s.
2019-10-16T07:31:42.118248633Z stderr F E1016 07:31:42.034709       1 proxier.go:736] Failed to ensure that filter chain INPUT jumps to KUBE-SERVICES: error checking rule: exit status 4: Another app is currently holding the xtables lock. Stopped waiting after 5s.

I have to dig more into this. Is it possible that the ip6tables-restore implementation is slower than the IPv4 one?
@danwinship you were refactoring this code in kube-proxy recently; any idea how I can try to pinpoint the problem here?
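For reference, a minimal sketch (not from this thread) of how the lock contention above can be reproduced outside kube-proxy. It assumes a Linux host with iptables >= 1.6, where the lock is the file /run/xtables.lock, that flock(1) is available, and that it runs as root; the filenames and timings are illustrative only:

```go
// Minimal reproduction sketch of the xtables lock contention shown in the
// kube-proxy log above. Assumes a Linux host with iptables >= 1.6, where the
// lock is the file /run/xtables.lock, flock(1) is installed, and the program
// runs as root.
package main

import (
	"fmt"
	"os/exec"
	"time"
)

func main() {
	// Hold the xtables lock for 10 seconds, standing in for another writer
	// (a concurrent proxier sync, a CNI daemon, etc.) that is mid-update.
	holder := exec.Command("flock", "/run/xtables.lock", "sleep", "10")
	if err := holder.Start(); err != nil {
		panic(err)
	}
	defer holder.Wait()

	time.Sleep(500 * time.Millisecond) // give the holder time to grab the lock

	// Mirror kube-proxy's behaviour: wait up to 5 seconds for the lock.
	// With the lock held, ip6tables gives up and exits with status 4, which
	// matches the "Stopped waiting after 5s" / exit status 4 errors in the log.
	start := time.Now()
	out, err := exec.Command("ip6tables", "-w", "5", "-S").CombinedOutput()
	fmt.Printf("waited %s, err=%v\n%s", time.Since(start).Round(time.Second), err, out)
	if ee, ok := err.(*exec.ExitError); ok {
		fmt.Println("exit code:", ee.ExitCode()) // expected: 4
	}
}
```

With the lock held by another process, ip6tables gives up after the 5-second wait and exits with status 4, which is the same error code kube-proxy is logging here.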

@danwinship
Contributor

There's no reason ip6tables-restore should be any slower than the IPv4 version.
Can you point me to the logs that output came from?

@aojea
Member

aojea commented Oct 16, 2019

There's no reason ip6tables-restore should be any slower than the IPv4 version.
Can you point me to the logs that output came from?

This is an example of one failing job

https://prow.k8s.io/view/gcs/kubernetes-jenkins/pr-logs/pull/sigs.k8s.io_kind/954/pull-kind-conformance-parallel-ipv6/1184365445528948736/

and these are the logs of kube-proxy on one of the worker nodes, with the errors mentioned before

https://storage.googleapis.com/kubernetes-jenkins/pr-logs/pull/sigs.k8s.io_kind/954/pull-kind-conformance-parallel-ipv6/1184365445528948736/artifacts/logs/kind-worker/containers/kube-proxy-5wfgh_kube-system_kube-proxy-9e0d81d3a87cee09f7919029aa4b41c8122dfb2ee17ac3889e173e4dd4ddd251.log

@aojea
Member

aojea commented Oct 16, 2019

Another flake with iptables errors. This time it wasn't ip6tables-restore; it is the new kindnet, which uses go-iptables:

2019-10-16T18:20:33.983136705Z stdout F Node kind-worker2 has CIDR fd00:10:244:0:1::/80 
2019-10-16T18:20:39.810650256Z stderr F panic: Can't synchronize rules after 3 attempts: running [/sbin/ip6tables -t nat -C KIND-MASQ-AGENT -j MASQUERADE -m comment --comment kind-masq-agent: outbound traffic is subject to MASQUERADE (must be last in chain) --wait]: exit status 4: ip6tables: Resource temporarily unavailable.
2019-10-16T18:20:39.810729436Z stderr F 
2019-10-16T18:20:39.810738416Z stderr F 
2019-10-16T18:20:39.810745057Z stderr F goroutine 20 [running]:
2019-10-16T18:20:39.810754027Z stderr F main.main.func1(0xc00029e150)
2019-10-16T18:20:39.810761796Z stderr F 	/src/cmd/kindnetd/main.go:79 +0x93
2019-10-16T18:20:39.810768298Z stderr F created by main.main
2019-10-16T18:20:39.810789908Z stderr F 	/src/cmd/kindnetd/main.go:77 +0x32a

Job link https://prow.k8s.io/view/gcs/kubernetes-jenkins/logs/ci-kubernetes-kind-conformance-parallel-ipv6/1184528903125864450

Log https://storage.googleapis.com/kubernetes-jenkins/logs/ci-kubernetes-kind-conformance-parallel-ipv6/1184528903125864450/artifacts/logs/kind-worker/pods/kube-system_kindnet-7c82z_7a2d7628-c682-42b2-bdeb-b3a60420969c/kindnet-cni/0.log
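For illustration, a hedged sketch (not the actual kindnet code) of how a caller of go-iptables could back off while the xtables lock is contended, rather than giving up after a fixed number of immediate attempts. It assumes github.com/coreos/go-iptables and that the KIND-MASQ-AGENT chain already exists in the nat table; names and retry parameters are illustrative:

```go
// Hedged sketch, not the actual kindnet implementation: retry an IPv6
// masquerade rule with exponential backoff when ip6tables reports that the
// xtables lock is busy (exit status 4, as in the panic above).
// Assumes the KIND-MASQ-AGENT chain already exists in the nat table.
package main

import (
	"log"
	"time"

	"github.com/coreos/go-iptables/iptables"
)

func ensureMasqRule() error {
	ipt, err := iptables.NewWithProtocol(iptables.ProtocolIPv6)
	if err != nil {
		return err
	}

	// Same rule the kindnet panic is checking for.
	rule := []string{
		"-j", "MASQUERADE",
		"-m", "comment", "--comment",
		"kind-masq-agent: outbound traffic is subject to MASQUERADE (must be last in chain)",
	}

	backoff := time.Second
	var lastErr error
	for attempt := 1; attempt <= 5; attempt++ {
		// AppendUnique checks for the rule (ip6tables -C) and appends it
		// only if it is missing, so it is safe to call repeatedly.
		if lastErr = ipt.AppendUnique("nat", "KIND-MASQ-AGENT", rule...); lastErr == nil {
			return nil
		}
		log.Printf("attempt %d failed: %v; retrying in %v", attempt, lastErr, backoff)
		time.Sleep(backoff)
		backoff *= 2 // 1s, 2s, 4s, 8s... leaves room for the lock holder to finish
	}
	return lastErr
}

func main() {
	if err := ensureMasqRule(); err != nil {
		log.Fatalf("could not ensure IPv6 masquerade rule: %v", err)
	}
}
```

Backing off between attempts gives whatever is holding the lock (kube-proxy, in this case) time to finish its sync, instead of exhausting all retries within a couple of seconds.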

@droslean
Member

bc25814...6a5f0e6 seems to fix the problem.

@aojea
Member

aojea commented Oct 18, 2019

@droslean I can't see the relationship between that diff and the problem of iptables holding the lock; can you expand a bit more?

@droslean
Member

We noticed that after those commits, the problem was resolved. For more details about the job, please check https://k8s-testgrid.appspot.com/sig-release-master-blocking#kind-ipv6-master-parallel

@aojea
Member

aojea commented Oct 18, 2019

Ahh well, that's a good observation, but not necessarily a definitive indication; there were longer periods without any failure before :-)

As you can see in Ben's and earlier comments in this thread (#83903 (comment)), this is something we have been monitoring closely for a long time, and we are aware that there are different things creating flakiness to a greater or lesser degree.

We have identified some of them and merged patches in kind and test-infra that are clearly improving the situation.

Anyway, I think you are right about stability, and we can close this issue.

/close

@k8s-ci-robot
Contributor

@aojea: Closing this issue.

In response to this:

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@BenTheElder
Member

W1001 01:12:24.391] ERROR: failed to create cluster: failed to init node with kubeadm: command "docker exec --privileged kind-control-plane kubeadm init --ignore-preflight-errors=all --config=/kind/kubeadm.conf --skip-token-print --v=6" failed with error: signal: broken pipe

TL;DR: pretty sure this is permanently fixed now, FWIW; we had fun tracking this down 😅
