
[Flaky Test] kind-ipv6-master-parallel #83903

Closed
hasheddan opened this issue Oct 14, 2019 · 19 comments

@hasheddan
Contributor

Which jobs are failing: kind-ipv6-master-parallel

Which test(s) are failing:
The primary concern is the Overall result failing when starting up the kind cluster, but the following tests have also failed intermittently:

  • [k8s.io] [sig-node] Pods Extended [k8s.io] Delete Grace Period should be submitted and removed [Conformance]
  • [sig-apps] StatefulSet [k8s.io] Basic StatefulSet functionality [StatefulSetBasic] Should recreate evicted statefulset [Conformance]

Since when has it been failing: seeing the Overall failure as early as 9/30

Testgrid link: https://k8s-testgrid.appspot.com/sig-release-master-blocking#kind-ipv6-master-parallel

Reason for failure:

W1001 01:12:24.390] [control-plane] Crea
W1001 01:12:24.390]  ✗ Starting control-plane 🕹️
W1001 01:12:24.391] ERROR: failed to create cluster: failed to init node with kubeadm: command "docker exec --privileged kind-control-plane kubeadm init --ignore-preflight-errors=all --config=/kind/kubeadm.conf --skip-token-print --v=6" failed with error: signal: broken pipe
W1001 01:12:24.393] 
W1001 01:12:24.393] Output:
W1001 01:12:24.393] I1001 01:12:18.884072     101 initconfiguration.go:190] loading configuration from "/kind/kubeadm.conf"
W1001 01:12:24.393] [config] WARNING: Ignored YAML document with GroupVersionKind kubeadm.k8s.io/v1beta2, Kind=JoinConfiguration
W1001 01:12:24.393] [init] Using Kubernetes version: v1.17.0-alpha.0.1939+c9f8b8b52526a7
[... 145 lines skipped ...]
W1001 01:12:32.637]   File "/workspace/./test-infra/jenkins/../scenarios/execute.py", line 30, in check
W1001 01:12:32.637]     subprocess.check_call(cmd)
W1001 01:12:32.637]   File "/usr/lib/python2.7/subprocess.py", line 186, in check_call
W1001 01:12:32.638]     raise CalledProcessError(retcode, cmd)
W1001 01:12:32.638] subprocess.CalledProcessError: Command '('./../../sigs.k8s.io/kind/hack/ci/e2e.sh',)' returned non-zero exit status 1
E1001 01:12:32.649] Command failed
I1001 01:12:32.650] process 553 exited with code 1 after 7.2m
E1001 01:12:32.650] FAIL: ci-kubernetes-kind-ipv6-e2e-parallel

Anything else we need to know:
I did some light investigation, and it appears the failure is occurring when initializing kubeadm here. This might just be a situation where we want to bump the number of failures before alerting :)

/priority important-soon
/milestone v1.17
/kind flake
/sig testing

/cc @alenkacz @droslean @Verolop @epk @aojea @BenTheElder

@hasheddan hasheddan added the kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. label Oct 14, 2019
@k8s-ci-robot k8s-ci-robot added the priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. label Oct 14, 2019
@k8s-ci-robot k8s-ci-robot added this to the v1.17 milestone Oct 14, 2019
@k8s-ci-robot k8s-ci-robot added kind/flake Categorizes issue or PR as related to a flaky test. sig/testing Categorizes an issue or PR as relevant to SIG Testing. labels Oct 14, 2019
@liggitt
Member

liggitt commented Oct 14, 2019

cc @BenTheElder for kind
cc @neolit123 for kubeadm

@aojea
Member

aojea commented Oct 14, 2019

I was chasing flakes in this job and we have a fix ready to merge that solves some of them. The problem is not related to the kubeadm init process; more info in the following issue: kubernetes-sigs/kind#944

My bad, my apologies: I just found the errors mentioned in this issue and they seem totally legitimate; I will try to narrow it down.

@aojea
Member

aojea commented Oct 14, 2019

/assign

@neolit123
Member

I think the flakes are on the /test/e2e side and not on the deployer side (kind / kubeadm).

@aojea
Member

aojea commented Oct 14, 2019

I think the flakes are on the /test/e2e side and not on the deployer side (kind / kubeadm).

Indeed, the main flakes are on the e2e side; however, I realized that there are 3 failures in the last 200 runs in the provisioning phase.

Let's first fix the main flakes on the e2e side; those are the ones causing the instability.

@BenTheElder
Member

We do indeed have occasional flakes on bringup. However, I want to note that the tests possibly being more flaky may in fact be due to this being the only IPv6 coverage, and we should try to keep that coverage; otherwise things will certainly be worse.

Indeed, the main flakes are on the e2e side; however, I realized that there are 3 failures in the last 200 runs in the provisioning phase.

1.5% is not enough to worry about much. That's a wonderful rate for kubernetes e2e :-)
We should fix this, but that's not worth a kubernetes/kubernetes bug, and we are already tracking this in the kind repo.

The other flakes, which occur at a much higher rate, are the tests themselves.

@BenTheElder
Member

I would also be happy to simply exclude the N flakiest tests if it means we continue to monitor IPv6 functioning overall, while we continue working on getting that unrelated 1.5% down to 0%.

@BenTheElder
Member

I don't see a comparable test-suite running "on cloud" in release-master-blocking... that seems somewhat alarming. Going to raise this with SIG-release.

@BenTheElder
Member

Tracking the ~1.5% kubeadm init (or docker exec? TBD) flakes here kubernetes-sigs/kind#949

Unrelated to that, some of these test cases are just flaky: [sig-apps] StatefulSet [k8s.io] Basic StatefulSet functionality [StatefulSetBasic] Should recreate evicted statefulset [Conformance] is also flaking on GCE in addition to kind.

https://testgrid.k8s.io/conformance-gce#GCE,%20master%20(dev)&width=5
https://prow.k8s.io/view/gcs/kubernetes-jenkins/logs/ci-kubernetes-gce-conformance-latest/1183014193268264960

I don't think the test cases are flaking due to KIND.

@aojea
Member

aojea commented Oct 16, 2019

I've found another problem with the IPv6 tests. Now that we fixed the kindnet issues, it seems that there is a problem with iptables locks in kube-proxy:

creating chain "KUBE-FORWARD": exit status 4: Another app is currently holding the xtables lock. Stopped waiting after 5s.
2019-10-16T07:31:12.454753624Z stderr F E1016 07:31:12.431789       1 proxier.go:728] Failed to ensure that nat chain KUBE-SERVICES exists: error creating chain "KUBE-SERVICES": exit status 4: Another app is currently holding the xtables lock. Stopped waiting after 5s.
2019-10-16T07:31:20.948920599Z stderr F E1016 07:31:20.946122       1 proxier.go:736] Failed to ensure that filter chain FORWARD jumps to KUBE-FORWARD: error checking rule: exit status 4: Another app is currently holding the xtables lock. Stopped waiting after 5s.
2019-10-16T07:31:26.149562861Z stderr F E1016 07:31:26.136238       1 proxier.go:728] Failed to ensure that filter chain KUBE-SERVICES exists: error creating chain "KUBE-SERVICES": exit status 4: Another app is currently holding the xtables lock. Stopped waiting after 5s.
2019-10-16T07:31:31.630348819Z stderr F E1016 07:31:31.629192       1 proxier.go:728] Failed to ensure that filter chain KUBE-EXTERNAL-SERVICES exists: error creating chain "KUBE-EXTERNAL-SERVICES": exit status 4: Another app is currently holding the xtables lock. Stopped waiting after 5s.
2019-10-16T07:31:42.118248633Z stderr F E1016 07:31:42.034709       1 proxier.go:736] Failed to ensure that filter chain INPUT jumps to KUBE-SERVICES: error checking rule: exit status 4: Another app is currently holding the xtables lock. Stopped waiting after 5s.

I have to dig more into this. Is it possible that the ip6tables-restore implementation is slower than the IPv4 one?
@danwinship you were refactoring this code in kube-proxy recently; any idea how I can try to pinpoint the problem here?
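For reference, a minimal sketch (not from this thread) of how the lock contention above can be reproduced outside kube-proxy. It assumes a Linux host with iptables >= 1.6, where the lock is the file /run/xtables.lock, that flock(1) is available, and that it runs as root; the filenames and timings are illustrative only:

```go
// Minimal reproduction sketch of the xtables lock contention shown in the
// kube-proxy log above. Assumes a Linux host with iptables >= 1.6, where the
// lock is the file /run/xtables.lock, flock(1) is installed, and the program
// runs as root.
package main

import (
	"fmt"
	"os/exec"
	"time"
)

func main() {
	// Hold the xtables lock for 10 seconds, standing in for another writer
	// (a concurrent proxier sync, a CNI daemon, etc.) that is mid-update.
	holder := exec.Command("flock", "/run/xtables.lock", "sleep", "10")
	if err := holder.Start(); err != nil {
		panic(err)
	}
	defer holder.Wait()

	time.Sleep(500 * time.Millisecond) // give the holder time to grab the lock

	// Mirror kube-proxy's behaviour: wait up to 5 seconds for the lock.
	// With the lock held, ip6tables gives up and exits with status 4, which
	// matches the "Stopped waiting after 5s" / exit status 4 errors in the log.
	start := time.Now()
	out, err := exec.Command("ip6tables", "-w", "5", "-S").CombinedOutput()
	fmt.Printf("waited %s, err=%v\n%s", time.Since(start).Round(time.Second), err, out)
	if ee, ok := err.(*exec.ExitError); ok {
		fmt.Println("exit code:", ee.ExitCode()) // expected: 4
	}
}
```

With the lock held by another process, ip6tables gives up after the 5-second wait and exits with status 4, which is the same error code kube-proxy is logging here.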

@danwinship
Contributor

There's no reason ip6tables-restore should be any slower than the IPv4 version.
Can you point me to the logs that output came from?

@aojea
Member

aojea commented Oct 16, 2019

There's no reason ip6tables-restore should be any slower than the IPv4 version.
Can you point me to the logs that output came from?

This is an example of one failing job

https://prow.k8s.io/view/gcs/kubernetes-jenkins/pr-logs/pull/sigs.k8s.io_kind/954/pull-kind-conformance-parallel-ipv6/1184365445528948736/

and these are the logs of kube-proxy on one of the worker nodes, with the errors mentioned before

https://storage.googleapis.com/kubernetes-jenkins/pr-logs/pull/sigs.k8s.io_kind/954/pull-kind-conformance-parallel-ipv6/1184365445528948736/artifacts/logs/kind-worker/containers/kube-proxy-5wfgh_kube-system_kube-proxy-9e0d81d3a87cee09f7919029aa4b41c8122dfb2ee17ac3889e173e4dd4ddd251.log

@aojea
Member

aojea commented Oct 16, 2019

Another flake with iptables errors. This time it wasn't ip6tables-restore; it is the new kindnet, which uses go-iptables:

2019-10-16T18:20:33.983136705Z stdout F Node kind-worker2 has CIDR fd00:10:244:0:1::/80 
2019-10-16T18:20:39.810650256Z stderr F panic: Can't synchronize rules after 3 attempts: running [/sbin/ip6tables -t nat -C KIND-MASQ-AGENT -j MASQUERADE -m comment --comment kind-masq-agent: outbound traffic is subject to MASQUERADE (must be last in chain) --wait]: exit status 4: ip6tables: Resource temporarily unavailable.
2019-10-16T18:20:39.810729436Z stderr F 
2019-10-16T18:20:39.810738416Z stderr F 
2019-10-16T18:20:39.810745057Z stderr F goroutine 20 [running]:
2019-10-16T18:20:39.810754027Z stderr F main.main.func1(0xc00029e150)
2019-10-16T18:20:39.810761796Z stderr F 	/src/cmd/kindnetd/main.go:79 +0x93
2019-10-16T18:20:39.810768298Z stderr F created by main.main
2019-10-16T18:20:39.810789908Z stderr F 	/src/cmd/kindnetd/main.go:77 +0x32a

Job link https://prow.k8s.io/view/gcs/kubernetes-jenkins/logs/ci-kubernetes-kind-conformance-parallel-ipv6/1184528903125864450

Log https://storage.googleapis.com/kubernetes-jenkins/logs/ci-kubernetes-kind-conformance-parallel-ipv6/1184528903125864450/artifacts/logs/kind-worker/pods/kube-system_kindnet-7c82z_7a2d7628-c682-42b2-bdeb-b3a60420969c/kindnet-cni/0.log
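For illustration, a hedged sketch (not the actual kindnet code) of how a caller of go-iptables could back off while the xtables lock is contended, rather than giving up after a fixed number of immediate attempts. It assumes github.com/coreos/go-iptables and that the KIND-MASQ-AGENT chain already exists in the nat table; names and retry parameters are illustrative:

```go
// Hedged sketch, not the actual kindnet implementation: retry an IPv6
// masquerade rule with exponential backoff when ip6tables reports that the
// xtables lock is busy (exit status 4, as in the panic above).
// Assumes the KIND-MASQ-AGENT chain already exists in the nat table.
package main

import (
	"log"
	"time"

	"github.com/coreos/go-iptables/iptables"
)

func ensureMasqRule() error {
	ipt, err := iptables.NewWithProtocol(iptables.ProtocolIPv6)
	if err != nil {
		return err
	}

	// Same rule the kindnet panic is checking for.
	rule := []string{
		"-j", "MASQUERADE",
		"-m", "comment", "--comment",
		"kind-masq-agent: outbound traffic is subject to MASQUERADE (must be last in chain)",
	}

	backoff := time.Second
	var lastErr error
	for attempt := 1; attempt <= 5; attempt++ {
		// AppendUnique checks for the rule (ip6tables -C) and appends it
		// only if it is missing, so it is safe to call repeatedly.
		if lastErr = ipt.AppendUnique("nat", "KIND-MASQ-AGENT", rule...); lastErr == nil {
			return nil
		}
		log.Printf("attempt %d failed: %v; retrying in %v", attempt, lastErr, backoff)
		time.Sleep(backoff)
		backoff *= 2 // 1s, 2s, 4s, 8s... leaves room for the lock holder to finish
	}
	return lastErr
}

func main() {
	if err := ensureMasqRule(); err != nil {
		log.Fatalf("could not ensure IPv6 masquerade rule: %v", err)
	}
}
```

Backing off between attempts gives whatever is holding the lock (kube-proxy, in this case) time to finish its sync, instead of exhausting all retries within a couple of seconds.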

@droslean
Member

bc25814...6a5f0e6 seems to fix the problem.

@aojea
Member

aojea commented Oct 18, 2019

@droslean I can't see the relationship between that diff and the problem of iptables holding the lock; can you expand a bit more?

@droslean
Member

We noticed that after those commits, the problem was resolved. For more details about the job, please check https://k8s-testgrid.appspot.com/sig-release-master-blocking#kind-ipv6-master-parallel

@aojea
Member

aojea commented Oct 18, 2019

Ahh well, that's a good observation, but not necessarily a definitive indication; there were longer periods without any failure before :-)

As you can see in Ben's and earlier comments in this thread (#83903 (comment)), this is something we have been monitoring closely for a long time, and we are aware that there are different things creating flakiness to a greater or lesser degree.

We have identified some of them and merged patches in kind and test-infra that are clearly improving the situation.

Anyway, I think you are right about stability, and we can close this issue.

/close

@k8s-ci-robot
Contributor

@aojea: Closing this issue.

In response to this:

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@BenTheElder
Member

W1001 01:12:24.391] ERROR: failed to create cluster: failed to init node with kubeadm: command "docker exec --privileged kind-control-plane kubeadm init --ignore-preflight-errors=all --config=/kind/kubeadm.conf --skip-token-print --v=6" failed with error: signal: broken pipe

TL;DR: pretty sure this is permanently fixed now, FWIW; we had fun tracking this down 😅
