[iOS] Infrastructure is failing to report successful test run #11683

jkotas · 2022-11-20T22:40:53Z

Build

https://dev.azure.com/dnceng-public/cbb18261-c48f-4abb-8651-8cdcb5474649/_build/results?buildId=89051

Build leg reported

System.Runtime.Tests.WorkItemExecution

Pull Request

dotnet/runtime#78288

Action required for the engineering services team

To triage this issue (First Responder / @dotnet/dnceng):

Open the failing build above and investigate
Add a comment explaining your findings

If this is an issue that is causing build breaks across multiple builds and would get benefit from being listed on the build analysis check, follow the next steps:

Add the label "Known Build Error"
Edit this issue and add an error string in the Json below that can help us match this issue with future build breaks. You should use the known issues documentation

{
   "ErrorMessage" : "[TCP tunnel] Xamarin.Hosting: Failed to connect to port",
   "BuildRetry": true
}

Additional information about the issue reported

System.Runtime.Tests are reported as failing. It looks like a failure in the reporting infrastructure. System.Runtime.Tests succeeded according to the net.dot.System.Runtime.Tests.log file:

Tests run: 50259 Passed: 50025 Inconclusive: 0 Failed: 0 Ignored: 133 Skipped: 101
Killing process 95249 as it was cancelled
[TerminateWithSuccess]

Report

Build	Definition	Test	Pull Request
89138	dotnet/runtime	System.Runtime.Tests.WorkItemExecution	dotnet/runtime#78593

Summary

24-Hour Hit Count	7-Day Hit Count	1-Month Count
1	1	1

The text was updated successfully, but these errors were encountered:

jkotas · 2022-11-21T00:57:53Z

Another hit in dotnet/runtime#78593

premun · 2022-11-21T10:30:19Z

Analysis

I can see that the TCP tunnel between XHarness and the device failed:

[12:46:55] dbug: [TCP tunnel] Xamarin.Hosting: Attempting USB tunnel between the port 56661 on the device and the port 56661 (21981) on the mac: 61
[12:46:55] dbug: [TCP tunnel] Xamarin.Hosting: Failed to connect to port 56661 (21981) on device (error: 61)

When this happens, the app just dumps the testResults.xml into its stdout instead the TCP connection (the stdout is then in the net.dot.System.Runtime.Tests.log log). So while the app did perform the actual unit tests fine, the run was considered an APP_CRASH because the app never talked back.

These TCP problems happen every now and then and this is the issue for them: dotnet/xharness#934
We haven't had the time to handle the TCP flakiness yet. It is unclear whether the flakiness comes from MacOS, mlaunch or the runtime/networking stack.

Mitigation proposal

I can see that we have turned retries off for Apple devices due to a small amount of HW but maybe we can turn them on? We now have many more iPhones in the osx.1200.amd64.iphone.open queue. Maybe we can move onto there and turn on retries? This sounds like we should do it anyway? @steveisok what do you think?

Solution (?)

Proper solution is to figure out why the TCP connection fails but that will be costly.
We can also make the TCP tunnel logic smarter so that it understands the connection is flopping.
We can also consider getting the testResults.xml from stdout always but that will be quite problematic because we get the logs through a quite hacky way. All solutions seem quite costly.

premun · 2022-11-21T13:54:00Z

I don't think we should create "known issue" for this as it might cover potential problems. I think enabling retries is the best next step. I will add open a PR for that. I checked the telemetry and there are not many APP_CRASH results (just 3 for iPhones in the last week or so).

Regardless.. seems like we are not using the new iPhones though - there are 50 iPhones with no work - we should also enable tests for iPhones too probably, not just AppleTVs?

steveisok · 2022-11-21T15:15:19Z

I don't think we should create "known issue" for this as it might cover potential problems. I think enabling retries is the best next step. I will add open a PR for that. I checked the telemetry and there are not many APP_CRASH results (just 3 for iPhones in the last week or so).

Regardless.. seems like we are not using the new iPhones though - there are 50 iPhones with no work - we should also enable tests for iPhones too probably, not just AppleTVs?

We can enable retries and see if they help. I thought we added iphones to the rolling build. I'll definitely do that if it's not currently.

premun · 2022-11-21T15:34:31Z

We have agreed that we could make XHarness recognize TCP issues and return some extra exit code and then have a known build error for that case.

jkotas · 2022-11-21T16:24:04Z

We have agreed that we could make XHarness recognize TCP issues and return some extra exit code and then have a known build error for that case.

+1

I believe that the logic you have added in #11689 is going to hide intermittent test crashes that is bad for getting a signal about the product reliability. Can it be deleted as part of fixing this?

premun · 2022-11-21T16:49:24Z

I believe that the logic you have added in #11689 is going to hide intermittent test crashes that is bad for getting a signal about the product reliability. Can it be deleted as part of fixing this?

Yes, this is a good point. Let's start distinguishing TCP-caused failures and let the other blow up.
(there have been 12 APP_CRASH results in the last 3 days btw)

I will include this in the issue that I will open for the TCP exit code.

premun · 2022-11-21T17:30:47Z

Removing Known Build Error as this log line might also appear on runs with unrelated issues.

premun · 2022-11-22T14:20:31Z

I have logged #11700 with enough details for someone to try to fix this

premun · 2023-01-23T09:29:08Z

Fixes for this were added through #11700. Keeping this open and will monitor the telemetry through out the week.

premun · 2023-01-27T10:13:45Z

Seems like the new fixes are already working! There are the TCP problems being detected:

But what is more important is that the retries happening for these cases now dropped the overall failure rate of AppleTV jobs to 0:

cc @AlitzelMendez @ilyas1974

steveisok · 2023-02-16T15:43:59Z

@premun it looks like your retry logic did mitigate a bunch, but there are still some getting through.

kotlarmilos · 2023-02-16T16:02:42Z

@premun I've been hitting it for a coupe of days in dotnet/runtime#81319. I will test the CI from a dummy PR with the changes and make sure it is not related.

premun · 2023-02-17T12:14:23Z

@kotlarmilos seems like the AzDO builds of your PR got deleted as a new ones got queued by new commits. I tried to find a failing leg in the commit history of your PR but no luck. Could you point me to one please? Can be just a failing Helix job.

premun · 2023-02-17T12:58:01Z

I see a large increase in iOS device failures in the last 10 days:

steveisok · 2023-02-17T13:06:26Z

@premun here's an example build

https://github.com/dotnet/runtime/runs/11396687519

kotlarmilos · 2023-02-17T14:13:00Z

@premun @steveisok AppleApp.targets had an issue not including an additional assembly during the AOT compilation, causing a runtime initialization failure on a device. It seems that when a runtime initialization fail the helix reports the TCP tunnel error, instead of an error log.

If the increase in iOS device failures are related to dotnet/runtime#81319 only, this issue could be closed.

Thanks for the support!

steveisok · 2023-02-17T14:15:52Z

@kotlarmilos there are still issues with the tcp tunnel even though your PR may have contributed to the spike of failures.

premun · 2023-03-31T12:54:56Z

Okay, so we've now got a TCP_CONNECTION_FAILED categorized separately and it's the largest amount of errors happening. Maybe we should now mark this as known test issue and ignore those?

steveisok · 2023-03-31T13:00:24Z

@premun I agree. However, I am noticing an odd pattern if you look at recent builds. When you dive into this log, notice how many retries there are:

https://helixre107v0xdeko0k025g8.blob.core.windows.net/dotnet-runtime-refs-pull-83380-merge-a5362ecb5f8d43b98d/System.Text.Json.Tests.Attempt.3/1/console.774bab3d.log?helixlogtype=result

And if you look at the net.dot from any of the failures, they'll appear to be running and abruptly cut off as a "runtime crash". I don't think that's accurate and it appears that helix is cutting the run short.

https://dev.azure.com/dnceng-public/public/_build/results?buildId=218047&view=ms.vss-test-web.build-test-results-tab&runId=4079086&resultId=156355&paneView=dotnet-dnceng.dnceng-build-release-tasks.helix-test-information-tab

premun · 2023-03-31T13:30:53Z

When you dive into this log, notice how many retries there are

Which retries do you mean?

Overall I am seeing an improvement but I understand we're now moving queues and might be running a reduced number of workloads?

premun · 2023-03-31T13:32:13Z

I think the crashes are expected - the app never connects over TCP so starts running without it but XHarness hits the launch timeout and kills the app because it doesn't know what's going on.

[05:56:42.1359790] 2023-03-27 02:56:42.237 System.Collections.Tests[30144:5229512] 	[PASS] System.Collections.Generic.Tests.UIntPtrComparersTests.Comparer_ComparerCreate
[05:56:42.1382500] 2023-03-27 02:56:42.239 System.Collections.Tests[30144:5229512] 	[PASS] System.Collections.Generic.Tests.UIntPtrComparersTests.EqualityComparer_EqualityComparerDefault
[05:56:42.1552990] Xamarin.Hosting: Process '30144' exited with exit code 1 or crashing signal .
[05:56:42.1557690] Application 'net.dot.System.Collections.Tests' terminated (with exit code '1' and/or crashing signal ').

kotlarmilos · 2023-04-11T18:33:37Z

I think the crashes are expected - the app never connects over TCP so starts running without it but XHarness hits the launch timeout and kills the app because it doesn't know what's going on.

In the build https://github.com/dotnet/runtime/pull/79169/checks?check_run_id=12668103836 the tests failed due to a problem with establishing a TCP connection. As a result, XHarness terminated the tests execution.

In the build https://github.com/dotnet/runtime/pull/84304/checks?check_run_id=12661274730 the tests may have failed during the startup due to missing runtime invoke wrappers. Also, it may have failed due to a problem with establishing a TCP connection. As a result, the issue was reported as a problem with establishing a TCP connection. Since the issue is not reproducible locally, it is currently unclear what the next steps should be.

@steveisok @premun I am considering some ideas for improving the overall experience. Here are some thoughts:

I don't understand why the TCP connection seems to be unstable
XHarness could print a stack trace and error message when an error occurs for better diagnostics
It might be possible that multiple TCP connection attempts being made within a short period of time could overload a device
If an app never connects over TCP, XHarness shouldn't execute tests to avoid timeouts

premun · 2023-04-12T09:05:04Z

I don't understand why the TCP connection seems to be unstable
XHarness could print a stack trace and error message when an error occurs for better diagnostics
It might be possible that multiple TCP connection attempts being made within a short period of time could overload a device

It's actually not XHarness doing the TCP but mlaunch - a tool used by XHarness and VS to talk to Apple devices/simulators. We didn't use to have this many issues with TCP but it's possible something regressed between mlaunch and new MacOS versions.

If an app never connects over TCP, XHarness shouldn't execute tests to avoid timeouts

This happens in the TestRunner - https://github.com/dotnet/xharness/blob/389c851b0dc1d2c50d03e4aad000b7802d0ebed6/src/Microsoft.DotNet.XHarness.TestRunners.Common/iOSApplicationEntryPointBase.cs#L26

kotlarmilos · 2023-04-13T08:55:04Z

Thank you for the explanation. Here are some additional points that may be worth considering.

I don't understand why the TCP connection seems to be unstable

XHarness could print a stack trace and error message when an error occurs for better diagnostics

It might be possible that multiple TCP connection attempts being made within a short period of time could overload a device

It's actually not XHarness doing the TCP but mlaunch - a tool used by XHarness and VS to talk to Apple devices/simulators. We didn't use to have this many issues with TCP but it's possible something regressed between mlaunch and new MacOS versions.

I've seen a number of TCP error logs, even on successfully executed tests where mlaunch connects to a device after N attempts. What would be a good way to measure occurrence of the connection attempts on the CI? We may update the pattern in this issue to search for TCP error logs, but it may collect only failed jobs.

If an app never connects over TCP, XHarness shouldn't execute tests to avoid timeouts

This happens in the TestRunner - https://github.com/dotnet/xharness/blob/389c851b0dc1d2c50d03e4aad000b7802d0ebed6/src/Microsoft.DotNet.XHarness.TestRunners.Common/iOSApplicationEntryPointBase.cs#L26

Probably I didn't fully understand your comment at first. Are you saying that if a TCP connection is not established, XHarness will not launch the app but instead hit a timeout? According to the logs, it seems that the app had started, and during its execution a timeout occured.

I think the crashes are expected - the app never connects over TCP so starts running without it but XHarness hits the launch timeout and kills the app because it doesn't know what's going on.

[05:56:42.1359790] 2023-03-27 02:56:42.237 System.Collections.Tests[30144:5229512] 	[PASS] System.Collections.Generic.Tests.UIntPtrComparersTests.Comparer_ComparerCreate
[05:56:42.1382500] 2023-03-27 02:56:42.239 System.Collections.Tests[30144:5229512] 	[PASS] System.Collections.Generic.Tests.UIntPtrComparersTests.EqualityComparer_EqualityComparerDefault
[05:56:42.1552990] Xamarin.Hosting: Process '30144' exited with exit code 1 or crashing signal .
[05:56:42.1557690] Application 'net.dot.System.Collections.Tests' terminated (with exit code '1' and/or crashing signal ').

Based on the failures discovered in dotnet/runtime#84304, it was observed that the error log of an application that might fails during startup is being masked by TCP connection failures. To proceed working on such improvements and improve overall CI test execution on ios platforms, I recommend considering the following action points:

Create a sample PR that intentionally breaks the test application during startup, and update the CI to display the corresponding error logs instead of TCP failure
Investigate why a TCP connection over mlaunch is unable to connect to a device on the first attempt in most cases; current assumption that it rarely/never connects on the first attempt
Confirm that the XHarness don't to run an application if a TCP connection failure occurs

Do you have any other ideas on what could be improved or do you think some of the action points listed are unnecessary?

I suggest updating the tracking issue dotnet/runtime#84254 with some of the above-mentioned action points.

premun · 2023-04-13T09:13:11Z

Probably I didn't fully understand your comment at first. Are you saying that if a TCP connection is not established, XHarness will not launch the app but instead hit a timeout? According to the logs, it seems that the app had started, and during its execution a timeout occured.

It happens in this order:

XHarness starts, checks the app
Finds a device (via mlaunch)
Installs app on the device (via mlaunch)
In the background, starts another instance of mlaunch to create the TCP tunnel between MacOS and the device
Starts a TCP listener on the port it gave mlaunch in the previous step
Waits for a signal log line from the TCP mlaunch (step 4) and when it says "TCP tunnel waiting for connections", proceeds next.
Starts the app on the phone (also mlaunch)

Now.. XHarness doesn't have visibility into the device so it doesn't know if the app started well. The app can now crash for instance so it never connects. The app also might fail to connect to the TCP tunnel. For XHarness it's all the same. If it doesn't connect in a specific time (argument --launch-timeout), it times out.

Now from the point of view of the app, the app can start and as argument (or envvar) it receives a port where the tunnel should be. If it can't connect there, it will log in it's stdout instead of the TCP tunnel (this is the TestRunner link I sent).

There's also this diagram for this - #11700

To me, it seems, that in the last few months, mlaunch opens the tunnel but then has issues keeping it open. If you check the logs, it flops from "awaiting connection on port XY" to "failed to open the tunnel" again. We try to detect this state with a recent change and categorize this as TCP failure.

Create a sample PR that intentionally breaks the test application during startup, and update the CI to display the corresponding > error logs instead of TCP failure

When making the last change, we did give the app a wrong port purposefully to test the TCP tunnel not being there.

Confirm that the XHarness don't to run an application if a TCP connection failure occurs

The problem is that the TCP tunnel starts in a while and then things can work fine. Example:
https://helixre107v0xdeko0k025g8.blob.core.windows.net/dotnet-arcade-refs-heads-release-60-ac03203fd7924eddad/zipped-apps/1/console.3dbd4769.log?helixlogtype=result

These are the main things:


### APP INSTALLATION

[15:46:40] dbug: Running /private/tmp/helix/working/B2D609CE/p/microsoft.dotnet.xharness.cli/1.0.0-prerelease.22327.1/tools/net6.0/any/../../../runtimes/any/native/mlaunch/bin/mlaunch --sdkroot /Applications/Xcode124.app --list-crash-reports=/tmp/helix/working/B2D609CE/t/tmpcE9ZNU.tmp --devname DNCENGTVOS-026 -v -v -v -v -v
[15:46:40] dbug: Using Xcode 12.4 found in /Applications/Xcode124.app
[15:46:40] dbug: Xamarin.Hosting: Device discovery started
[15:46:40] dbug: Xamarin.Hosting: Device discovery event: Connected (fff44:c6:5d:7e:84:2cfff)
[15:46:40] dbug: Xamarin.Hosting: Device discovery event: Connected (fff44:c6:5d:82:ac:e7fff)
[15:46:40] dbug: Xamarin.Hosting: Connected to Vsm-xam-appletv-02 (fff44:c6:5d:7e:84:2cfff) in 00:00:00.0187767
[15:46:40] dbug: Xamarin.Hosting: Connected to Vsm-xam-appletv-01 (fff44:c6:5d:82:ac:e7fff) in 00:00:00.0178724
[15:46:40] dbug: Xamarin.Hosting: Device discovery event: Connected (93c1ae4069a5601dc0d7e8725f9848815c5f3f7b)
[15:46:40] dbug: Xamarin.Hosting: Connected to DNCENGTVOS-026 (93c1ae4069a5601dc0d7e8725f9848815c5f3f7b) in 00:00:00.0014046
[15:46:41] dbug: Process mlaunch exited with 0

### STARTING TCP TUNNEL

[15:46:41] dbug: Starting TCP tunnel between mac port: 55899 and device port 55899
[15:46:41] dbug: [TCP tunnel]
[15:46:41] dbug: [TCP tunnel] Running /private/tmp/helix/working/B2D609CE/p/microsoft.dotnet.xharness.cli/1.0.0-prerelease.22327.1/tools/net6.0/any/../../../runtimes/any/native/mlaunch/bin/mlaunch --sdkroot /Applications/Xcode124.app --tcp-tunnel=55899:55899 --devname 93c1ae4069a5601dc0d7e8725f9848815c5f3f7b -v -v -v -v -v
[15:46:41] dbug: [TCP tunnel] Using Xcode 12.4 found in /Applications/Xcode124.app
[15:46:41] dbug: [TCP tunnel] Xamarin.Hosting: Device discovery started
[15:46:41] dbug: [TCP tunnel] Xamarin.Hosting: Device discovery event: Connected (fff44:c6:5d:7e:84:2cfff)
[15:46:41] dbug: [TCP tunnel] Xamarin.Hosting: Device discovery event: Connected (fff44:c6:5d:82:ac:e7fff)
[15:46:41] dbug: [TCP tunnel] Xamarin.Hosting: Connected to Vsm-xam-appletv-02 (fff44:c6:5d:7e:84:2cfff) in 00:00:00.0186957
[15:46:41] dbug: [TCP tunnel] Xamarin.Hosting: Connected to Vsm-xam-appletv-01 (fff44:c6:5d:82:ac:e7fff) in 00:00:00.0155840
[15:46:41] dbug: [TCP tunnel] Xamarin.Hosting: Device discovery event: Connected (93c1ae4069a5601dc0d7e8725f9848815c5f3f7b)
[15:46:41] dbug: [TCP tunnel] Xamarin.Hosting: Connected to DNCENGTVOS-026 (93c1ae4069a5601dc0d7e8725f9848815c5f3f7b) in 00:00:00.0011462
[15:46:41] dbug: [TCP tunnel] Xamarin.Hosting: Creating USB tunnel between the port 55899 on the device and the port 55899 on the mac.
[15:46:41] dbug: [TCP tunnel] Xamarin.Hosting: Attempting USB tunnel between the port 55899 on the device and the port 55899 (23514) on the mac: 61

...

[15:46:45] dbug: [TCP tunnel] Xamarin.Hosting: Failed to connect to port 55899 (23514) on device (error: 61)
[15:46:45] dbug: [TCP tunnel] Xamarin.Hosting: Attempting USB tunnel between the port 55899 on the device and the port 55899 (23514) on the mac: 61
[15:46:45] dbug: [TCP tunnel] Xamarin.Hosting: Failed to connect to port 55899 (23514) on device (error: 61)
[15:46:46] dbug: [TCP tunnel] Xamarin.Hosting: Attempting USB tunnel between the port 55899 on the device and the port 55899 (23514) on the mac: 61
[15:46:46] dbug: [TCP tunnel] Xamarin.Hosting: Failed to connect to port 55899 (23514) on device (error: 61)
[15:46:46] dbug: [TCP tunnel] Xamarin.Hosting: Attempting USB tunnel between the port 55899 on the device and the port 55899 (23514) on the mac: 61
[15:46:46] dbug: [TCP tunnel] Xamarin.Hosting: Failed to connect to port 55899 (23514) on device (error: 61)
[15:46:46] dbug: [TCP tunnel] Xamarin.Hosting: Attempting USB tunnel between the port 55899 on the device and the port 55899 (23514) on the mac: 61
[15:46:46] dbug: [TCP tunnel] Xamarin.Hosting: Failed to connect to port 55899 (23514) on device (error: 61)
[15:46:46] dbug: [TCP tunnel] Xamarin.Hosting: Attempting USB tunnel between the port 55899 on the device and the port 55899 (23514) on the mac: 61
[15:46:46] dbug: [TCP tunnel] Xamarin.Hosting: Failed to connect to port 55899 (23514) on device (error: 61)
[15:46:46] dbug: [TCP tunnel] Xamarin.Hosting: Attempting USB tunnel between the port 55899 on the device and the port 55899 (23514) on the mac: 61
[15:46:46] dbug: [TCP tunnel] Xamarin.Hosting: Failed to connect to port 55899 (23514) on device (error: 61)
[15:46:46] dbug: [TCP tunnel] Please connect the device 'Tcp tunnel started on device port 55899 to host port 55899 for device 93c1ae4069a5601dc0d7e8725f9848815c5f3f7b'...
[15:46:46] dbug: TCP tunnel created on port 55899

### WE SAW THAT TCP TUNNEL STARTED WELL
### STARTING THE APP


[15:46:46] dbug: Starting the application
[15:46:46] dbug: 
[15:46:46] dbug: Running /private/tmp/helix/working/B2D609CE/p/microsoft.dotnet.xharness.cli/1.0.0-prerelease.22327.1/tools/net6.0/any/../../../runtimes/any/native/mlaunch/bin/mlaunch --sdkroot /Applications/Xcode124.app -setenv=NUNIT_AUTOEXIT=true -setenv=NUNIT_HOSTPORT=55899 -setenv=NUNIT_ENABLE_XML_OUTPUT=true -setenv=NUNIT_XML_VERSION=xUnit -setenv=NUNIT_HOSTNAME=10.105.153.129,169.254.177.29,fe80::1%1,fe80::14ee:7dbb:3d47:cae1%4,2001:4898:9008:60:10a3:9ee6:c47a:ea01,2001:4898:9008:60:6131:89e0:ab6b:c9ac,2001:4898:9008:60:6071:4e8:3e9b:5ec1,2001:4898:9008:60:f449:e1d5:b345:31d6,2001:4898:9008:60:5dee:4dd2:58f:174,2001:4898:9008:60:bd42:319b:cd83:da0d,2001:4898:9008:60:4535:ba78:97ae:15cb,fe80::aede:48ff:fe00:1122%5,fe80::4f7:69d5:88c2:9f52%6,fe80::dcaf:aaff:fe16:42a1%10,fe80::dcaf:aaff:fe16:42a1%11,fe80::51a3:d95b:3171:8da0%17,fe80::a003:cdd2:b0b5:f493%18 --disable-memory-limits --devname 93c1ae4069a5601dc0d7e8725f9848815c5f3f7b -setenv=USE_TCP_TUNNEL=true --launchdevbundleid net.dot.System.Buffers.Tests --wait-for-exit -v -v -v -v -v


### NOW TCP TUNNEL BREAKS AGAIN

[15:46:47] dbug: [TCP tunnel] Xamarin.Hosting: Attempting USB tunnel between the port 55899 on the device and the port 55899 (23514) on the mac: 61
[15:46:47] dbug: [TCP tunnel] Xamarin.Hosting: Failed to connect to port 55899 (23514) on device (error: 61)
[15:46:47] dbug: [TCP tunnel] Xamarin.Hosting: Attempting USB tunnel between the port 55899 on the device and the port 55899 (23514) on the mac: 61
[15:46:47] dbug: [TCP tunnel] Xamarin.Hosting: Failed to connect to port 55899 (23514) on device (error: 61)
[15:46:47] dbug: [TCP tunnel] Xamarin.Hosting: Attempting USB tunnel between the port 55899 on the device and the port 55899 (23514) on the mac: 61
...

### NOW IT WORKS AGAIN AND APP CONNECTS ACTUALLY

[15:46:50] dbug: [TCP tunnel] Xamarin.Hosting: Attempting USB tunnel between the port 55899 on the device and the port 55899 (23514) on the mac: 0
[15:46:50] dbug: Test log server listening on: localhost:55899

###

[15:46:50] dbug: [TCP tunnel] Xamarin.Hosting: Created USB tunnel between the port 55899 on the device and the port 55899 on the mac.
[15:46:50] dbug: Connection from 127.0.0.1:55899 saving logs to /tmp/helix/working/B2D609CE/w/AA2A0964/uploads/test-tvos-device-20230411_154640.log
[15:46:50] dbug: Test execution started

### THE TEST FINISHES FINE AFTER THIS

Something must have changed in mlaunch or in MacOS but we didn't see this behaviour ~6 months ago and running device tests was actually super smooth. I am happy to meet in the office and we can talk about this in person as it's quite complicated.

kotlarmilos · 2023-04-13T09:54:05Z

@premun Good idea! Let's synchronize offline and then post an update here.

kotlarmilos · 2023-04-28T16:02:56Z

The regression that occurred at the beginning of February 2023 appears to have affected iOS and tvOS devices. According to the telemetry data, no change was found in the OS version on the host machines.

In the past month, there were 13 TCP failures on osx.1200.amd64.iphone.open, and 138 TCP failures on osx.1015.amd64.iphone.open. Additionally, there are four times more devices in osx.1200.amd64.iphone.open than in osx.1015.amd64.iphone.open. The findings for tvOS devices are different, with uniform distribution of the TCP failure and no pattern detected.

Queue osx.1200.amd64.iphone.open is used on rolling builds (fewer invocations), but more tests are executed. One solution could be to exclude the osx.1015.amd64.iphone.open queue to see if the error rate decreases.

Anyway, these findings suggest that there may be an issue with the OS/Xcode or runtime network stack. To investigate this further and potentially repro the issue, it may be helpful to run an iOS app using the same mlaunch flow that Helix is using to run it, and collect the output over TCP locally.

steveisok · 2023-04-28T16:24:38Z

Anyway, these findings suggest that there may be an issue with the OS/Xcode or runtime network stack. To investigate this further and potentially repro the issue, it may be helpful to run an iOS app using the same mlaunch flow that Helix is using to run it, and collect the output over TCP locally.

When this was done in the past, no one could reproduce the issue locally. @akoeplinger suspect's the tcp session is done over wifi as opposed to usb-max. I don't think that's something we've done locally and is worthy of a try.

kotlarmilos · 2023-04-28T17:02:18Z

Thank you. Do you think that the following is something worthy of a try?

Queue osx.1200.amd64.iphone.open is used on rolling builds (fewer invocations), but more tests are executed. One solution could be to exclude the osx.1015.amd64.iphone.open queue to see if the error rate decreases.

steveisok · 2023-04-28T17:36:49Z

Thank you. Do you think that the following is something worthy of a try?

We've tried variations of this with the same result, so no tbh. I think, as you pointed out, it's an issue w/ mlaunch/xcode/osx or something within runtime. I don't believe it's the latter.

ilyas1974 · 2023-05-25T14:04:12Z

@steveisok just wanted to see if there was any update on this item.

premun · 2023-06-19T13:24:20Z

This is now being tracked in dotnet/runtime#82637 as a known error

jkotas added Ops - First Responder Detected By - Customer Issue was reported by a customer Known Build Error labels Nov 20, 2022

premun self-assigned this Nov 21, 2022

This was referenced Nov 21, 2022

Enable retries for APP_CRASH results on iOS/AppleTV devices #11689

Merged

Re-enable retries for APP_CRASH results on iOS/AppleTV devices #11690

Closed

premun removed the Known Build Error label Nov 21, 2022

steveisok mentioned this issue Nov 21, 2022

iOS - APP_LAUNCH_TIMEOUT / WORKLOAD TIMED OUT #10820

Closed

premun mentioned this issue Dec 5, 2022

Handle faulty TCP connection between XHarness and Apple devices #11700

Closed

5 tasks

premun closed this as completed Jan 27, 2023

steveisok reopened this Feb 16, 2023

steveisok mentioned this issue Feb 16, 2023

[iOS] Disable additional sourcegenerator tests dotnet/runtime#82184

Merged

This was referenced Feb 21, 2023

Tracking issue for CI build failure "TCP_CONNECTION_FAILED" dotnet/runtime#82083

Closed

[mono][aot] Enable dedup on tvOS device + arm64 simulators dotnet/runtime#82403

Merged

Adding HelloiOS sample for NativeAOT dotnet/runtime#82249

Merged

lewing mentioned this issue Feb 24, 2023

Failed USB connection via port 54050, error 61, in tvOS arm64 Release AllSubsets_Mono dotnet/runtime#82637

Open

premun mentioned this issue Feb 27, 2023

Categorize crash as TCP problem when connection never happens dotnet/xharness#999

Merged

vcsjones mentioned this issue Mar 3, 2023

Support raw ECDH key agreements dotnet/runtime#82551

Merged

kotlarmilos mentioned this issue Apr 13, 2023

[mono][aot] Further optimizations for dedup improvement dotnet/runtime#83973

Closed

3 tasks

premun assigned ilyas1974 and unassigned premun May 5, 2023

This was referenced May 29, 2023

Add PipeOptions.FirstPipeInstance enum value dotnet/runtime#83936

Merged

Read StartDiskNumber as uint32 dotnet/runtime#83923

Merged

premun closed this as completed Jun 19, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[iOS] Infrastructure is failing to report successful test run #11683

[iOS] Infrastructure is failing to report successful test run #11683

jkotas commented Nov 20, 2022 •

edited by premun

Loading

jkotas commented Nov 21, 2022

premun commented Nov 21, 2022 •

edited

Loading

premun commented Nov 21, 2022

steveisok commented Nov 21, 2022

premun commented Nov 21, 2022

jkotas commented Nov 21, 2022 •

edited

Loading

premun commented Nov 21, 2022

premun commented Nov 21, 2022

premun commented Nov 22, 2022

premun commented Jan 23, 2023

premun commented Jan 27, 2023 •

edited

Loading

steveisok commented Feb 16, 2023

kotlarmilos commented Feb 16, 2023

premun commented Feb 17, 2023

premun commented Feb 17, 2023

steveisok commented Feb 17, 2023

kotlarmilos commented Feb 17, 2023

steveisok commented Feb 17, 2023

premun commented Mar 31, 2023

steveisok commented Mar 31, 2023

premun commented Mar 31, 2023

premun commented Mar 31, 2023

kotlarmilos commented Apr 11, 2023 •

edited

Loading

premun commented Apr 12, 2023

kotlarmilos commented Apr 13, 2023

premun commented Apr 13, 2023 •

edited

Loading

kotlarmilos commented Apr 13, 2023

kotlarmilos commented Apr 28, 2023

steveisok commented Apr 28, 2023

kotlarmilos commented Apr 28, 2023

steveisok commented Apr 28, 2023

ilyas1974 commented May 25, 2023

premun commented Jun 19, 2023

[iOS] Infrastructure is failing to report successful test run #11683

[iOS] Infrastructure is failing to report successful test run #11683

Comments

jkotas commented Nov 20, 2022 • edited by premun Loading

Build

Build leg reported

Pull Request

Action required for the engineering services team

Additional information about the issue reported

Report

Summary

jkotas commented Nov 21, 2022

premun commented Nov 21, 2022 • edited Loading

Analysis

Mitigation proposal

Solution (?)

premun commented Nov 21, 2022

steveisok commented Nov 21, 2022

premun commented Nov 21, 2022

jkotas commented Nov 21, 2022 • edited Loading

premun commented Nov 21, 2022

premun commented Nov 21, 2022

premun commented Nov 22, 2022

premun commented Jan 23, 2023

premun commented Jan 27, 2023 • edited Loading

steveisok commented Feb 16, 2023

kotlarmilos commented Feb 16, 2023

premun commented Feb 17, 2023

premun commented Feb 17, 2023

steveisok commented Feb 17, 2023

kotlarmilos commented Feb 17, 2023

steveisok commented Feb 17, 2023

premun commented Mar 31, 2023

steveisok commented Mar 31, 2023

premun commented Mar 31, 2023

premun commented Mar 31, 2023

kotlarmilos commented Apr 11, 2023 • edited Loading

premun commented Apr 12, 2023

kotlarmilos commented Apr 13, 2023

premun commented Apr 13, 2023 • edited Loading

kotlarmilos commented Apr 13, 2023

kotlarmilos commented Apr 28, 2023

steveisok commented Apr 28, 2023

kotlarmilos commented Apr 28, 2023

steveisok commented Apr 28, 2023

ilyas1974 commented May 25, 2023

premun commented Jun 19, 2023

jkotas commented Nov 20, 2022 •

edited by premun

Loading

premun commented Nov 21, 2022 •

edited

Loading

jkotas commented Nov 21, 2022 •

edited

Loading

premun commented Jan 27, 2023 •

edited

Loading

kotlarmilos commented Apr 11, 2023 •

edited

Loading

premun commented Apr 13, 2023 •

edited

Loading