ci: increase e2e startup polling delay #3483

marksvc · 2025-09-30T22:08:58Z

It bothers me to see all the error messages.
My workstation can start it in 15 s. GHA can start it in 30 s.

This change is

codecov · 2025-09-30T22:17:29Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 82.22%. Comparing base (771e64c) to head (466f7ce).
✅ All tests successful. No failed tests found.

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #3483      +/-   ##
==========================================
+ Coverage   82.21%   82.22%   +0.01%     
==========================================
  Files         615      615              
  Lines       37031    37031              
  Branches     6044     6063      +19     
==========================================
+ Hits        30444    30448       +4     
+ Misses       5703     5689      -14     
- Partials      884      894      +10

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

pmachapman

@pmachapman reviewed 1 of 1 files at r1, all commit messages.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @pmachapman)

src/SIL.XForge.Scripture/ClientApp/e2e/await-application-startup.mts line 7 at r1 (raw file):

const pollUrl = 'http://localhost:5000/projects';
const pollInterval = 17000;

Wouldn't it be better to poll every second, but reduce the error logging to only log those failures that occur after the 30 second mark? This will help reduce the error messages without a non-standard interval number (i.e. when some looks at 17000 in future, they won't realize it is based on your PC's and the GHA's performance).

Including an exponential backoff for the interval, i.e. 1000, 2000, 4000, 8000, 16000, is also a usual practice for polling non-responsive services.

Code quote:

const pollInterval = 17000;

Nateowami

Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @marksvc)

-- commits line 4 at r1:
I don't see what's wrong with the current approach. I can't think of a reason to run this locally, and usually if you're looking at it on GHA, it's because the e2e tests are broken, and we're going to have to re-run them. This will often add an unnecessary delay to running the tests, which we'd ideally like to optimize to run much faster.

marksvc · 2025-10-01T22:24:32Z

-- commits line 4 at r1:
One problem is that when I go look at e2e logs, I see

0:00 Startup check failed: error sending request for url (http://localhost:5000/projects): client error (Connect): tcp connect error: Connection refused (os error 111)

0:01 Startup check failed: error sending request for url (http://localhost:5000/projects): client error (Connect): tcp connect error: Connection refused (os error 111)

0:02 Startup check failed: error sending request for url (http://localhost:5000/projects): client error (Connect): tcp connect error: Connection refused (os error 111)

0:03 Startup check failed: error sending request for url (http://localhost:5000/projects): client error (Connect): tcp connect error: Connection refused (os error 111)

0:04 Startup check failed: error sending request for url (http://localhost:5000/projects): client error (Connect): tcp connect error: Connection refused (os error 111)
...

33 times. Once when I was still learning and checked the e2e logs early on I incorrectly thought the e2e tests never even successfully started because of all these messages. The little message "Startup check passed. Exiting." at the end is a squeak compared to all the errors :)

The other problem is seeing it on my workstation, which is more often. On my local computer I start stuff all the time: compile job, launch a program, start some tests. It will give an indication of starting up, and eventually it's going. But it doesn't say "Error" a dozen times before successfully starting up :-)

It would be an improvement if every second we printed a message saying "Still waiting" 33 times.
I like what Peter suggests: Try every second, but don't print the messages until enough time has passed. Then it could try every second for half a minute, but not say Error every time.

I can't think of a reason to run this locally,

I often run it locally. If the GHA job fails, I might be able to more accurately try what it is doing by running the same script.
Another reason I run it locally is that if I use e2e.mts, then

It does stuff in the GUI and sometimes I am concerned that I might have messed it up with my mouse or keyboard.
I need to be more coordinated with SF running locally in the background, since e2e.mts seems to test an already running SF.

I know the e2e.mts script has various configuration etc, but it's been very convenient to just run the bash script that handles it all.

usually if you're looking at it on GHA, it's because the e2e tests are broken, and we're going to have to re-run them

Can you clarify why this might mean the display of 33 error messages at the beginning of the test run isn't undesirable?

This will often add an unnecessary delay to running the tests, which we'd ideally like to optimize to run much faster.

Yes, I see what you're saying here.

Nateowami

Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @pmachapman)

-- commits line 4 at r1:
It seems like the issue has more to do with how the startup ping failures appear in the log, than their frequency or presence. Perhaps the message could be changed to just say it's not up yet. It's verbose because it's intended to run on the CI.

The canonical way of running the tests, defined in the README, is to run e2e.mts. pre_merge_ci.sh is intended as a wrapper around that for the sake of a CI. I'm not even sure if developers on Windows can run it.

It does stuff in the GUI and sometimes I am concerned that I might have messed it up with my mouse or keyboard.

This is a good argument for adding a new preset called headless that's identical to the default, except for being headless. I usually run them in headed mode because I want to be able to easily investigate failures, and headless mode makes it nearly impossible.

I need to be more coordinated with SF running locally in the background, since e2e.mts seems to test an already running SF.

I nearly always have SF running for development purposes, and shutting down the processes to start another is quite slow. Do you not have it running most of the time? I guess I've just assumed it would be difficult to work without SF running.

Can you clarify why this might mean the display of 33 error messages at the beginning of the test run isn't undesirable?

I see them as startup status messages, in a CI script that's intended to be as verbose as possible. Arguably maybe the error messages could be toned down to just say that it can't connect yet instead of the underlying error message.

src/SIL.XForge.Scripture/ClientApp/e2e/await-application-startup.mts line 7 at r1 (raw file):
I don't really want an exponential backoff, since this isn't about a service failure, but just waiting for a service to come up. And the cost of the requests is very low, since it's all on localhost.

reduce the error logging to only log those failures that occur after the 30 second

Taking more than 30 seconds wouldn't be particularly abnormal. In my opinion showing network failure messages after an arbitrary time period would make the problem of status messages looking like critical failures worse, since they normally wouldn't show up, so when they do show up it would deviate from what's normal.

marksvc

Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @pmachapman)

-- commits line 4 at r1:

I need to be more coordinated with SF running locally in the background, since e2e.mts seems to test an already running SF.

I nearly always have SF running for development purposes, and shutting down the processes to start another is quite slow. Do you not have it running most of the time? I guess I've just assumed it would be difficult to work without SF running.

I'm sorry, what I gave wasn't really a fair point, since using pre_merge_ci.sh has the same coordination issue, which is that I need to make sure SF is not running before I can run pre_merge_ci.sh, or it will have trouble.

I often have SF running. But depending on what I'm doing, I may or may not be keeping SF running or turned off.

maybe the error messages could be toned down to just say that it can't connect yet instead of the underlying error message.

What do you think of this change? On my workstation I see the following now, where a . is shown each second until a threshold:

E2ELauncher: Awaiting application startup before running tests
............
await-application-startup: 0:27 Startup check passed. Exiting.

If a more explicit "waiting" message would be preferable, we could instead do something like the following, if that would be preferable? (i.e. showing Waiting. each second until a threshold.)

E2ELauncher: Awaiting application startup before running tests
Waiting. Waiting. Waiting. Waiting. Waiting. Waiting. Waiting. Waiting. Waiting. Waiting. Waiting. Waiting. 
await-application-startup: 0:27 Startup check passed. Exiting.

src/SIL.XForge.Scripture/ClientApp/e2e/await-application-startup.mts line 7 at r1 (raw file):

In my opinion showing network failure messages after an arbitrary time period would make the problem of status messages looking like critical failures worse, since they normally wouldn't show up, so when they do show up it would deviate from what's normal.

Hmmm. I hear what you're saying. Perhaps another way to think about it is "now something in our job is taking longer than normal / than it used to."

pmachapman

@pmachapman reviewed 1 of 1 files at r2, all commit messages.
Reviewable status: all files reviewed, 2 unresolved discussions (waiting on @marksvc)

src/SIL.XForge.Scripture/ClientApp/e2e/await-application-startup.mts line 27 at r2 (raw file):

  const minutes = Math.floor(elapsed / 60_000);
  const seconds = Math.floor((elapsed % 60_000) / 1000);
  return `${minutes}:${seconds < 10 ? "0" + seconds : seconds}`;

NIT: You can use padStart:

return `${minutes}:${seconds.toString().padStart(2, "0")}`;

Code quote:

return `${minutes}:${seconds < 10 ? "0" + seconds : seconds}`;

marksvc

Reviewable status: all files reviewed, 2 unresolved discussions (waiting on @pmachapman)

src/SIL.XForge.Scripture/ClientApp/e2e/await-application-startup.mts line 27 at r2 (raw file):

Previously, pmachapman (Peter Chapman) wrote…

NIT: You can use padStart:

return `${minutes}:${seconds.toString().padStart(2, "0")}`;

Thank you. Done.

Nateowami

Reviewable status: 0 of 1 files reviewed, 3 unresolved discussions (waiting on @pmachapman)

src/SIL.XForge.Scripture/ClientApp/e2e/await-application-startup.mts line 9 at r2 (raw file):

const pollUrl = "http://localhost:5000/projects";
const pollInterval = 1000;
const quietPeriodSec = 50_000;

I don't see the point of having a quiet period, followed by a non-quiet period, when it eventually times out. Why not keep it quiet until the script gives up and times out? Also, I think the variable is misnamed.

pmachapman

@pmachapman reviewed 1 of 1 files at r3, all commit messages.
Reviewable status: all files reviewed, 2 unresolved discussions (waiting on @marksvc and @Nateowami)

Nateowami

Reviewable status: all files reviewed, 2 unresolved discussions (waiting on @marksvc)

-- commits line 4 at r1:

Previously, marksvc wrote…

I need to be more coordinated with SF running locally in the background, since e2e.mts seems to test an already running SF.

I nearly always have SF running for development purposes, and shutting down the processes to start another is quite slow. Do you not have it running most of the time? I guess I've just assumed it would be difficult to work without SF running.

I'm sorry, what I gave wasn't really a fair point, since using pre_merge_ci.sh has the same coordination issue, which is that I need to make sure SF is not running before I can run pre_merge_ci.sh, or it will have trouble.

I often have SF running. But depending on what I'm doing, I may or may not be keeping SF running or turned off.

maybe the error messages could be toned down to just say that it can't connect yet instead of the underlying error message.

What do you think of this change? On my workstation I see the following now, where a . is shown each second until a threshold:
E2ELauncher: Awaiting application startup before running tests
............
await-application-startup: 0:27 Startup check passed. Exiting.
If a more explicit "waiting" message would be preferable, we could instead do something like the following, if that would be preferable? (i.e. showing Waiting. each second until a threshold.)
E2ELauncher: Awaiting application startup before running tests
Waiting. Waiting. Waiting. Waiting. Waiting. Waiting. Waiting. Waiting. Waiting. Waiting. Waiting. Waiting. 
await-application-startup: 0:27 Startup check passed. Exiting.

How about adding ... after the Awaiting application startup before running tests so it's a tad clearer that it's an ongoing thing. Then I think adding one dot each second is fine.

This hides a number of error messages that are really indications that we need to continue waiting.

marksvc

Reviewable status: all files reviewed, 2 unresolved discussions (waiting on @Nateowami)

-- commits line 4 at r1:

Previously, Nateowami wrote…

How about adding ... after the Awaiting application startup before running tests so it's a tad clearer that it's an ongoing thing. Then I think adding one dot each second is fine.

How's this?

src/SIL.XForge.Scripture/ClientApp/e2e/await-application-startup.mts line 9 at r2 (raw file):

Previously, Nateowami wrote…

I don't see the point of having a quiet period, followed by a non-quiet period, when it eventually times out. Why not keep it quiet until the script gives up and times out? Also, I think the variable is misnamed.

Thank you, that would have been milliseconds, wouldn't it :).

How's this? The failure output looks like:

E2ELauncher: 2025-10-17T15:30:00-06:00 Starting.
E2ELauncher: Logging dotnet output to ./ClientApp/e2e/test_output/ci_e2e_test_results/dotnet.txt
E2ELauncher: Server started with PID 2300229
await-application-startup: Awaiting application startup before running tests ...........
........................................................................................
........................................................................................
........................................................................................
............................
await-application-startup: Error: error sending request for url (http://localhost:5099/projects): client error (Connect): tcp connect error: Connection refused (os error 111)
await-application-startup: Failed to start in 300000 milliseconds. Exiting.
E2ELauncher: Elapsed time: 5m 0s
E2ELauncher: Shutting down server with PID 2300229.
E2ELauncher: Server shut down.

The success output looks like:

E2ELauncher: 2025-10-17T16:51:54-06:00 Starting.
E2ELauncher: Logging dotnet output to ./ClientApp/e2e/test_output/ci_e2e_test_results/do
tnet.txt                                                                                
E2ELauncher: Server started with PID 2434441                             
await-application-startup: Awaiting application startup before running tests ...........
........
await-application-startup: 0:31 Startup check passed. Exiting.
[ "pre_merge_ci" ]
Using preset: pre_merge_ci

marksvc added the testing not required label Sep 30, 2025

marksvc marked this pull request as ready for review September 30, 2025 22:17

pmachapman force-pushed the task/e2e-poll-delay branch from 808c876 to d0a9b97 Compare September 30, 2025 22:28

pmachapman self-assigned this Sep 30, 2025

pmachapman self-requested a review September 30, 2025 22:29

pmachapman requested changes Sep 30, 2025

View reviewed changes

Nateowami reviewed Sep 30, 2025

View reviewed changes

Nateowami reviewed Oct 3, 2025

View reviewed changes

marksvc force-pushed the task/e2e-poll-delay branch from d0a9b97 to 7a48679 Compare October 14, 2025 19:26

marksvc requested a review from pmachapman October 14, 2025 19:26

marksvc commented Oct 14, 2025

View reviewed changes

pmachapman reviewed Oct 14, 2025

View reviewed changes

marksvc force-pushed the task/e2e-poll-delay branch from 7a48679 to 0ebda7e Compare October 15, 2025 17:08

marksvc requested a review from pmachapman October 15, 2025 17:08

marksvc commented Oct 15, 2025

View reviewed changes

Nateowami reviewed Oct 15, 2025

View reviewed changes

pmachapman reviewed Oct 15, 2025

View reviewed changes

Nateowami reviewed Oct 16, 2025

View reviewed changes

ci: increase e2e startup polling delay

466f7ce

This hides a number of error messages that are really indications that we need to continue waiting.

marksvc force-pushed the task/e2e-poll-delay branch from 0ebda7e to 466f7ce Compare October 17, 2025 23:03

marksvc requested a review from Nateowami October 17, 2025 23:03

marksvc commented Oct 17, 2025

View reviewed changes

Uh oh!

ci: increase e2e startup polling delay #3483

Are you sure you want to change the base?

ci: increase e2e startup polling delay #3483

Uh oh!

Conversation

marksvc commented Sep 30, 2025 • edited by irahopkinson Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

codecov bot commented Sep 30, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

pmachapman left a comment

Choose a reason for hiding this comment

Uh oh!

Nateowami left a comment

Choose a reason for hiding this comment

Uh oh!

marksvc commented Oct 1, 2025

Uh oh!

Nateowami left a comment

Choose a reason for hiding this comment

Uh oh!

marksvc left a comment

Choose a reason for hiding this comment

Uh oh!

pmachapman left a comment

Choose a reason for hiding this comment

Uh oh!

marksvc left a comment

Choose a reason for hiding this comment

Uh oh!

Nateowami left a comment

Choose a reason for hiding this comment

Uh oh!

pmachapman left a comment

Choose a reason for hiding this comment

Uh oh!

Nateowami left a comment

Choose a reason for hiding this comment

Uh oh!

marksvc left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

marksvc commented Sep 30, 2025 •

edited by irahopkinson

Loading

codecov bot commented Sep 30, 2025 •

edited

Loading