-
-
Notifications
You must be signed in to change notification settings - Fork 5
ci: increase e2e startup polling delay #3483
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## master #3483 +/- ##
==========================================
+ Coverage 82.21% 82.22% +0.01%
==========================================
Files 615 615
Lines 37031 37031
Branches 6044 6063 +19
==========================================
+ Hits 30444 30448 +4
+ Misses 5703 5689 -14
- Partials 884 894 +10 ☔ View full report in Codecov by Sentry. |
808c876
to
d0a9b97
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@pmachapman reviewed 1 of 1 files at r1, all commit messages.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @pmachapman)
src/SIL.XForge.Scripture/ClientApp/e2e/await-application-startup.mts
line 7 at r1 (raw file):
const pollUrl = 'http://localhost:5000/projects'; const pollInterval = 17000;
Wouldn't it be better to poll every second, but reduce the error logging to only log those failures that occur after the 30 second mark? This will help reduce the error messages without a non-standard interval number (i.e. when some looks at 17000 in future, they won't realize it is based on your PC's and the GHA's performance).
Including an exponential backoff for the interval, i.e. 1000, 2000, 4000, 8000, 16000, is also a usual practice for polling non-responsive services.
Code quote:
const pollInterval = 17000;
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @marksvc)
-- commits
line 4 at r1:
I don't see what's wrong with the current approach. I can't think of a reason to run this locally, and usually if you're looking at it on GHA, it's because the e2e tests are broken, and we're going to have to re-run them. This will often add an unnecessary delay to running the tests, which we'd ideally like to optimize to run much faster.
0:00 Startup check failed: error sending request for url (http://localhost:5000/projects): client error (Connect): tcp connect error: Connection refused (os error 111) 0:01 Startup check failed: error sending request for url (http://localhost:5000/projects): client error (Connect): tcp connect error: Connection refused (os error 111) 0:02 Startup check failed: error sending request for url (http://localhost:5000/projects): client error (Connect): tcp connect error: Connection refused (os error 111) 0:03 Startup check failed: error sending request for url (http://localhost:5000/projects): client error (Connect): tcp connect error: Connection refused (os error 111) 0:04 Startup check failed: error sending request for url (http://localhost:5000/projects): client error (Connect): tcp connect error: Connection refused (os error 111) 33 times. Once when I was still learning and checked the e2e logs early on I incorrectly thought the e2e tests never even successfully started because of all these messages. The little message "Startup check passed. Exiting." at the end is a squeak compared to all the errors :) The other problem is seeing it on my workstation, which is more often. On my local computer I start stuff all the time: compile job, launch a program, start some tests. It will give an indication of starting up, and eventually it's going. But it doesn't say "Error" a dozen times before successfully starting up :-) It would be an improvement if every second we printed a message saying "Still waiting" 33 times.
I often run it locally. If the GHA job fails, I might be able to more accurately try what it is doing by running the same script.
I know the e2e.mts script has various configuration etc, but it's been very convenient to just run the bash script that handles it all.
Can you clarify why this might mean the display of 33 error messages at the beginning of the test run isn't undesirable?
Yes, I see what you're saying here. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @pmachapman)
-- commits
line 4 at r1:
It seems like the issue has more to do with how the startup ping failures appear in the log, than their frequency or presence. Perhaps the message could be changed to just say it's not up yet. It's verbose because it's intended to run on the CI.
The canonical way of running the tests, defined in the README, is to run e2e.mts
. pre_merge_ci.sh
is intended as a wrapper around that for the sake of a CI. I'm not even sure if developers on Windows can run it.
- It does stuff in the GUI and sometimes I am concerned that I might have messed it up with my mouse or keyboard.
This is a good argument for adding a new preset called headless
that's identical to the default, except for being headless. I usually run them in headed mode because I want to be able to easily investigate failures, and headless mode makes it nearly impossible.
- I need to be more coordinated with SF running locally in the background, since e2e.mts seems to test an already running SF.
I nearly always have SF running for development purposes, and shutting down the processes to start another is quite slow. Do you not have it running most of the time? I guess I've just assumed it would be difficult to work without SF running.
Can you clarify why this might mean the display of 33 error messages at the beginning of the test run isn't undesirable?
I see them as startup status messages, in a CI script that's intended to be as verbose as possible. Arguably maybe the error messages could be toned down to just say that it can't connect yet instead of the underlying error message.
src/SIL.XForge.Scripture/ClientApp/e2e/await-application-startup.mts
line 7 at r1 (raw file):
I don't really want an exponential backoff, since this isn't about a service failure, but just waiting for a service to come up. And the cost of the requests is very low, since it's all on localhost.
reduce the error logging to only log those failures that occur after the 30 second
Taking more than 30 seconds wouldn't be particularly abnormal. In my opinion showing network failure messages after an arbitrary time period would make the problem of status messages looking like critical failures worse, since they normally wouldn't show up, so when they do show up it would deviate from what's normal.
d0a9b97
to
7a48679
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @pmachapman)
- I need to be more coordinated with SF running locally in the background, since e2e.mts seems to test an already running SF.
I nearly always have SF running for development purposes, and shutting down the processes to start another is quite slow. Do you not have it running most of the time? I guess I've just assumed it would be difficult to work without SF running.
I'm sorry, what I gave wasn't really a fair point, since using pre_merge_ci.sh
has the same coordination issue, which is that I need to make sure SF is not running before I can run pre_merge_ci.sh
, or it will have trouble.
I often have SF running. But depending on what I'm doing, I may or may not be keeping SF running or turned off.
maybe the error messages could be toned down to just say that it can't connect yet instead of the underlying error message.
What do you think of this change? On my workstation I see the following now, where a .
is shown each second until a threshold:
E2ELauncher: Awaiting application startup before running tests
............
await-application-startup: 0:27 Startup check passed. Exiting.
If a more explicit "waiting" message would be preferable, we could instead do something like the following, if that would be preferable? (i.e. showing Waiting.
each second until a threshold.)
E2ELauncher: Awaiting application startup before running tests
Waiting. Waiting. Waiting. Waiting. Waiting. Waiting. Waiting. Waiting. Waiting. Waiting. Waiting. Waiting.
await-application-startup: 0:27 Startup check passed. Exiting.
src/SIL.XForge.Scripture/ClientApp/e2e/await-application-startup.mts
line 7 at r1 (raw file):
In my opinion showing network failure messages after an arbitrary time period would make the problem of status messages looking like critical failures worse, since they normally wouldn't show up, so when they do show up it would deviate from what's normal.
Hmmm. I hear what you're saying. Perhaps another way to think about it is "now something in our job is taking longer than normal / than it used to."
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@pmachapman reviewed 1 of 1 files at r2, all commit messages.
Reviewable status: all files reviewed, 2 unresolved discussions (waiting on @marksvc)
src/SIL.XForge.Scripture/ClientApp/e2e/await-application-startup.mts
line 27 at r2 (raw file):
const minutes = Math.floor(elapsed / 60_000); const seconds = Math.floor((elapsed % 60_000) / 1000); return `${minutes}:${seconds < 10 ? "0" + seconds : seconds}`;
NIT: You can use padStart
:
return `${minutes}:${seconds.toString().padStart(2, "0")}`;
Code quote:
return `${minutes}:${seconds < 10 ? "0" + seconds : seconds}`;
7a48679
to
0ebda7e
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: all files reviewed, 2 unresolved discussions (waiting on @pmachapman)
src/SIL.XForge.Scripture/ClientApp/e2e/await-application-startup.mts
line 27 at r2 (raw file):
Previously, pmachapman (Peter Chapman) wrote…
NIT: You can use
padStart
:return `${minutes}:${seconds.toString().padStart(2, "0")}`;
Thank you. Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 0 of 1 files reviewed, 3 unresolved discussions (waiting on @pmachapman)
src/SIL.XForge.Scripture/ClientApp/e2e/await-application-startup.mts
line 9 at r2 (raw file):
const pollUrl = "http://localhost:5000/projects"; const pollInterval = 1000; const quietPeriodSec = 50_000;
I don't see the point of having a quiet period, followed by a non-quiet period, when it eventually times out. Why not keep it quiet until the script gives up and times out? Also, I think the variable is misnamed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@pmachapman reviewed 1 of 1 files at r3, all commit messages.
Reviewable status: all files reviewed, 2 unresolved discussions (waiting on @marksvc and @Nateowami)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: all files reviewed, 2 unresolved discussions (waiting on @marksvc)
Previously, marksvc wrote…
- I need to be more coordinated with SF running locally in the background, since e2e.mts seems to test an already running SF.
I nearly always have SF running for development purposes, and shutting down the processes to start another is quite slow. Do you not have it running most of the time? I guess I've just assumed it would be difficult to work without SF running.
I'm sorry, what I gave wasn't really a fair point, since using
pre_merge_ci.sh
has the same coordination issue, which is that I need to make sure SF is not running before I can runpre_merge_ci.sh
, or it will have trouble.I often have SF running. But depending on what I'm doing, I may or may not be keeping SF running or turned off.
maybe the error messages could be toned down to just say that it can't connect yet instead of the underlying error message.
What do you think of this change? On my workstation I see the following now, where a
.
is shown each second until a threshold:E2ELauncher: Awaiting application startup before running tests ............ await-application-startup: 0:27 Startup check passed. Exiting.If a more explicit "waiting" message would be preferable, we could instead do something like the following, if that would be preferable? (i.e. showing
Waiting.
each second until a threshold.)E2ELauncher: Awaiting application startup before running tests Waiting. Waiting. Waiting. Waiting. Waiting. Waiting. Waiting. Waiting. Waiting. Waiting. Waiting. Waiting. await-application-startup: 0:27 Startup check passed. Exiting.
How about adding ...
after the Awaiting application startup before running tests
so it's a tad clearer that it's an ongoing thing. Then I think adding one dot each second is fine.
This hides a number of error messages that are really indications that we need to continue waiting.
0ebda7e
to
466f7ce
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: all files reviewed, 2 unresolved discussions (waiting on @Nateowami)
Previously, Nateowami wrote…
How about adding
...
after theAwaiting application startup before running tests
so it's a tad clearer that it's an ongoing thing. Then I think adding one dot each second is fine.
How's this?
src/SIL.XForge.Scripture/ClientApp/e2e/await-application-startup.mts
line 9 at r2 (raw file):
Previously, Nateowami wrote…
I don't see the point of having a quiet period, followed by a non-quiet period, when it eventually times out. Why not keep it quiet until the script gives up and times out? Also, I think the variable is misnamed.
Thank you, that would have been milliseconds, wouldn't it :).
How's this? The failure output looks like:
E2ELauncher: 2025-10-17T15:30:00-06:00 Starting.
E2ELauncher: Logging dotnet output to ./ClientApp/e2e/test_output/ci_e2e_test_results/dotnet.txt
E2ELauncher: Server started with PID 2300229
await-application-startup: Awaiting application startup before running tests ...........
........................................................................................
........................................................................................
........................................................................................
............................
await-application-startup: Error: error sending request for url (http://localhost:5099/projects): client error (Connect): tcp connect error: Connection refused (os error 111)
await-application-startup: Failed to start in 300000 milliseconds. Exiting.
E2ELauncher: Elapsed time: 5m 0s
E2ELauncher: Shutting down server with PID 2300229.
E2ELauncher: Server shut down.
The success output looks like:
E2ELauncher: 2025-10-17T16:51:54-06:00 Starting.
E2ELauncher: Logging dotnet output to ./ClientApp/e2e/test_output/ci_e2e_test_results/do
tnet.txt
E2ELauncher: Server started with PID 2434441
await-application-startup: Awaiting application startup before running tests ...........
........
await-application-startup: 0:31 Startup check passed. Exiting.
[ "pre_merge_ci" ]
Using preset: pre_merge_ci
It bothers me to see all the error messages.
My workstation can start it in 15 s. GHA can start it in 30 s.
This change is