Complete inflight requests after SIGTERM #1508

Open
sparkacus opened this issue Feb 19, 2025 · 5 comments · May be fixed by #1529

@sparkacus

Unless I'm mistaken, when a SIGTERM is received, GoFr immediately exits, closing all connections.

Would it be possible to gracefully handle the shutdown?

Off the top of my head the following might be useful, especially for those of us who plan to scale up/down:

  • Ability to set a threshold (seconds) during which GoFr will still accept new connections after a SIGTERM
  • Ability to set a threshold (seconds) during which GoFr will continue to handle in-flight requests once the new-connection threshold (above) has elapsed

Traefik's lifecycle procedure might be a good example of how it could be handled: https://doc.traefik.io/traefik/routing/entrypoints/#lifecycle

@Umang01-hash Umang01-hash added the go Pull requests that update Go code label Feb 20, 2025
@Umang01-hash (Member)

@sparkacus GoFr already implements graceful shutdown; it was added in PR #837.

As for the ability to set thresholds, we can discuss that.

@sparkacus (Author) commented Feb 21, 2025

@Umang01-hash You're right, apologies.

I'm using the 'air' package for hot reloading, so CTRL+C resulted in the process being killed instantly; that's not GoFr's doing. Sorry.

I notice that the shutdown timeout is hardcoded to 30s. It might be useful to allow overriding that via a config, i.e. I might want to lower or raise the value. One example is AWS Spot Instances: from my understanding they are killed 30s after SIGTERM regardless, so we might want to lower the timeout to ensure GoFr can completely shut down gracefully, which is useful for observability purposes.

EDIT: I notice that GoFr does have a REQUEST_TIMEOUT config, which would likely resolve the example I provided (I'm certainly not advocating that requests should last longer than 30s :) but sometimes there are reasons).

Concerning the thresholds, they certainly would be useful, especially if GoFr is targeted directly by an LB (it might not be behind a Kubernetes Service, which would remove the target from the EndpointSlices). Ideally, when a SIGTERM is received the health check endpoint would immediately start failing, but continuing to accept new requests for a period gives the LB an opportunity to detect the failing endpoint and remove it as a target, while those new connections are still served.

@vikash (Contributor) commented Feb 24, 2025

The health check and liveness probe should fail after SIGTERM is received; even though the server is not yet closed, it should still be able to finish all in-flight requests.

SERVER_SHUTDOWN_GRACE_PERIOD (or something like it) could be added as well; the current 30-second default can remain the default so existing behaviour is unchanged.

@Umang01-hash Umang01-hash self-assigned this Feb 25, 2025
@Umang01-hash Umang01-hash linked a pull request Feb 26, 2025 that will close this issue
@sparkacus (Author) commented Feb 26, 2025

> health check and liveness probe should fail after SIGTERM is received, even if the server is not currently closed, it should be able to finish all in-flight requests.
>
> SERVER_SHUTDOWN_GRACE_PERIOD or something like this can be added as well - current default of 30 second can be the default to ensure no changes in existing behaviour even after the change.

Yes, having the ability to tweak the shutdown period could be very useful. It may be that 30 seconds is too long, so it will help us optimise based on various circumstances.

Concerning the healthcheck/liveness probe, you're right: at the point of SIGTERM the IP should be removed from the EndpointSlice. Kubernetes might have been a bad example because it handles these situations well.

Let's use another example: the GoFr service is hosted on an instance sitting behind a traditional LB, e.g. an AWS ALB. Those target groups rely on interval-based health checks, so if the GoFr service shuts down (let's assume not because the instance itself is being stopped), the ALB will still forward requests to a service that is now technically offline (in graceful-shutdown mode), and those requests will fail. The target is only removed once the health check thresholds have been reached.

If GoFr can still handle incoming requests during the graceful shutdown while returning a non-200 response to health checks, that should avoid request failures. It also gives the administrator peace of mind that they can stop the service without causing an outage (although, technically, if they knew they were going to stop the service they'd remove the target first... but still).
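The "fail health checks, keep serving" idea could be sketched in plain Go like this. The names are hypothetical and this is not GoFr's current behaviour (as noted later in the thread, GoFr stops accepting new connections on SIGTERM instead):

```go
package main

import (
	"net/http"
	"os"
	"os/signal"
	"sync/atomic"
	"syscall"
)

// terminating flips to true once SIGTERM arrives. Regular handlers keep
// serving during the drain window; only the health endpoint reports 503
// so the load balancer deregisters the target before traffic is cut.
var terminating atomic.Bool

// watchSIGTERM installs the signal handler that flips the flag.
func watchSIGTERM() {
	sig := make(chan os.Signal, 1)
	signal.Notify(sig, syscall.SIGTERM)
	go func() {
		<-sig
		terminating.Store(true)
	}()
}

// healthStatus returns the HTTP status the health endpoint should report.
func healthStatus() int {
	if terminating.Load() {
		return http.StatusServiceUnavailable // 503: tell the LB to drain us
	}
	return http.StatusOK // 200: healthy
}

func healthHandler(w http.ResponseWriter, r *http.Request) {
	w.WriteHeader(healthStatus())
}
```

The window between the first failing health check and the LB removing the target is exactly the period during which the service must keep accepting requests for this scheme to avoid failures.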

@Umang01-hash (Member) commented Feb 27, 2025

Hi @sparkacus,

The PR (#1529) introduces the new SERVER_SHUTDOWN_THRESHOLD config, which indeed adds flexibility to the graceful shutdown process.

Regarding the health check behavior, our current design follows the traditional approach: once SIGTERM is received, the server stops accepting new connections (leading to errors like ECONNREFUSED) while still processing inflight requests. This method clearly signals to load balancers that the instance is going out of service, and modern load balancers (e.g., AWS ALB) are typically tuned to remove targets quickly based on these immediate health check failures.

In AWS environments that use autoscaling groups, this behavior fits well into deployment strategies like blue-green or rolling deployments. Typically, a new instance with the updated version is launched, and once it’s healthy, the old instance is automatically deregistered and terminated. This process ensures smooth transitions without needing to keep the health check endpoint reachable during shutdown.

For those who might need an alternative, the recommended approach is to adjust the load balancer or deployment configuration to fit their specific requirements.

Thanks.

@gofr-dev gofr-dev deleted a comment from H-Zop Feb 27, 2025