-
Notifications
You must be signed in to change notification settings - Fork 308
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Complete inflight requests after SIGTERM #1508
Comments
@sparkacus GoFr has a graceful shutdown implemented, we added it using this PR: #837 And regarding ability to set thresholds we can discuss about it. |
@Umang01-hash You're right, apologies. I'm using the 'air' package for hot reloading, so a CTRL+C resulted in the process being killed instantly, but that's not gofrs doing.... sorry. I notice that the shutdown timeout is hardcoded to 30s - It might be useful to allow us to override that via a config i.e. I might want to lower or increase the value. An example could be when using SPOT instances on AWS, from my understanding they'll kill the instance regardless 30s after SIGTERM, so we might want to lower it to ensure gofr can completly shutdown gracefully, which is useful for observability purposes. EDIT: I notice that gofr does have a REQUEST_TIMEOUT config which would likely resolve the example I provided (I'm certainly not advocating that requests should last longer than 30s :), but sometimes there are reasons) Concerning the thresholds, it certainly would be useful, especially if gofr is targeted directly from an LB (It might not be a kubernetes service which will remove the target from the endpointslices). Ideally when a SIGTERM is recieved it would immdiately fail the healthcheck endpoint, but being able to handle new requests for a period of time gives the LB an opertunity to detect the failed endpoint and remove it as a target, but at least those new connections will be served. |
health check and liveness probe should fail after SIGTERM is received, even if the server is not currently closed, it should be able to finish all in-flight requests. SERVER_SHUTDOWN_GRACE_PERIOD or something like this can be added as well - current default of 30 second can be the default to ensure no changes in existing behaviour even after the change. |
Yes, having the ability to tweak the shutdown period could be very useful. It may be that 30 seconds is too long, so it will help us optimise based on various circumstances. Concerning the healthcheck/liveness probe, you're right, at the point of SIGTERM it should remove the IP from the endpoint slice. Kubernetes might have been a bad example because it handles these situations well. Lets use another example. The gofr service is say hosted on an instance that is sitting behind a traditional LB e.g. AWS ALB... those target groups are based on interval health checks, so if the gofr service shuts down (lets assume not due to the instance being stopped), the ALB will still forward the request to the service which is now technically offline (it's in the graceful shutdown mode) and thus the request will fail. That target will only be removed once whatever healthcheck thresholds have been reached. If gofr can still handle incoming requests during the graceful shutdown but return a non-200 response to the healthchecks, that should negate request failures. It also gives the administrator peace of mind that they can stop the service without cuasing an outage (although technically if they knew they were going to stop the service they'd remove the target first... but still) |
Hi @sparkacus, While the PR (#1529) introducing the new Regarding the health check behavior, our current design follows the traditional approach: once SIGTERM is received, the server stops accepting new connections (leading to errors like ECONNREFUSED) while still processing inflight requests. This method clearly signals to load balancers that the instance is going out of service, and modern load balancers (e.g., AWS ALB) are typically tuned to remove targets quickly based on these immediate health check failures. In AWS environments that use autoscaling groups, this behavior fits well into deployment strategies like blue-green or rolling deployments. Typically, a new instance with the updated version is launched, and once it’s healthy, the old instance is automatically deregistered and terminated. This process ensures smooth transitions without needing to keep the health check endpoint reachable during shutdown. For those who might need an alternative, the recommended approach is to adjust the load balancer or deployment configuration to fit their specific requirements. Thanks. |
Unless I'm mistaken, when a SIGTERM is recieved gofr immediately exits closing all connections.
Would it be possible to gracefully handle the shutdown?
Off the top of my head the following might be useful, especially for those of us who plan to scale up/down:
Traefik's lifecycle procedure might be a good example of how it could be handled: https://doc.traefik.io/traefik/routing/entrypoints/#lifecycle
The text was updated successfully, but these errors were encountered: