@gregfurman

Motivation

Changes

  • Adds a waitForRuntimeAPI helper function that checks the /ping endpoint of the runtime API server before allowing the INIT functionality to proceed.

@gregfurman gregfurman self-assigned this Oct 17, 2025
@gregfurman gregfurman added the bug Something isn't working label Oct 17, 2025
@gregfurman
Author

Context

tl;dr there's a race condition: the RIE can be initialised before the Lambda Runtime API server has finished starting. The solution is to block and poll the runtime server until we're certain it's up.

Problem

When building a Sandbox, the RIE automatically starts the runtime API server in the background.

  1. We create the sandbox via the SandboxBuilder.Create() method:

// initialize all flows and start runtime API
sandboxContext, internalStateFn := sandbox.Create()

  2. This then triggers the rapid.Server to be initialised and started (allowing the RIE to accept /invocation/next calls among others):

rapidCtx, internalStateFn, runtimeAPIAddr := rapid.Start(ctx, b.sandbox)

  3. After initialising, Start launches the runtime API server in a separate goroutine and returns. This, however, does not guarantee that the server is actually listening by the time Start returns.

go startRuntimeAPI(ctx, execCtx)

  4. Our Init call can then fail, since there's a chance the runtime server has not yet started:

InitHandler(sandbox.LambdaInvokeAPI(), GetEnvOrDie("AWS_LAMBDA_FUNCTION_VERSION"), int64(invokeTimeoutSeconds), bootstrap, lsOpts.AccountId) // TODO: replace this with a custom init

Solution

Instead of relying on the good graces of the scheduler to start the LambdaRuntimeAPI server before we try to initialise, we block after SandboxBuilder.Create() until the /ping handler (Hyrum's law proven correct once again) responds.

// To respect Hyrum's Law, keeping /ping API even though
// we no longer use it ourselves.
// http://www.hyrumslaw.com/
router.Get("/ping", handler.NewPingHandler().ServeHTTP)

@gregfurman gregfurman force-pushed the fix/race-condition-12680 branch from 3cac6da to 66e356c October 20, 2025 10:04
Member

@dfangl dfangl left a comment

Good work! I think this should resolve the issue, however I am a bit concerned about the possible increase in startup time. Let's discuss the timeout before merging!

Timeout: 5 * time.Second,
}

ticker := time.NewTicker(500 * time.Millisecond)
Member

Did you by any chance test how many retries we get here on average, and how this affects the average startup time? Does it make sense to reduce this to 10 - 50ms, for example? More pings, but less delay? The startup time should ideally be significantly less than 10ms anyway, right?

Author

@dfangl Fair points. I saw these on the order of milliseconds in my local runs (keeping in mind that I'm on ARM64 and don't have issues with this).

{"file":"lambda-runtime-init/lambda/rapi/server.go:108","func":"go.amzn.com/lambda/rapi.(*Server).Listen","level":"debug","msg":"Runtime API Server listening on 127.0.0.1:9001","time":"2025-10-17T19:35:58+02:00"}
{"file":"lambda-runtime-init/lambda/rapi/middleware/middleware.go:76","func":"go.amzn.com/lambda/rapi/middleware.AccessLogMiddleware.func1.1","level":"debug","msg":"API request - GET /2018-06-01/ping, Headers:map[Accept-Encoding:[gzip] User-Agent:[Go-http-client/1.1]]","time":"2025-10-17T19:35:59+02:00"}
{"file":"lambda-runtime-init/lambda/rapi/middleware/middleware.go:76","func":"go.amzn.com/lambda/rapi/middleware.AccessLogMiddleware.func1.1","level":"debug","msg":"API request - GET /2018-06-01/ping, Headers:map[Accept-Encoding:[gzip] User-Agent:[Go-http-client/1.1]]","time":"2025-10-17T19:35:59+02:00"}
{"file":"lambda-runtime-init/cmd/localstack/main.go:245","func":"main.main","level":"debug","msg":"Starting runtime init.","time":"2025-10-17T19:36:04+02:00"}

In any case, these checks are probably more conservative than necessary, so increasing the polling frequency and decreasing the timeout duration seems logical.

Otherwise, if we're trying to make this as fast as possible, so as to not delay startup times, we can also do a single check to see if the port is open (similar to LocalStack's is_port_open()) with some timeout of 5 seconds (or something equivalently short).

Member

I'm fine either way - I just think a 500ms delay is too much, especially since the first attempt will fail on quite a few systems.

Author

OK changed to 50ms! Happy to 🚢 ?

@gregfurman gregfurman requested a review from dfangl October 21, 2025 14:03
Member

@dfangl dfangl left a comment

LGTM! Thanks for increasing the poll rate!

@gregfurman gregfurman merged commit dc08737 into localstack Oct 22, 2025
1 check passed
@gregfurman gregfurman deleted the fix/race-condition-12680 branch October 22, 2025 08:48