When executing a run, we make an HTTP request to the client's endpoint URL and keep that request open until we get a response. Clients are expected to respond in the following situations:
1. The run experiences an uncaught exception
2. The run completes
3. io.wait() or io.yield() is called
4. There is an error inside of a Task callback
5. The serverless function times out
For 3, 4, and 5, we'll retry the run execution at a later time, using the cached outputs from completed tasks to prevent duplicate task executions, and attempt to bring the run to a completed state.
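The replay behaviour described above can be pictured with a small sketch. Everything here is hypothetical (the `TaskCache` type, `runTaskOnce`, and the in-memory `Map` standing in for task outputs stored on the server); it only illustrates why a retried run does not repeat a task whose output was already recorded.

```typescript
// Hypothetical in-memory stand-in for the task outputs stored on the server.
type TaskCache = Map<string, unknown>;

// Runs `work` only if no output is cached under `key`;
// otherwise returns the cached output without running `work` again.
function runTaskOnce<T>(cache: TaskCache, key: string, work: () => T): T {
  if (cache.has(key)) {
    return cache.get(key) as T;
  }
  const output = work();
  cache.set(key, output);
  return output;
}

// The same run body executed twice against one cache, as in a retry.
const cache: TaskCache = new Map();
let chargeCalls = 0;

function runBody(): string {
  return runTaskOnce(cache, "charge-customer", () => {
    chargeCalls += 1; // the side effect we want to happen once
    return "charged";
  });
}

runBody(); // first execution performs the work
runBody(); // retried execution replays the cached output
```

The guarantee only holds if the output reaches the cache. If execution stops between `work()` and `cache.set(...)`, the next attempt runs the task again.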
In every case other than a function timeout, we have complete control over the timing and can make pretty strong guarantees that tasks are executed exactly once.
Unfortunately, we don't control the point at which a function times out. If the function times out after a task is executed but before we're able to update the task status/output on the server, our task execution guarantee drops to at-least-once.
This can lead to situations where tasks are executed twice, and depending on the work being done in the task, cause unwanted behavior. We mitigate this situation with our integrations that support idempotency keys (e.g. our stripe.com integration) as well as providing idempotency keys when using io.runTask(), but very few integrations actually support idempotency keys.
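As a rough illustration of why idempotency keys help, here is a sketch of a service that deduplicates by key. The `createCharge` function, the key format, and the in-memory store are all made up for this example and are not the API of any real integration.

```typescript
// Hypothetical external service that remembers which keys it has already handled.
const processedCharges = new Map<string, string>();
let sideEffects = 0;

// Returns the id of the charge created for this key,
// creating it only the first time the key is seen.
function createCharge(idempotencyKey: string, amountCents: number): string {
  const existing = processedCharges.get(idempotencyKey);
  if (existing !== undefined) {
    return existing; // duplicate request: no second charge
  }
  sideEffects += 1;
  const chargeId = `charge_${sideEffects}_${amountCents}`;
  processedCharges.set(idempotencyKey, chargeId);
  return chargeId;
}

// A task that is executed twice sends the same key both times.
const firstAttempt = createCharge("run_123:charge-customer", 500);
const secondAttempt = createCharge("run_123:charge-customer", 500);
```

Both attempts resolve to the same charge, so a duplicate task execution has no extra effect. This only works when the service on the other end honours the key.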
This will be an ongoing area of improvement for a system like Trigger.dev, and there are multiple fronts where we can make improvements. One thing we can do in the short term, without too many changes or risks, is Auto Execution Yielding: yielding execution right before a function is going to time out.
Auto Execution Yielding
If we can detect the value of a function execution limit, we could provide that information when executing a run and yield execution before we reach the limit, in between task executions. This would make it much less likely that a task would be executed twice, because we'd almost never hit the function execution timeout. It would also be better from a DX standpoint: you would no longer get 504 timeout errors in your function execution logs (e.g. on Vercel), which add noise to monitoring.
Detecting function execution limit
Since Vercel does not provide the function execution limit in environment variables, we could pretty easily detect function execution limits by implementing "execution limit probes" that make a request to our endpoint and wait. We would time how long it takes to receive a 504 timeout error (probably with our own request timeout, to handle situations where the limit is really high). We'd do this periodically so that any changes to the function execution limit are detected. Once we have this data, we can save it to the Endpoint table and pass it down to the client as an optional piece of data in the run execution request.
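A minimal sketch of the probe timing, under a few assumptions: `sendProbe` is a hypothetical function that issues the probe request and either resolves with the HTTP status or gives up after `maxWaitMs`, and a probe only counts as a measurement when it ends in a 504. None of these names come from the actual codebase.

```typescript
// Hypothetical result of one probe request.
type ProbeOutcome =
  | { kind: "response"; status: number }
  | { kind: "gaveUp" }; // our own request timeout fired first

// Turns a probe outcome into a detected limit in milliseconds,
// or undefined when the probe tells us nothing about the limit.
function interpretProbe(outcome: ProbeOutcome, elapsedMs: number): number | undefined {
  if (outcome.kind === "gaveUp") {
    return undefined; // the limit is at least maxWaitMs, exact value unknown
  }
  return outcome.status === 504 ? elapsedMs : undefined;
}

// Times a single probe. `sendProbe` and `now` are injected so the
// measurement logic stays independent of any HTTP client.
async function probeExecutionLimit(
  sendProbe: (maxWaitMs: number) => Promise<ProbeOutcome>,
  maxWaitMs: number,
  now: () => number = Date.now
): Promise<number | undefined> {
  const startedAt = now();
  const outcome = await sendProbe(maxWaitMs);
  return interpretProbe(outcome, now() - startedAt);
}
```

The measured value includes network latency, so it slightly overestimates the limit; a consumer of this number would probably want to subtract a safety margin.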
Auto Yielding
Once the client gets the execution limit in a run execution request, it can measure the time elapsed since the function started executing. It can then yield execution if there isn't enough time left, checking at the following points:
Before a task is executed
After a task is executed, but before the server is updated with task output/status
After the server is updated with task output/status
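The three checkpoints above could be sketched roughly as follows. The names (`ExecutionBudget`, `shouldYield`, `runTaskWithCheckpoints`) and the safety margin are assumptions made for illustration. The text does not say how a pending output is handled when yielding at the second checkpoint, so this sketch simply hands it back to the caller.

```typescript
// Hypothetical description of how much time this function execution has.
interface ExecutionBudget {
  startedAtMs: number;    // when the function started executing
  limitMs: number;        // execution limit passed in the run execution request
  safetyMarginMs: number; // assumed buffer to stop short of the limit
}

type Checkpoint = "before_task" | "after_task_before_update" | "after_update";

type TaskStep<T> =
  | { yieldedAt: Checkpoint; output?: T }
  | { yieldedAt: undefined; output: T };

// True when the time left is within the safety margin.
function shouldYield(budget: ExecutionBudget, nowMs: number): boolean {
  const remainingMs = budget.startedAtMs + budget.limitMs - nowMs;
  return remainingMs <= budget.safetyMarginMs;
}

// Runs one task, checking the budget at each of the three points.
function runTaskWithCheckpoints<T>(
  budget: ExecutionBudget,
  now: () => number,
  task: () => T,
  saveOutput: (output: T) => void
): TaskStep<T> {
  if (shouldYield(budget, now())) {
    return { yieldedAt: "before_task" }; // task was never started
  }
  const output = task();
  if (shouldYield(budget, now())) {
    // Assumption: the caller is responsible for delivering this output.
    return { yieldedAt: "after_task_before_update", output };
  }
  saveOutput(output);
  if (shouldYield(budget, now())) {
    return { yieldedAt: "after_update", output };
  }
  return { yieldedAt: undefined, output };
}
```

A caller would stop executing further tasks as soon as `yieldedAt` is set and respond to the open request, so that the run is resumed in a fresh function execution.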
A further iteration of this feature would include historical measurement of task execution times and the ability to better predict when a task is likely to take longer than the available time left. But that can build off the initial work of this feature as a further improvement.
Temporary Workarounds
To work around this issue before this feature is available, you can use io.yield() to force the current function execution to exit and resume the run in a new function execution, picking up where it left off. You could place these calls at strategic points in your job run, using your knowledge of your own function timeout limit, to make it less likely that tasks are executed more than once.
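A sketch of what that placement might look like. The `IoLike` interface below is a simplified stand-in written for this example, not the real SDK type: it assumes io.runTask() takes a cache key and a callback and io.yield() takes a cache key, so check the SDK documentation for the actual signatures. `processBatches` and the key names are hypothetical.

```typescript
// Simplified stand-in for the `io` object; signatures are assumptions.
interface IoLike {
  runTask<T>(cacheKey: string, callback: () => Promise<T>): Promise<T>;
  yield(cacheKey: string): Promise<void>;
}

// Processes batches one task at a time and yields after each one,
// so every batch starts with a fresh function execution.
async function processBatches(io: IoLike, batches: string[][]): Promise<number> {
  let processed = 0;
  for (let i = 0; i < batches.length; i++) {
    const batch = batches[i];
    processed += await io.runTask(`process-batch-${i}`, async () => {
      // Real work (API calls, writes, ...) would go here.
      return batch.length;
    });
    // Strategic yield point: exit now rather than risk a timeout mid-task.
    await io.yield(`yield-after-batch-${i}`);
  }
  return processed;
}
```

Yielding after every task is the most conservative choice. If you know each task takes a few seconds and your limit is much higher, yielding every few tasks would cut down on the number of function executions.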