
fix: Slow cold start when workflows are inactive > 1 day #3830

Merged
juliusgeo merged 5 commits into main from fix_slow_inactive_workflow
May 6, 2026



@juliusgeo
Contributor

Description

When a workflow is triggered more than 1 day after its previous trigger, the call to l.lr.ListQueues in acquireQueueLeases does not pick up that workflow's queue lease. acquireQueueLeases runs on a 5-second polling interval, so a newly triggered workflow can wait up to 5 seconds before acquireQueueLeases is called again, picks up the queue lease, and notifies the tenant manager. Only then is addQueuer called in TenantManager.listenForQueueLeases, which actually creates the queue so that runOptimisticScheduling can pick it up. This PR changes the TenantManager logic so that, when a check-tenant-queue message is received, it no longer only wakes up the queues that already exist in memory; it also creates queues that do not yet exist in memory, which avoids the wait for the next acquireQueueLeases poll.
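The behavior change described above can be sketched roughly as follows. This is a simplified model, not the code from this PR: `tenantManager`, `queuer`, `handleCheckQueues`, and the assumption that the message carries a list of queue names are all hypothetical stand-ins used only to illustrate the "wake existing queues, create missing ones" idea.

```go
package main

import "sync"

// queuer is a hypothetical stand-in for an in-memory per-queue scheduler.
type queuer struct {
	name   string
	wakeCh chan struct{} // buffered so a wake-up never blocks the sender
}

func newQueuer(name string) *queuer {
	return &queuer{name: name, wakeCh: make(chan struct{}, 1)}
}

// wake signals the queuer; if a wake-up is already pending it is a no-op.
func (q *queuer) wake() {
	select {
	case q.wakeCh <- struct{}{}:
	default:
	}
}

// tenantManager is a simplified, hypothetical model of the TenantManager.
type tenantManager struct {
	mu      sync.Mutex
	queuers map[string]*queuer
}

func newTenantManager() *tenantManager {
	return &tenantManager{queuers: make(map[string]*queuer)}
}

// handleCheckQueues models handling of a check-tenant-queue message.
// Assumption: the message identifies queues by name. Queues missing from
// memory are created immediately instead of waiting for the lease poll;
// every named queue is then woken. It returns the names it created.
func (t *tenantManager) handleCheckQueues(queueNames []string) []string {
	t.mu.Lock()
	defer t.mu.Unlock()

	var created []string
	for _, name := range queueNames {
		q, ok := t.queuers[name]
		if !ok {
			q = newQueuer(name)
			t.queuers[name] = q
			created = append(created, name)
		}
		q.wake()
	}
	return created
}
```

In this model, a message naming one queue that is already in memory and one that is not results in a single newly created queuer, and both queuers end up with a pending wake-up. Handling the same message a second time creates nothing, so repeated messages are harmless.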

Fixes # (issue)

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • Documentation change (pure documentation change)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Refactor (non-breaking changes to code which doesn't change any behaviour)
  • CI (any automation pipeline changes)
  • Chore (changes which are not directly related to any business logic)
  • Test changes (add, refactor, improve or change a test)
  • This change requires a documentation update

What's Changed

  • Add a list of tasks or features here...

@vercel

vercel Bot commented May 5, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| hatchet-docs | Ready | Preview, Comment | May 6, 2026 2:41pm |


@github-actions
Contributor

github-actions Bot commented May 5, 2026

Benchmark results

goos: linux
goarch: amd64
pkg: github.com/hatchet-dev/hatchet/pkg/scheduling/v1
cpu: AMD Ryzen 9 7950X3D 16-Core Processor          
              │ /tmp/old.txt │         /tmp/new.txt         │
              │    sec/op    │   sec/op     vs base         │
RateLimiter-8   50.58µ ± 12%   49.92µ ± 6%  ~ (p=0.818 n=6)

              │ /tmp/old.txt │         /tmp/new.txt          │
              │     B/op     │     B/op      vs base         │
RateLimiter-8   137.7Ki ± 0%   137.7Ki ± 0%  ~ (p=0.797 n=6)

              │ /tmp/old.txt │          /tmp/new.txt          │
              │  allocs/op   │  allocs/op   vs base           │
RateLimiter-8    1.022k ± 0%   1.022k ± 0%  ~ (p=1.000 n=6) ¹
¹ all samples are equal

Compared against main (0911c11)

@juliusgeo juliusgeo marked this pull request as ready for review May 6, 2026 16:47
@juliusgeo juliusgeo requested review from abelanger5 and grutt May 6, 2026 16:47
@juliusgeo
Contributor Author

The failing e2e test also fails on main, so it is not related to these changes.

@juliusgeo juliusgeo merged commit ec41099 into main May 6, 2026
89 of 93 checks passed
@juliusgeo juliusgeo deleted the fix_slow_inactive_workflow branch May 6, 2026 17:51
