
Conversation

@andresgutgon
Contributor

We see spikes in both containers that end up killing them. We need to find out what's causing these problems.

What's in these changes?

  • Some frontend cache optimizations to avoid hitting the backend so frequently
  • Some batching when fetching spans
  • Cache optimizations to improve runs. My main theory is that the problem is something related to runs

Interesting (maybe)

  // Pagination is native to Redis
  const runUuids = await redis.zrevrange(
    sortedSetKey,
    start,  // (page - 1) * pageSize
    end     // start + pageSize - 1
  )
  // Only fetch data for this page
  const runs = await redis.hmget(dataHashKey, ...runUuids)
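Here's a fuller sketch of that read path, assuming ioredis, run payloads stored as JSON strings (as shown further down), and illustrative key and type names; ZCARD gives the total for the pager:

  import Redis from 'ioredis'

  const redis = new Redis()

  // Hypothetical shape; fields follow the stored examples below
  type Run = { uuid: string; queuedAt: string; source: string }

  async function listRunsPage(
    sortedSetKey: string,
    dataHashKey: string,
    page: number,
    pageSize: number,
  ): Promise<{ runs: Run[]; total: number }> {
    const start = (page - 1) * pageSize
    const end = start + pageSize - 1

    // Newest first; Redis does both the sorting and the slicing
    const runUuids = await redis.zrevrange(sortedSetKey, start, end)
    if (runUuids.length === 0) return { runs: [], total: 0 }

    // Total count for the pager, straight from the index
    const total = await redis.zcard(sortedSetKey)

    // HMGET returns null for missing fields, so filter before parsing
    const rows = await redis.hmget(dataHashKey, ...runUuids)
    const runs = rows
      .filter((row): row is string => row !== null)
      .map((row) => JSON.parse(row) as Run)

    return { runs, total }
  }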

Warning

Requires a migration, but provides the best scalability for production systems
with 1000+ concurrent runs.

📊 Performance Comparison

| Scenario | Original (HGETALL) | Hybrid (HSCAN) | Sorted Sets |
| --- | --- | --- | --- |
| Memory (1000 runs) | 10 MB | 2 MB | 0.05 MB/page |
| Memory (5000 runs) | 50 MB | 10 MB | 0.05 MB/page |
| Load time (1000 runs) | 2s | 1s | 0.1s |
| Pagination | In-memory | In-memory | Native Redis |
| Scales to | ~1K runs | ~5K runs | Millions |

What kind of migration does using ZREVRANGE require?

Using ZREVRANGE (and the Sorted Set approach in general) requires a data structure migration in Redis. Here's what needs to change:

Current Structure (Hash-based)

Single Redis Hash

  Key: runs:active:{workspaceId}:{projectId}
  Type: HASH
  Data: {
    "run-uuid-1": '{"uuid":"run-uuid-1","queuedAt":"2024-...","source":"API"}',
    "run-uuid-2": '{"uuid":"run-uuid-2","queuedAt":"2024-...","source":"Playground"}',
    ...
  }

Operations:
  • HSET to add a run
  • HDEL to remove a run
  • HGETALL or HSCAN to list all runs, then sort in-memory (sketch below)
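For comparison, here's a sketch of what the current read path amounts to (names are illustrative): everything is loaded and sorted in application memory, which is where the memory and latency numbers in the table above come from.

  // Current approach (sketch): O(total runs) memory on every request
  const all = await redis.hgetall(oldKey) // { uuid: json, ... }
  const pageOfRuns = Object.values(all)
    .map((json) => JSON.parse(json))
    .sort((a, b) => new Date(b.queuedAt).getTime() - new Date(a.queuedAt).getTime())
    .slice(start, start + pageSize) // pagination happens in memory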


New Structure (Sorted Set + Hash)

Two Redis data structures:

Sorted Set (Index)

  Key: runs:active:{workspaceId}:{projectId}:index
  Type: SORTED SET
  Data: {
    "run-uuid-1": 1732123456789,  // timestamp as score
    "run-uuid-2": 1732123456790,
    ...
  }

Hash (Full Data)

  Key: runs:active:{workspaceId}:{projectId}:data
  Type: HASH
  Data: {
    "run-uuid-1": '{"uuid":"run-uuid-1",...}',
    "run-uuid-2": '{"uuid":"run-uuid-2",...}',
    ...
  }

Operations:
  • ZADD + HSET to add a run (2 commands; see the atomic sketch below)
  • ZREM + HDEL to remove a run (2 commands)
  • ZREVRANGE to get paginated UUIDs (sorted by timestamp)
  • HMGET to fetch only the runs for the current page
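Since add and remove now touch two keys each, a MULTI keeps the index and the data hash from drifting apart. A sketch, with illustrative key names:

  // Add a run: index entry + full payload, applied atomically
  await redis
    .multi()
    .zadd(indexKey, Date.now(), run.uuid)
    .hset(dataKey, run.uuid, JSON.stringify(run))
    .exec()

  // Remove a run: drop it from both structures
  await redis.multi().zrem(indexKey, run.uuid).hdel(dataKey, run.uuid).exec()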


You need to update:

create.ts - where runs are added to the cache
  • FROM: redis.hset(key, runUuid, JSON.stringify(run))
  • TO: redis.zadd(indexKey, timestamp, runUuid) + redis.hset(dataKey, runUuid, JSON.stringify(run))

delete.ts - where runs are removed from the cache
  • FROM: redis.hdel(key, runUuid)
  • TO: redis.zrem(indexKey, runUuid) + redis.hdel(dataKey, runUuid)

update.ts - where run data is updated
  • FROM: redis.hset(key, runUuid, JSON.stringify(run))
  • TO: redis.hset(dataKey, runUuid, JSON.stringify(run)) (only update the hash; the score stays the same; see the note after this list)

listActive.ts - where runs are fetched
  • FROM: load everything with HGETALL/HSCAN, sort in-memory, slice for pagination
  • TO: ZREVRANGE for the page's UUIDs + HMGET for their data (no in-memory sorting)
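One nuance worth spelling out for update.ts: because the payload and the ordering live in separate structures, updates don't have to touch the index at all. A sketch:

  // update.ts (sketch): refresh the payload without re-scoring the index,
  // so the run keeps its position in the timestamp ordering
  await redis.hset(dataKey, run.uuid, JSON.stringify(run))

  // Only if the sort key itself changes (e.g. startedAt is set) re-score:
  // await redis.zadd(indexKey, new Date(run.startedAt).getTime(), run.uuid)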

Data Migration (Redis Layer)
You need a migration script to convert existing data:

  // For each workspace/project with active runs:
  const oldKey = `runs:active:{ws}:{proj}`
  const newIndexKey = `runs:active:{ws}:{proj}:index`
  const newDataKey = `runs:active:{ws}:{proj}:data`

  // 1. Read all runs from the old hash
  const runs = await redis.hgetall(oldKey)

  // 2. Write to the new structures
  for (const [uuid, jsonData] of Object.entries(runs)) {
    const run = JSON.parse(jsonData)
    // ZADD scores must be numbers; the stored dates are ISO strings
    const timestamp = new Date(run.startedAt || run.queuedAt).getTime()

    await redis.zadd(newIndexKey, timestamp, uuid)
    await redis.hset(newDataKey, uuid, jsonData)
  }

  // 3. Set TTLs on the new keys (3 hours)
  await redis.expire(newIndexKey, 3 * 60 * 60)
  await redis.expire(newDataKey, 3 * 60 * 60)

  // 4. Delete the old key (after verifying the new structure works)
  await redis.del(oldKey)
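The script assumes you can enumerate the old keys; one non-blocking way, assuming ioredis, is scanStream (the pattern is illustrative, and migrateKey stands for the script above):

  // Find every old-format hash with SCAN rather than KEYS (non-blocking)
  const stream = redis.scanStream({ match: 'runs:active:*', count: 100 })
  const oldKeys: string[] = []

  stream.on('data', (keys: string[]) => {
    // New keys end in :index / :data; skip them so re-runs are safe
    for (const key of keys) {
      if (!key.endsWith(':index') && !key.endsWith(':data')) oldKeys.push(key)
    }
  })
  stream.on('end', async () => {
    for (const oldKey of oldKeys) await migrateKey(oldKey)
  })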

Deployment Strategy (Zero-Downtime)

Because this changes the data structure, you need a careful rollout:
Option A: Blue-Green Deployment

  1. Deploy code that writes to BOTH old and new structures
  2. Run the migration script to backfill the new structures
  3. Deploy code that reads from the new structure, falling back to the old one (see the sketch below)
  4. Verify the new structure works correctly
  5. Deploy code that only uses the new structure
  6. Clean up the old keys
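Step 3 could look like this (a sketch; listFromSortedSet and listFromHash are hypothetical helpers wrapping the two read paths shown earlier):

  // Dual-read (sketch): prefer the new index, fall back to the old hash
  const hasNewIndex = (await redis.exists(indexKey)) === 1
  const runs = hasNewIndex
    ? await listFromSortedSet(indexKey, dataKey, page, pageSize) // new path
    : await listFromHash(oldKey, page, pageSize) // old path, in-memory sort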

Option B: Feature Flag

  1. Add feature flag USE_SORTED_SET_FOR_RUNS
  2. Deploy code that supports both structures (see the sketch below)
  3. Enable flag for 1% of workspaces
  4. Gradually increase to 100%
  5. Remove old code after verification
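And the flag check in step 2 could be as small as this (isFeatureEnabled is a hypothetical helper; the read helpers are the same as in the dual-read sketch above):

  // Feature-flag branch (sketch): route reads per workspace
  const useSortedSet = await isFeatureEnabled(
    'USE_SORTED_SET_FOR_RUNS',
    workspaceId,
  )
  return useSortedSet
    ? listFromSortedSet(indexKey, dataKey, page, pageSize)
    : listFromHash(oldKey, page, pageSize)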

@andresgutgon added the DO NOT MERGE (Not safe to merge) label on Nov 20, 2025