
Conversation

@andresgutgon
Contributor

We see spikes in both containers that end up killing them. We need to find out what's causing these problems.

What's in these changes?

  • Some frontend cache optimizations to avoid hitting the backend so frequently
  • Some batching when fetching spans
  • Cache optimizations to improve runs. My main theory is that the problem is something related to runs

Interesting (maybe)

  // Pagination is native to Redis
  const runUuids = await redis.zrevrange(
    sortedSetKey,
    start,  // (page - 1) * pageSize
    end     // start + pageSize - 1
  )
  // Only fetch data for this page
  const runs = await redis.hmget(dataHashKey, ...runUuids)
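Here's a fuller sketch of that read path, assuming ioredis, run payloads stored as JSON strings (as shown further down), and illustrative key and type names; ZCARD gives the total for the pager:

  import Redis from 'ioredis'

  const redis = new Redis()

  // Hypothetical shape; fields follow the stored examples below
  type Run = { uuid: string; queuedAt: string; source: string }

  async function listRunsPage(
    sortedSetKey: string,
    dataHashKey: string,
    page: number,
    pageSize: number,
  ): Promise<{ runs: Run[]; total: number }> {
    const start = (page - 1) * pageSize
    const end = start + pageSize - 1

    // Newest first; Redis does both the sorting and the slicing
    const runUuids = await redis.zrevrange(sortedSetKey, start, end)
    if (runUuids.length === 0) return { runs: [], total: 0 }

    // Total count for the pager, straight from the index
    const total = await redis.zcard(sortedSetKey)

    // HMGET returns null for missing fields, so filter before parsing
    const rows = await redis.hmget(dataHashKey, ...runUuids)
    const runs = rows
      .filter((row): row is string => row !== null)
      .map((row) => JSON.parse(row) as Run)

    return { runs, total }
  }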

Warning

Requires a migration, but provides the best scalability for production systems
with 1000+ concurrent runs.

📊 Performance Comparison

| Scenario | Original (HGETALL) | Hybrid (HSCAN) | Sorted Sets |
| --- | --- | --- | --- |
| Memory (1000 runs) | 10 MB | 2 MB | 0.05 MB/page |
| Memory (5000 runs) | 50 MB | 10 MB | 0.05 MB/page |
| Load time (1000 runs) | 2s | 1s | 0.1s |
| Pagination | In-memory | In-memory | Native Redis |
| Scales to | ~1K runs | ~5K runs | Millions |

What kind of migration does using ZREVRANGE require?

Using ZREVRANGE (and the Sorted Set approach in general) requires a data structure migration in Redis. Here's what needs to change:

Current Structure (Hash-based)

Single Redis Hash

  Key: runs:active:{workspaceId}:{projectId}
  Type: HASH
  Data: {
    "run-uuid-1": '{"uuid":"run-uuid-1","queuedAt":"2024-...","source":"API"}',
    "run-uuid-2": '{"uuid":"run-uuid-2","queuedAt":"2024-...","source":"Playground"}',
    ...
  }

Operations:
  • HSET to add a run
  • HDEL to remove a run
  • HGETALL or HSCAN to list all runs, then sort in-memory (sketch below)
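For comparison, here's a sketch of what the current read path amounts to (names are illustrative): everything is loaded and sorted in application memory, which is where the memory and latency numbers in the table above come from.

  // Current approach (sketch): O(total runs) memory on every request
  const all = await redis.hgetall(oldKey) // { uuid: json, ... }
  const pageOfRuns = Object.values(all)
    .map((json) => JSON.parse(json))
    .sort((a, b) => new Date(b.queuedAt).getTime() - new Date(a.queuedAt).getTime())
    .slice(start, start + pageSize) // pagination happens in memory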


New Structure (Sorted Set + Hash)

Two Redis data structures:

Sorted Set (Index)

  Key: runs:active:{workspaceId}:{projectId}:index
  Type: SORTED SET
  Data: {
    "run-uuid-1": 1732123456789,  // timestamp as score
    "run-uuid-2": 1732123456790,
    ...
  }

Hash (Full Data)

  Key: runs:active:{workspaceId}:{projectId}:data
  Type: HASH
  Data: {
    "run-uuid-1": '{"uuid":"run-uuid-1",...}',
    "run-uuid-2": '{"uuid":"run-uuid-2",...}',
    ...
  }

Operations:
  • ZADD + HSET to add a run (2 commands; see the atomic sketch below)
  • ZREM + HDEL to remove a run (2 commands)
  • ZREVRANGE to get paginated UUIDs (sorted by timestamp)
  • HMGET to fetch only the runs for the current page
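Since add and remove now touch two keys each, a MULTI keeps the index and the data hash from drifting apart. A sketch, with illustrative key names:

  // Add a run: index entry + full payload, applied atomically
  await redis
    .multi()
    .zadd(indexKey, Date.now(), run.uuid)
    .hset(dataKey, run.uuid, JSON.stringify(run))
    .exec()

  // Remove a run: drop it from both structures
  await redis.multi().zrem(indexKey, run.uuid).hdel(dataKey, run.uuid).exec()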


You need to update:

create.ts - where runs are added to the cache
  • FROM: redis.hset(key, runUuid, JSON.stringify(run))
  • TO: redis.zadd(indexKey, timestamp, runUuid) + redis.hset(dataKey, runUuid, JSON.stringify(run))

delete.ts - where runs are removed from the cache
  • FROM: redis.hdel(key, runUuid)
  • TO: redis.zrem(indexKey, runUuid) + redis.hdel(dataKey, runUuid)

update.ts - where run data is updated
  • FROM: redis.hset(key, runUuid, JSON.stringify(run))
  • TO: redis.hset(dataKey, runUuid, JSON.stringify(run)) (only update the hash; the score stays the same; see the note after this list)

listActive.ts - where runs are fetched
  • FROM: load everything with HGETALL/HSCAN, sort in-memory, slice for pagination
  • TO: ZREVRANGE for the page's UUIDs + HMGET for their data (no in-memory sorting)
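One nuance worth spelling out for update.ts: because the payload and the ordering live in separate structures, updates don't have to touch the index at all. A sketch:

  // update.ts (sketch): refresh the payload without re-scoring the index,
  // so the run keeps its position in the timestamp ordering
  await redis.hset(dataKey, run.uuid, JSON.stringify(run))

  // Only if the sort key itself changes (e.g. startedAt is set) re-score:
  // await redis.zadd(indexKey, new Date(run.startedAt).getTime(), run.uuid)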

Data Migration (Redis Layer)
You need a migration script to convert existing data:

  // For each workspace/project with active runs:
  const oldKey = `runs:active:{ws}:{proj}`
  const newIndexKey = `runs:active:{ws}:{proj}:index`
  const newDataKey = `runs:active:{ws}:{proj}:data`

  // 1. Read all runs from the old hash
  const runs = await redis.hgetall(oldKey)

  // 2. Write to the new structures
  for (const [uuid, jsonData] of Object.entries(runs)) {
    const run = JSON.parse(jsonData)
    // ZADD scores must be numbers; the stored dates are ISO strings
    const timestamp = new Date(run.startedAt || run.queuedAt).getTime()

    await redis.zadd(newIndexKey, timestamp, uuid)
    await redis.hset(newDataKey, uuid, jsonData)
  }

  // 3. Set TTLs on the new keys (3 hours)
  await redis.expire(newIndexKey, 3 * 60 * 60)
  await redis.expire(newDataKey, 3 * 60 * 60)

  // 4. Delete the old key (after verifying the new structure works)
  await redis.del(oldKey)
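The script assumes you can enumerate the old keys; one non-blocking way, assuming ioredis, is scanStream (the pattern is illustrative, and migrateKey stands for the script above):

  // Find every old-format hash with SCAN rather than KEYS (non-blocking)
  const stream = redis.scanStream({ match: 'runs:active:*', count: 100 })
  const oldKeys: string[] = []

  stream.on('data', (keys: string[]) => {
    // New keys end in :index / :data; skip them so re-runs are safe
    for (const key of keys) {
      if (!key.endsWith(':index') && !key.endsWith(':data')) oldKeys.push(key)
    }
  })
  stream.on('end', async () => {
    for (const oldKey of oldKeys) await migrateKey(oldKey)
  })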

Deployment Strategy (Zero-Downtime)

Because this changes the data structure, you need a careful rollout:
Option A: Blue-Green Deployment

  1. Deploy code that writes to BOTH old and new structures
  2. Run the migration script to backfill the new structures
  3. Deploy code that reads from the new structure, falling back to the old one (see the sketch below)
  4. Verify the new structure works correctly
  5. Deploy code that only uses the new structure
  6. Clean up the old keys
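Step 3 could look like this (a sketch; listFromSortedSet and listFromHash are hypothetical helpers wrapping the two read paths shown earlier):

  // Dual-read (sketch): prefer the new index, fall back to the old hash
  const hasNewIndex = (await redis.exists(indexKey)) === 1
  const runs = hasNewIndex
    ? await listFromSortedSet(indexKey, dataKey, page, pageSize) // new path
    : await listFromHash(oldKey, page, pageSize) // old path, in-memory sort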

Option B: Feature Flag

  1. Add feature flag USE_SORTED_SET_FOR_RUNS
  2. Deploy code that supports both structures (see the sketch below)
  3. Enable flag for 1% of workspaces
  4. Gradually increase to 100%
  5. Remove old code after verification
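And the flag check in step 2 could be as small as this (isFeatureEnabled is a hypothetical helper; the read helpers are the same as in the dual-read sketch above):

  // Feature-flag branch (sketch): route reads per workspace
  const useSortedSet = await isFeatureEnabled(
    'USE_SORTED_SET_FOR_RUNS',
    workspaceId,
  )
  return useSortedSet
    ? listFromSortedSet(indexKey, dataKey, page, pageSize)
    : listFromHash(oldKey, page, pageSize)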

@andresgutgon added the DO NOT MERGE (Not safe to merge) label on Nov 20, 2025