Memory Leak in Sidekiq Worker Processes #1935

@ahnv

Description

Sidekiq worker processes exhibit unbounded memory growth in production. RSS increases monotonically across sync cycles and does not recover between jobs, eventually triggering OOM kills.

Instrumentation Added

Two pieces of instrumentation were added to aid reproduction and diagnosis. Both should be kept for ongoing monitoring, but the flamegraph export should remain opt-in in production.

1. lib/sidekiq/memory_profiling_middleware.rb (new file)
A Sidekiq server middleware that logs GC heap stats and object type deltas after every job:

[MemoryProfiling] job=Sidekiq::ActiveJob::Wrapper queue=high_priority
heap_live_delta=488847 objects_allocated=774155 gc_runs=1
T_OBJECT=+111069 T_STRING=+124374 T_ARRAY=+52819 T_HASH=+37179 T_DATA=+95859

Registered in config/initializers/sidekiq.rb.
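
For readers who want to reproduce this without the branch, here is a minimal sketch of what such a middleware could look like. It is not the actual file: the class name, the forced GC.start before measuring, and the choice of GC.stat keys are assumptions made to match the log format above.

```ruby
require "logger"

# Hypothetical sketch of a Sidekiq server middleware that logs GC and
# object-type deltas per job. Names and details are assumptions.
class MemoryProfilingMiddleware
  TRACKED_TYPES = %i[T_OBJECT T_STRING T_ARRAY T_HASH T_DATA].freeze

  # In a real initializer the logger would likely be Sidekiq.logger.
  def initialize(logger: Logger.new($stdout))
    @logger = logger
  end

  # Sidekiq server middleware contract: call(job_instance, payload, queue) + yield.
  def call(_job_instance, job, queue)
    gc_before = GC.stat
    types_before = ObjectSpace.count_objects
    yield
  ensure
    log_deltas(job, queue, gc_before, types_before) if types_before
  end

  private

  def log_deltas(job, queue, gc_before, types_before)
    GC.start # assumption: force a GC so the deltas reflect retained objects
    gc_after = GC.stat
    types_after = ObjectSpace.count_objects
    type_deltas = TRACKED_TYPES.map do |type|
      format("%s=%+d", type, types_after.fetch(type, 0) - types_before.fetch(type, 0))
    end
    @logger.info(
      "[MemoryProfiling] job=#{job['class']} queue=#{queue} " \
      "heap_live_delta=#{gc_after[:heap_live_slots] - gc_before[:heap_live_slots]} " \
      "objects_allocated=#{gc_after[:total_allocated_objects] - gc_before[:total_allocated_objects]} " \
      "gc_runs=#{gc_after[:count] - gc_before[:count]} " \
      "#{type_deltas.join(' ')}"
    )
  end
end

# Registration in config/initializers/sidekiq.rb would look roughly like:
#   Sidekiq.configure_server do |config|
#     config.server_middleware { |chain| chain.add MemoryProfilingMiddleware }
#   end
```

Note that with concurrency above 1 these counters are process-wide, so a job's deltas can include allocations made by jobs running on other threads at the same time.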

2. Process-wide StackProf flamegraph export (in config/initializers/sidekiq.rb)
When SIDEKIQ_FLAMEGRAPH=1 is set, starts a stackprof object-allocation profile on worker boot and writes a .dump file to SIDEKIQ_FLAMEGRAPH_DIR (default /tmp) on graceful shutdown:

[MemoryProfiling] Process profile written: /tmp/stackprof_process_88444_1234567890.dump

Convert with:

bundle exec stackprof --json /tmp/stackprof_process_*.dump > flame.json
# upload flame.json to speedscope.app → Sandwich view, sort by Self

The stackprof gem was moved out of the development group in the Gemfile so that it is available in production.
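
As a rough illustration of the opt-in wiring, the sketch below gates the profiler on SIDEKIQ_FLAMEGRAPH and builds the dump path. The helper names are hypothetical, the pid-plus-epoch-seconds file name is inferred from the sample log line, and the default interval of 1 is an assumption; the actual initializer may differ.

```ruby
# Hypothetical helpers; only the environment variable names come from this issue.
def flamegraph_enabled?(env = ENV)
  env["SIDEKIQ_FLAMEGRAPH"] == "1"
end

# Assumption: the file name is stackprof_process_<pid>_<epoch seconds>.dump.
def flamegraph_dump_path(env = ENV, pid: Process.pid, now: Time.now)
  dir = env.fetch("SIDEKIQ_FLAMEGRAPH_DIR", "/tmp")
  File.join(dir, "stackprof_process_#{pid}_#{now.to_i}.dump")
end

# Assumption: default of 1; in :object mode the interval is counted in allocations.
def flamegraph_interval(env = ENV)
  Integer(env.fetch("SIDEKIQ_FLAMEGRAPH_INTERVAL", "1"))
end

# Only touches StackProf and Sidekiq when the feature is switched on.
if flamegraph_enabled? && defined?(Sidekiq)
  require "stackprof"

  Sidekiq.configure_server do |config|
    config.on(:startup) do
      # raw: true keeps per-sample stacks, which flamegraph-style output relies on.
      StackProf.start(mode: :object, interval: flamegraph_interval, raw: true)
    end

    config.on(:shutdown) do
      StackProf.stop
      path = flamegraph_dump_path
      StackProf.results(path) # writes the marshalled profile to the given path
      Sidekiq.logger.info("[MemoryProfiling] Process profile written: #{path}")
    end
  end
end
```

Because the dump is only written from the shutdown hook, a worker that is OOM-killed (SIGKILL) will not produce a file; the profile has to come from a graceful shutdown.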


Observed Behavior

Per-job middleware shows each SyncJob retaining ~488k objects after GC:

heap_live_delta=488847 objects_allocated=774155 gc_runs=1
T_OBJECT=+111069 T_STRING=+124374 T_ARRAY=+52819 T_HASH=+37179 T_DATA=+95859

With a Sidekiq concurrency of 3, a single sync cycle retains roughly 1.47M objects (3 × 488,847). Workers are OOM-killed before they finish their queue.

Investigation

Object allocation flamegraph (Sandwich view, sorted by Self) identified:

Method                                                 Self allocations   %
Class#new                                              63,163             12%
Kernel#dup                                             47,289             8.7%
Kernel#BigDecimal                                      42,569             7.8%
BigDecimal#round                                       31,273             5.7%
PG::Result#each                                        21,592             4.0%
ActiveModel::AttributeAssignment#_assign_attribute     21,420             3.9%
ActiveRecord::AttributeAssignment#_assign_attributes   21,405             3.9%
Balance::Materializer#persist_balances                 21,202             3.9%
Balance::SyncCache#get_entries                         3,199              0.6%

Date#upto sits at the root of the call tree and accounts for 45% of total allocations.

Suspected Root Cause

Balance::Materializer#persist_balances and Holding::Materializer#persist_holdings instantiate full AR models (Balance.new(...), Holding.new(...)) for every date in the calculation range, then immediately call .attributes.slice(...) to serialize them for upsert_all. Each row therefore pays for Class#new, Kernel#dup, and the full AR attribute machinery with no benefit, because the model instance is only used to produce a hash.
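
One possible direction, sketched under assumptions, is to build the row hashes for upsert_all directly instead of going through model instances. The column names, the helper name build_balance_rows, and the unique_by columns in the comment are all hypothetical; the real schema and calculators will differ.

```ruby
require "date"
require "bigdecimal"

# Hypothetical helper: builds one plain hash per date, with no AR model involved.
# The block stands in for whatever the calculator returns for a given date.
def build_balance_rows(account_id:, currency:, start_date:, end_date:)
  start_date.upto(end_date).map do |date|
    {
      account_id: account_id,
      date: date,
      balance: yield(date),
      currency: currency
    }
  end
end

rows = build_balance_rows(
  account_id: 1,
  currency: "USD",
  start_date: Date.new(2024, 1, 1),
  end_date: Date.new(2024, 1, 3)
) { |_date| BigDecimal("100.25") }

# In the materializer the rows would then go straight to the bulk write, e.g.:
#   Balance.upsert_all(rows, unique_by: %i[account_id date currency])
```

The trade-off is that values no longer pass through the model's attribute defaults and type casting before serialization, so the hashes must already carry every column the upsert expects.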

Balance::SyncCache#converted_entries calls e.dup on every Entry AR model for FX conversion (secondary issue).
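
For the secondary issue, a lightweight value object could replace the per-entry dup. The sketch below is only an illustration: EntryRow stands in for the Entry model, and ConvertedEntry, convert_entries, and the rates lookup keyed by currency pair are hypothetical names, not the app's FX API.

```ruby
require "date"
require "bigdecimal"

# Hypothetical stand-in for the Entry AR model (only the fields used here).
EntryRow = Struct.new(:date, :amount, :currency, keyword_init: true)

# Small value object carrying just what the balance calculation needs.
ConvertedEntry = Struct.new(:date, :amount, :currency)

# Converts entries into the target currency without duplicating the source records.
# `rates` is assumed to map [from, to] currency pairs to a BigDecimal rate.
def convert_entries(entries, target_currency:, rates:)
  entries.map do |entry|
    rate =
      if entry.currency == target_currency
        BigDecimal("1")
      else
        rates.fetch([entry.currency, target_currency])
      end
    ConvertedEntry.new(entry.date, entry.amount * rate, target_currency)
  end
end

entries = [
  EntryRow.new(date: Date.new(2024, 1, 1), amount: BigDecimal("100"), currency: "EUR"),
  EntryRow.new(date: Date.new(2024, 1, 2), amount: BigDecimal("50"), currency: "USD")
]
converted = convert_entries(
  entries,
  target_currency: "USD",
  rates: { %w[EUR USD] => BigDecimal("1.1") }
)
```

This only helps if downstream code reads nothing from the converted entries beyond these fields, which would need checking against the calculators listed below.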

Files to Investigate

  • app/models/balance/materializer.rb — persist_balances
  • app/models/balance/base_calculator.rb — build_balance
  • app/models/holding/materializer.rb — persist_holdings
  • app/models/holding/forward_calculator.rb / reverse_calculator.rb — build_holdings
  • app/models/balance/sync_cache.rb — converted_entries

Steps to Reproduce

SIDEKIQ_FLAMEGRAPH=1 SIDEKIQ_FLAMEGRAPH_INTERVAL=50 bin/dev
# trigger a sync, then Ctrl+C for graceful shutdown
bundle exec stackprof --json /tmp/stackprof_process_*.dump > flame.json
# upload flame.json to speedscope.app → Sandwich view, sort by Self
