
Background Work Profiling #412

Open

lightsighter wants to merge 15 commits into main from mbauer-bgwork-profile

Conversation

@lightsighter
Contributor

This adds a binary profiling system for Realm's background work items, producing .bin files in the RBWP (Realm Background Work Profile) format that can be loaded by Legion Prof alongside normal profiling logs.

Motivation

Background worker threads perform significant work (DMA transfers, active message handling, GPU event reaping, dependent partitioning) that was previously invisible to profilers. This makes it difficult to diagnose performance issues caused by background work contention or to understand how background work interacts with application-level tasks.

Profiling levels

  • Level 1 (-ll:bgworkprofile 1): Coarse-grained profiling. Records the start/stop time of each background work item's do_work() invocation and any GPU compute kernels launched during that work.
  • Level 2 (-ll:bgworkprofile 2): Fine-grained profiling. Adds sub-item timing within each do_work() call, identifying specific operations (individual AM handler invocations, per-XferDes progress_xd() calls, dependent partitioning operations, GPU event reap operations).

Command-line flags:

  • -ll:bgworkprofile: profiling level (0=off, 1=coarse, 2=fine). Default: 0

  • -ll:bgworkprofile_logfile: output file path (% replaced with node ID). Default: bgwork_profile_%.bin

  • -ll:bgworkprofile_bufsize: maximum in-memory buffer before streaming to disk (0=unlimited). Default: 1024

Binary file format (RBWP)

A compact binary format with:

  • 36-byte header (magic, version, flags, node ID, zero time, descriptor counts, descriptor table offset)
  • Data blocks streamed to disk during the run, containing variable-length records with big-endian delta-encoded timestamps
  • Descriptor tables appended at end of file at shutdown (when all work items and sub-items have been registered), with the header patched to point to them
  • Record types: COARSE_BEGIN/END (level 1), FINE_BEGIN/END (level 2), GPU_WORK (GPU kernel timing)
  • Optional zlib compression per data block
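To make the header layout concrete, here is one hypothetical C++ rendering of it. The field widths and ordering are assumptions chosen so the sizes sum to 36 bytes; the authoritative definition lives in bgwork_profile.h.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical RBWP header layout -- field widths and ordering here are
// illustrative assumptions, not the definition from bgwork_profile.h.
#pragma pack(push, 1)
struct RBWPHeader {
  char magic[4];              // "RBWP"
  uint16_t version;           // format version
  uint16_t flags;             // e.g. whether data blocks are zlib-compressed
  uint32_t node_id;           // rank that wrote this file
  uint64_t zero_time;         // timestamp origin for delta decoding
  uint32_t num_items;         // work-item descriptor count (patched at shutdown)
  uint32_t num_subitems;      // sub-item descriptor count (patched at shutdown)
  uint64_t desc_table_offset; // file offset of descriptor tables (patched)
};
#pragma pack(pop)

static_assert(sizeof(RBWPHeader) == 36, "header must be exactly 36 bytes");
```

Because the descriptor counts and table offset are only known at shutdown, a writer would emit this header with placeholder values and seek back to patch the three trailing fields once the tables are appended.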

The streaming design allows bounded memory usage for long-running applications. Data blocks are buffered in memory and flushed to disk when the buffer exceeds the configured size. Descriptor tables are written at shutdown to ensure all dynamically-registered items are included.
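The delta-encoding idea behind the timestamps can be sketched as follows. This is a simplified illustration using fixed-width 32-bit big-endian deltas; the actual records are variable-length, so the real encoding differs.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Illustrative delta encoding: each timestamp is stored as a big-endian
// 32-bit delta from the previous one, so monotonically increasing
// timestamps stay small and decode with a running sum from zero_time.
static void put_be32(std::vector<uint8_t> &out, uint32_t v) {
  out.push_back(uint8_t(v >> 24));
  out.push_back(uint8_t(v >> 16));
  out.push_back(uint8_t(v >> 8));
  out.push_back(uint8_t(v));
}

std::vector<uint8_t> encode_deltas(const std::vector<uint64_t> &ts,
                                   uint64_t zero_time) {
  std::vector<uint8_t> out;
  uint64_t prev = zero_time;
  for (uint64_t t : ts) {
    put_be32(out, uint32_t(t - prev)); // assumes each delta fits in 32 bits
    prev = t;
  }
  return out;
}

std::vector<uint64_t> decode_deltas(const std::vector<uint8_t> &buf,
                                    uint64_t zero_time) {
  std::vector<uint64_t> ts;
  uint64_t prev = zero_time;
  for (std::size_t i = 0; i + 4 <= buf.size(); i += 4) {
    uint32_t d = (uint32_t(buf[i]) << 24) | (uint32_t(buf[i + 1]) << 16) |
                 (uint32_t(buf[i + 2]) << 8) | uint32_t(buf[i + 3]);
    prev += d;
    ts.push_back(prev);
  }
  return ts;
}
```

Deltas anchored to the header's zero time also make blocks self-describing enough for a profiler to reconstruct absolute times without scanning the whole file.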

GPU kernel timing

Background work items that launch GPU compute kernels (batch affine copies, transpose kernels, fill kernels, reduction kernels) are instrumented to capture host-side timing of when those kernels execute on the GPU. This uses GPUCompletionNotification callbacks on existing (untimed) CUDA/HIP events — a start notification is placed on the stream before kernel submission and an end notification after, with host-side timestamps recorded when each event is reaped. The resulting GPU_WORK records carry the GPU processor ID so profilers can place them on the correct GPU device timeline.

Only compute kernel launches are instrumented, not cuMemcpyAsync/hipMemcpyAsync calls, since those run on copy engine hardware and are already represented by copy channels in the profiler.
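The notification bracketing can be illustrated with a mock in-order stream. MockStream, GpuWorkTimer, and the injected clock below are stand-ins for illustration, not Realm or CUDA APIs; the real code attaches GPUCompletionNotification callbacks to untimed events on the actual stream.

```cpp
#include <cstdint>
#include <functional>
#include <queue>
#include <utility>

// Stand-in for an in-order CUDA/HIP stream: callbacks fire in submission
// order when drained, mimicking event-reap notifications.
struct MockStream {
  std::queue<std::function<void()>> pending;
  void add_notification(std::function<void()> cb) { pending.push(std::move(cb)); }
  void drain() {
    while (!pending.empty()) {
      pending.front()();
      pending.pop();
    }
  }
};

// Host-side timing bracket: a start notification is enqueued before the
// kernel, an end notification after; each records a host timestamp when
// it is reaped, yielding the interval for a GPU_WORK record.
struct GpuWorkTimer {
  uint64_t start_ns = 0, end_ns = 0;
  void bracket(MockStream &s, std::function<void()> launch_kernel,
               std::function<uint64_t()> now) {
    s.add_notification([this, now] { start_ns = now(); });
    launch_kernel(); // the compute kernel sits between the two notifications
    s.add_notification([this, now] { end_ns = now(); });
  }
};
```

Because the stream is in-order, the start notification necessarily reaps before the kernel completes and the end notification after, so the host-side interval bounds the kernel's execution without timed events.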

Implementation details

New files:

  • bgwork_profile.h — File format specification, manager class, record type constants, thread-local state
  • bgwork_profile.inl — Inline recording functions (hot path), timestamp delta encoding
  • bgwork_profile.cc — Manager lifecycle, descriptor table writing, streaming block management
  • tests/bgwork_profile.cc — Test exercising DMA background work with file validation

Modified files:

  • bgwork.h/cc — Added get_slot() accessor on BackgroundWorkItem; coarse profiling calls (bgwork_profile_begin/end) around do_work() dispatch; thread-local state initialization
  • runtime_impl.h/cc — Command-line flag parsing, profiler initialization and shutdown hooks
  • transfer/channel.h/inl — Added get_bgwork_slot() on SingleXDQChannel; level 2 fine-grained profiling around progress_xd() calls with lazy sub-item registration
  • activemsg.h/cc — Level 2 fine-grained profiling around AM handler invocations with per-handler sub-item registration
  • deppart/partitions.h/cc — Level 2 fine-grained profiling around dependent partitioning operations
  • cuda/cuda_internal.h/cc — BgWorkGpuCudaNotification class; GPU kernel timing instrumentation in GPUXferDes::progress_xd(), GPUfillXferDes::progress_xd(), GPUreduceXferDes::progress_xd()
  • cuda/cuda_module.cc — BgWorkGpuCudaNotification implementation; level 2 GPU reap sub-item registration
  • hip/hip_internal.h/cc — BgWorkGpuHipNotification class; GPU kernel timing in GPUreduceXferDes::progress_xd()
  • hip/hip_module.cc — BgWorkGpuHipNotification implementation; level 2 GPU reap sub-item registration

Design decisions

  • Streaming with appended descriptors: Data blocks are streamed to disk during the run to bound memory. Descriptor tables are appended at end of file at shutdown since work items and sub-items register dynamically during module initialization. The header is patched with final counts and the descriptor table offset.
  • No CUDA timed events: GPU timing uses host-side timestamps from event reap callbacks rather than cuEventElapsedTime(), consistent with Realm's existing approach of using CU_EVENT_DISABLE_TIMING events.
  • Stateless GPU instrumentation: GPU timing brackets in progress_xd() functions use a local variable — no state carries between invocations.
  • HIP parity: The HIP module mirrors the CUDA instrumentation. Only GPUreduceXferDes is instrumented on the HIP side since the HIP transfer/fill paths don't launch compute kernels.

Usage

./my_app -ll:bgworkprofile 2 -ll:bgworkprofile_logfile bgwork_%.bin -ll:bgworkprofile_bufsize 512

The % is replaced with the node ID for multi-node runs. The resulting .bin files are passed to Legion Prof alongside normal log files — format detection is based on the RBWP header magic, not filenames.
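Magic-based detection can be sketched as below; the function name and error handling are illustrative, not the actual Legion Prof loader code.

```cpp
#include <cstdio>
#include <cstring>

// Detect an RBWP profile by its 4-byte magic, independent of the file name.
bool is_rbwp_file(const char *path) {
  FILE *f = std::fopen(path, "rb");
  if (!f)
    return false;
  char magic[4] = {0, 0, 0, 0};
  size_t n = std::fread(magic, 1, 4, f);
  std::fclose(f);
  return (n == 4) && (std::memcmp(magic, "RBWP", 4) == 0);
}
```

Checking the magic rather than the extension lets users name the output anything via -ll:bgworkprofile_logfile and still mix the files freely with normal profiling logs on the Legion Prof command line.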

@codecov

codecov bot commented Mar 17, 2026

Codecov Report

❌ Patch coverage is 5.25114% with 415 lines in your changes missing coverage. Please review.
✅ Project coverage is 28.82%. Comparing base (87ec0ad) to head (20b13e1).
✅ All tests successful. No failed tests found.

Files with missing lines           Patch %   Lines
src/realm/bgwork.cc                  7.00%   239 Missing ⚠️
src/realm/bgwork.inl                 0.00%   118 Missing and 6 partials ⚠️
src/realm/activemsg.cc              15.78%   16 Missing ⚠️
src/realm/deppart/partitions.cc      0.00%   13 Missing ⚠️
src/realm/runtime_impl.cc            0.00%   10 Missing ⚠️
src/realm/transfer/channel.inl      20.00%   5 Missing and 3 partials ⚠️
src/realm/bgwork.h                   0.00%   4 Missing ⚠️
src/realm/tasks.cc                   0.00%   1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #412      +/-   ##
==========================================
- Coverage   29.07%   28.82%   -0.26%     
==========================================
  Files         194      196       +2     
  Lines       40229    40667     +438     
  Branches    14464    14650     +186     
==========================================
+ Hits        11697    11721      +24     
- Misses      27723    28013     +290     
- Partials      809      933     +124     

