
Background Work Profiling #412

Open

lightsighter wants to merge 15 commits into main from mbauer-bgwork-profile

Conversation

@lightsighter
Contributor

This adds a binary profiling system for Realm's background work items, producing .bin files in the RBWP (Realm Background Work Profile) format that can be loaded by Legion Prof alongside normal profiling logs.

Motivation

Background worker threads perform significant work (DMA transfers, active message handling, GPU event reaping, dependent partitioning) that was previously invisible to profilers. This makes it difficult to diagnose performance issues caused by background work contention or to understand how background work interacts with application-level tasks.

Profiling levels

  • Level 1 (-ll:bgworkprofile 1): Coarse-grained profiling. Records the start/stop time of each background work item's do_work() invocation and any GPU compute kernels launched during that work.
  • Level 2 (-ll:bgworkprofile 2): Fine-grained profiling. Adds sub-item timing within each do_work() call, identifying specific operations (individual AM handler invocations, per-XferDes progress_xd() calls, dependent partitioning operations, GPU event reap operations).

Command-line flags:

  • -ll:bgworkprofile: profiling level (0=off, 1=coarse, 2=fine). Default: 0

  • -ll:bgworkprofile_logfile: output file path (% replaced with node ID). Default: bgwork_profile_%.bin

  • -ll:bgworkprofile_bufsize: maximum in-memory buffer before streaming to disk (0=unlimited). Default: 1024

Binary file format (RBWP)

A compact binary format with:

  • 36-byte header (magic, version, flags, node ID, zero time, descriptor counts, descriptor table offset)
  • Data blocks streamed to disk during the run, containing variable-length records with big-endian delta-encoded timestamps
  • Descriptor tables appended at end of file at shutdown (when all work items and sub-items have been registered), with the header patched to point to them
  • Record types: COARSE_BEGIN/END (level 1), FINE_BEGIN/END (level 2), GPU_WORK (GPU kernel timing)
  • Optional zlib compression per data block
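To make the header layout concrete, here is one hypothetical C++ rendering of it. The field widths and ordering are assumptions chosen so the sizes sum to 36 bytes; the authoritative definition lives in bgwork_profile.h.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical RBWP header layout -- field widths and ordering here are
// illustrative assumptions, not the definition from bgwork_profile.h.
#pragma pack(push, 1)
struct RBWPHeader {
  char magic[4];              // "RBWP"
  uint16_t version;           // format version
  uint16_t flags;             // e.g. whether data blocks are zlib-compressed
  uint32_t node_id;           // rank that wrote this file
  uint64_t zero_time;         // timestamp origin for delta decoding
  uint32_t num_items;         // work-item descriptor count (patched at shutdown)
  uint32_t num_subitems;      // sub-item descriptor count (patched at shutdown)
  uint64_t desc_table_offset; // file offset of descriptor tables (patched)
};
#pragma pack(pop)

static_assert(sizeof(RBWPHeader) == 36, "header must be exactly 36 bytes");
```

Because the descriptor counts and table offset are only known at shutdown, a writer would emit this header with placeholder values and seek back to patch the three trailing fields once the tables are appended.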

The streaming design allows bounded memory usage for long-running applications. Data blocks are buffered in memory and flushed to disk when the buffer exceeds the configured size. Descriptor tables are written at shutdown to ensure all dynamically-registered items are included.
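The delta-encoding idea behind the timestamps can be sketched as follows. This is a simplified illustration using fixed-width 32-bit big-endian deltas; the actual records are variable-length, so the real encoding differs.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Illustrative delta encoding: each timestamp is stored as a big-endian
// 32-bit delta from the previous one, so monotonically increasing
// timestamps stay small and decode with a running sum from zero_time.
static void put_be32(std::vector<uint8_t> &out, uint32_t v) {
  out.push_back(uint8_t(v >> 24));
  out.push_back(uint8_t(v >> 16));
  out.push_back(uint8_t(v >> 8));
  out.push_back(uint8_t(v));
}

std::vector<uint8_t> encode_deltas(const std::vector<uint64_t> &ts,
                                   uint64_t zero_time) {
  std::vector<uint8_t> out;
  uint64_t prev = zero_time;
  for (uint64_t t : ts) {
    put_be32(out, uint32_t(t - prev)); // assumes each delta fits in 32 bits
    prev = t;
  }
  return out;
}

std::vector<uint64_t> decode_deltas(const std::vector<uint8_t> &buf,
                                    uint64_t zero_time) {
  std::vector<uint64_t> ts;
  uint64_t prev = zero_time;
  for (std::size_t i = 0; i + 4 <= buf.size(); i += 4) {
    uint32_t d = (uint32_t(buf[i]) << 24) | (uint32_t(buf[i + 1]) << 16) |
                 (uint32_t(buf[i + 2]) << 8) | uint32_t(buf[i + 3]);
    prev += d;
    ts.push_back(prev);
  }
  return ts;
}
```

Deltas anchored to the header's zero time also make blocks self-describing enough for a profiler to reconstruct absolute times without scanning the whole file.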

GPU kernel timing

Background work items that launch GPU compute kernels (batch affine copies, transpose kernels, fill kernels, reduction kernels) are instrumented to capture host-side timing of when those kernels execute on the GPU. This uses GPUCompletionNotification callbacks on existing (untimed) CUDA/HIP events — a start notification is placed on the stream before kernel submission and an end notification after, with host-side timestamps recorded when each event is reaped. The resulting GPU_WORK records carry the GPU processor ID so profilers can place them on the correct GPU device timeline.

Only compute kernel launches are instrumented, not cuMemcpyAsync/hipMemcpyAsync calls, since those run on copy engine hardware and are already represented by copy channels in the profiler.
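The notification bracketing can be illustrated with a mock in-order stream. MockStream, GpuWorkTimer, and the injected clock below are stand-ins for illustration, not Realm or CUDA APIs; the real code attaches GPUCompletionNotification callbacks to untimed events on the actual stream.

```cpp
#include <cstdint>
#include <functional>
#include <queue>
#include <utility>

// Stand-in for an in-order CUDA/HIP stream: callbacks fire in submission
// order when drained, mimicking event-reap notifications.
struct MockStream {
  std::queue<std::function<void()>> pending;
  void add_notification(std::function<void()> cb) { pending.push(std::move(cb)); }
  void drain() {
    while (!pending.empty()) {
      pending.front()();
      pending.pop();
    }
  }
};

// Host-side timing bracket: a start notification is enqueued before the
// kernel, an end notification after; each records a host timestamp when
// it is reaped, yielding the interval for a GPU_WORK record.
struct GpuWorkTimer {
  uint64_t start_ns = 0, end_ns = 0;
  void bracket(MockStream &s, std::function<void()> launch_kernel,
               std::function<uint64_t()> now) {
    s.add_notification([this, now] { start_ns = now(); });
    launch_kernel(); // the compute kernel sits between the two notifications
    s.add_notification([this, now] { end_ns = now(); });
  }
};
```

Because the stream is in-order, the start notification necessarily reaps before the kernel completes and the end notification after, so the host-side interval bounds the kernel's execution without timed events.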

Implementation details

New files:

  • bgwork_profile.h — File format specification, manager class, record type constants, thread-local state
  • bgwork_profile.inl — Inline recording functions (hot path), timestamp delta encoding
  • bgwork_profile.cc — Manager lifecycle, descriptor table writing, streaming block management
  • tests/bgwork_profile.cc — Test exercising DMA background work with file validation

Modified files:

  • bgwork.h/cc — Added get_slot() accessor on BackgroundWorkItem; coarse profiling calls (bgwork_profile_begin/end) around do_work() dispatch; thread-local state initialization
  • runtime_impl.h/cc — Command-line flag parsing, profiler initialization and shutdown hooks
  • transfer/channel.h/inl — Added get_bgwork_slot() on SingleXDQChannel; level 2 fine-grained profiling around progress_xd() calls with lazy sub-item registration
  • activemsg.h/cc — Level 2 fine-grained profiling around AM handler invocations with per-handler sub-item registration
  • deppart/partitions.h/cc — Level 2 fine-grained profiling around dependent partitioning operations
  • cuda/cuda_internal.h/cc — BgWorkGpuCudaNotification class; GPU kernel timing instrumentation in GPUXferDes::progress_xd(), GPUfillXferDes::progress_xd(), GPUreduceXferDes::progress_xd()
  • cuda/cuda_module.cc — BgWorkGpuCudaNotification implementation; level 2 GPU reap sub-item registration
  • hip/hip_internal.h/cc — BgWorkGpuHipNotification class; GPU kernel timing in GPUreduceXferDes::progress_xd()
  • hip/hip_module.cc — BgWorkGpuHipNotification implementation; level 2 GPU reap sub-item registration

Design decisions

  • Streaming with appended descriptors: Data blocks are streamed to disk during the run to bound memory. Descriptor tables are appended at end of file at shutdown since work items and sub-items register dynamically during module initialization. The header is patched with final counts and the descriptor table offset.
  • No CUDA timed events: GPU timing uses host-side timestamps from event reap callbacks rather than cuEventElapsedTime(), consistent with Realm's existing approach of using CU_EVENT_DISABLE_TIMING events.
  • Stateless GPU instrumentation: GPU timing brackets in progress_xd() functions use a local variable — no state carries between invocations.
  • HIP parity: The HIP module mirrors the CUDA instrumentation. Only GPUreduceXferDes is instrumented on the HIP side since the HIP transfer/fill paths don't launch compute kernels.

Usage

./my_app -ll:bgworkprofile 2 -ll:bgworkprofile_logfile bgwork_%.bin -ll:bgworkprofile_bufsize 512

The % is replaced with the node ID for multi-node runs. The resulting .bin files are passed to Legion Prof alongside normal log files — format detection is based on the RBWP header magic, not filenames.
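Magic-based detection can be sketched as below; the function name and error handling are illustrative, not the actual Legion Prof loader code.

```cpp
#include <cstdio>
#include <cstring>

// Detect an RBWP profile by its 4-byte magic, independent of the file name.
bool is_rbwp_file(const char *path) {
  FILE *f = std::fopen(path, "rb");
  if (!f)
    return false;
  char magic[4] = {0, 0, 0, 0};
  size_t n = std::fread(magic, 1, 4, f);
  std::fclose(f);
  return (n == 4) && (std::memcmp(magic, "RBWP", 4) == 0);
}
```

Checking the magic rather than the extension lets users name the output anything via -ll:bgworkprofile_logfile and still mix the files freely with normal profiling logs on the Legion Prof command line.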

@codecov

codecov bot commented Mar 17, 2026

Codecov Report

❌ Patch coverage is 5.25114% with 415 lines in your changes missing coverage. Please review.
✅ Project coverage is 28.82%. Comparing base (87ec0ad) to head (20b13e1).
✅ All tests successful. No failed tests found.

Files with missing lines           Patch %   Lines
src/realm/bgwork.cc                  7.00%   239 Missing ⚠️
src/realm/bgwork.inl                 0.00%   118 Missing and 6 partials ⚠️
src/realm/activemsg.cc              15.78%   16 Missing ⚠️
src/realm/deppart/partitions.cc      0.00%   13 Missing ⚠️
src/realm/runtime_impl.cc            0.00%   10 Missing ⚠️
src/realm/transfer/channel.inl      20.00%   5 Missing and 3 partials ⚠️
src/realm/bgwork.h                   0.00%   4 Missing ⚠️
src/realm/tasks.cc                   0.00%   1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #412      +/-   ##
==========================================
- Coverage   29.07%   28.82%   -0.26%     
==========================================
  Files         194      196       +2     
  Lines       40229    40667     +438     
  Branches    14464    14650     +186     
==========================================
+ Hits        11697    11721      +24     
- Misses      27723    28013     +290     
- Partials      809      933     +124     

