Record and rewind CUDA GPU memory state with low overhead, deterministic replay, and Chrome trace export (CUPTI kernel timeline + graph-aware stamps).
Snapshot-only recorder for CUDA device buffers. Assumptions:
- Full snapshots only by default; delta chunks are available in Phase 2.
- Ring buffer wrap markers are supported.
- Retention policy supports DROP_OLDEST and BACKPRESSURE modes.
- Chunks for an epoch are contiguous in ring order.
- Rewind lookup scans the epoch table on the host.
- No persistence; device memory only.
- No instruction-level replay or CUPTI tracing.
- Tracked region sizes must be multiples of 4 bytes.
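The lookup and contiguity assumptions above can be pictured with a small host-side sketch. Everything here is hypothetical: the EpochRecord layout, the fixed-size slots, and the modulo wrap are illustrative assumptions, and the recorder's actual wrap markers are not modeled.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical epoch-table entry; field names are illustrative only.
struct EpochRecord {
    std::uint64_t epoch_id;
    std::size_t first_slot;   // ring slot holding the epoch's first chunk
    std::size_t chunk_count;  // number of chunks, contiguous in ring order
};

// Linear host-side scan. Returns nullptr if the epoch is absent
// (for example, never captured or already dropped).
const EpochRecord* find_epoch(const std::vector<EpochRecord>& table,
                              std::uint64_t target) {
    for (const EpochRecord& rec : table) {
        if (rec.epoch_id == target) return &rec;
    }
    return nullptr;
}

// Ring slots occupied by an epoch, in order. Contiguity in ring order is
// assumed to mean the run may wrap from the last slot back to slot 0.
std::vector<std::size_t> chunk_slots(const EpochRecord& rec,
                                     std::size_t ring_slots) {
    assert(ring_slots > 0);
    std::vector<std::size_t> slots;
    slots.reserve(rec.chunk_count);
    for (std::size_t i = 0; i < rec.chunk_count; ++i) {
        slots.push_back((rec.first_slot + i) % ring_slots);
    }
    return slots;
}
```

A linear scan is cheap as long as the epoch table stays small; a sorted table or hash map would be the obvious swap if it grows.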
cmake -S . -B build -G "Visual Studio 17 2022" -A x64
cmake --build build --config Release
.\build\Release\tt_demo.exe
.\build\Release\tt.exe verify --manifest trace\tt_manifest_verify_demo.json
.\build\Release\tt_demo_graph.exe
.\build\Release\tt_demo_graph_patch.exe
.\build\Release\tt_demo_determinism.exe
.\build\Release\tt_demo_verify.exe
.\build\Release\tt_demo_multistream_stress.exe --deps
.\build\Release\tt_demo_multistream_stress.exe --no-deps
.\build\Release\tt_tests.exe
Demo flags:
- --no-delta disables delta capture (snapshots only).
- --ring-bytes=<n> overrides ring size. If too small for all epochs, rewind verification is skipped.
- --deterministic enables deterministic capture mode (see docs/determinism.md).
- --manifest-out=<path> writes a deterministic manifest JSON.
Graph + trace:
- tt_demo_graph captures (app work + capture_epoch) as a CUDA Graph and replays it.
- Chrome trace output is written to trace/tt_trace.json.
- See docs/graphs.md and docs/trace.md for usage and limitations.
CUDA Graph patching:
- tt_demo_graph_patch replays a captured graph and updates per-iteration parameters and recorder controls without rebuilding.
- Example: .\build\Release\tt_demo_graph_patch.exe --iterations=12 --toggle-every=3
- Use --kernel-patch to exercise kernel node param updates (falls back to recapture if needed).
Open trace in Chrome:
start chrome "chrome://tracing"
RecorderConfig supports:
- retention_epochs: keep the last N epochs (0 keeps all until space pressure).
- overwrite_mode: DROP_OLDEST (discard old epochs to make space) or BACKPRESSURE (fail capture when space is insufficient).
- deterministic: enforce deterministic capture ordering and prevent epoch drops.
- enable_manifest: collect per-epoch region hashes for manifest output.
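To make the interaction between retention_epochs and overwrite_mode concrete, here is a minimal host-side model. It is a sketch under stated assumptions, not the recorder's implementation: the RetentionModel class and its methods are hypothetical, epochs are treated as opaque byte counts, and retention_epochs is assumed to apply in both overwrite modes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>

enum class OverwriteMode { DROP_OLDEST, BACKPRESSURE };

// Hypothetical model of the retention policy; not the recorder's own code.
class RetentionModel {
public:
    RetentionModel(std::size_t ring_bytes, std::size_t retention_epochs,
                   OverwriteMode mode)
        : ring_bytes_(ring_bytes), retention_epochs_(retention_epochs), mode_(mode) {}

    // Returns true if the epoch was stored, false if the capture failed.
    bool capture(std::size_t bytes) {
        if (bytes > ring_bytes_) return false;  // can never fit
        if (used_ + bytes > ring_bytes_) {
            // BACKPRESSURE: fail instead of discarding older epochs.
            if (mode_ == OverwriteMode::BACKPRESSURE) return false;
            // DROP_OLDEST: discard old epochs until the new one fits.
            while (used_ + bytes > ring_bytes_) evict_oldest();
        }
        epochs_.push_back(Entry{next_id_++, bytes});
        used_ += bytes;
        // retention_epochs == 0 keeps everything until space pressure.
        while (retention_epochs_ != 0 && epochs_.size() > retention_epochs_) {
            evict_oldest();
        }
        return true;
    }

    std::size_t epoch_count() const { return epochs_.size(); }

    std::uint64_t oldest_epoch() const {
        assert(!epochs_.empty());
        return epochs_.front().id;
    }

private:
    struct Entry {
        std::uint64_t id;
        std::size_t bytes;
    };

    void evict_oldest() {
        assert(!epochs_.empty());
        used_ -= epochs_.front().bytes;
        epochs_.pop_front();
    }

    std::size_t ring_bytes_;
    std::size_t retention_epochs_;
    OverwriteMode mode_;
    std::size_t used_ = 0;
    std::uint64_t next_id_ = 0;
    std::deque<Entry> epochs_;
};
```

With a 100-byte ring and 40-byte epochs, the third capture evicts epoch 0 under DROP_OLDEST and fails under BACKPRESSURE. The deterministic flag, which per the list above prevents epoch drops, is not modeled here.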
Notes:
- ring_bytes must be a multiple of 32 bytes.
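A caller can check or round sizes before handing them to the recorder. The helpers below are a hypothetical sketch and not part of the recorder's API. The two constants mirror the constraints stated in this README: 32 bytes for ring_bytes and 4 bytes for tracked region sizes.

```cpp
#include <cassert>
#include <cstddef>

// Granularities taken from this README; the names are illustrative.
const std::size_t kRingGranularity = 32;   // ring_bytes
const std::size_t kRegionGranularity = 4;  // tracked region sizes

// True if n is an exact multiple of granularity.
inline bool is_multiple_of(std::size_t n, std::size_t granularity) {
    assert(granularity != 0);
    return n % granularity == 0;
}

// Rounds n up to the next multiple of granularity.
// Overflow for n close to SIZE_MAX is not handled in this sketch.
inline std::size_t round_up(std::size_t n, std::size_t granularity) {
    assert(granularity != 0);
    return ((n + granularity - 1) / granularity) * granularity;
}
```

For example, a requested ring size of 1000 bytes would round up to 1024, while a 6-byte region would fail the 4-byte check and need padding to 8.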
When producers write tracked regions on different streams, the capture stream should wait on a producer-completed event per region. Without dependencies, captures can observe partially-written data. The recorder supports per-region dependencies via CaptureDeps and cudaStreamWaitEvent.
See docs/multistream.md for the dependency model, trace stamps, and a minimal code snippet.
Example commands:
.\build\Release\tt_demo_multistream_stress.exe --deps
.\build\Release\tt_demo_multistream_stress.exe --no-deps
Use deterministic mode to ensure repeated runs generate identical per-epoch buffer hashes and manifest output:
.\build\Release\tt_demo.exe --deterministic --manifest-out=trace\tt_manifest.json
.\build\Release\tt_demo_determinism.exe
See docs/determinism.md for the full contract and limitations.
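As an illustration of why identical buffer contents yield identical per-epoch hashes, the sketch below hashes a host copy of a buffer with 64-bit FNV-1a. The choice of FNV-1a is an assumption made for this example only; this README does not state which hash the recorder uses, and fnv1a64 is a hypothetical helper.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// 64-bit FNV-1a over a byte range. Assumed hash for illustration only;
// the recorder's actual manifest hash is not specified in this README.
inline std::uint64_t fnv1a64(const void* data, std::size_t len) {
    assert(data != nullptr || len == 0);
    const unsigned char* bytes = static_cast<const unsigned char*>(data);
    std::uint64_t h = 0xcbf29ce484222325ULL;  // FNV offset basis
    for (std::size_t i = 0; i < len; ++i) {
        h ^= bytes[i];                        // mix in one byte
        h *= 0x100000001b3ULL;                // FNV prime, wraps mod 2^64
    }
    return h;
}
```

Because the hash depends only on the byte sequence, two runs that capture the same bytes in the same order produce the same value, and a single changed byte produces a different one with overwhelming probability.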
Verify a recorded run against a manifest and localize mismatches:
.\build\Release\tt.exe verify --manifest trace\tt_manifest_verify_demo.json --trace-annotate
See docs/verify.md for workflow details, limitations, and additional flags.
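Localizing a mismatch amounts to finding the first (epoch, region) pair whose recorded hash differs from the manifest. The sketch below shows that comparison under loud assumptions: RegionHash, Mismatch, and first_mismatch are hypothetical names, the real manifest schema is not shown in this README, and both lists are assumed to be sorted identically and to describe the same regions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical flattened manifest row: one hash per (epoch, region).
struct RegionHash {
    std::uint64_t epoch;
    std::uint32_t region;
    std::uint64_t hash;
};

// Result of a comparison; epoch and region are meaningful only if found.
struct Mismatch {
    bool found;
    std::uint64_t epoch;
    std::uint32_t region;
};

// Returns the first entry whose hash differs between manifest and run.
// Assumes both vectors list the same (epoch, region) pairs in the same order.
Mismatch first_mismatch(const std::vector<RegionHash>& expected,
                        const std::vector<RegionHash>& actual) {
    assert(expected.size() == actual.size());
    for (std::size_t i = 0; i < expected.size(); ++i) {
        if (expected[i].hash != actual[i].hash) {
            return Mismatch{true, expected[i].epoch, expected[i].region};
        }
    }
    return Mismatch{false, 0, 0};
}
```

Reporting the earliest differing epoch is useful because later epochs often diverge as a consequence of the first one.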