Reuse the cuStateVec handle across kernel executions by ikkoham · Pull Request #4746 · NVIDIA/cuda-quantum

ikkoham · 2026-06-16T07:53:24Z

Summary

`CuStateVecCircuitSimulator` calls `custatevecCreate` on every state-vector allocation and `custatevecDestroy` in `deallocateStateImpl`, so it creates and destroys the cuStateVec handle on every kernel execution. The handle is a device-level context, independent of any particular state vector, so it can be created once per device and reused. This PR adds `ensureHandle()`, which creates the handle lazily and recreates it only when the active CUDA device changes. It also drops the per-deallocation destroy and instead destroys the handle once, in the destructor.
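
The control flow described above can be sketched in Python. This is only an illustration: the actual change lives in the C++ simulator, and every name below (`fake_create`, `fake_destroy`, `set_device`, `SimulatorSketch`) is a hypothetical stand-in, not a cuStateVec or CUDA-Q binding. Destroying the stale handle when the device changes is an assumption of the sketch.

```python
import itertools

# Hypothetical stand-ins for custatevecCreate / custatevecDestroy and for the
# "active CUDA device" query; none of these are real cuStateVec bindings.
_ids = itertools.count(1)
created, destroyed = [], []
_state = {"device": 0}


def set_device(device_id):
    _state["device"] = device_id


def fake_create():
    handle = next(_ids)
    created.append(handle)
    return handle


def fake_destroy(handle):
    destroyed.append(handle)


class SimulatorSketch:
    """Caches one handle and rebuilds it only when the device changes."""

    def __init__(self):
        self.handle = None
        self.handle_device = None

    def ensure_handle(self):
        # Reuse the cached handle while the active device is unchanged.
        if self.handle is not None and self.handle_device == _state["device"]:
            return self.handle
        # First use or device change: drop any stale handle, then create one.
        if self.handle is not None:
            fake_destroy(self.handle)  # assumption: stale handle is released
        self.handle = fake_create()
        self.handle_device = _state["device"]
        return self.handle

    def execute_kernel(self):
        # Allocation path: obtain the handle, but do not destroy it afterwards.
        return self.ensure_handle()

    def close(self):
        # Mirrors the single destroy in the destructor.
        if self.handle is not None:
            fake_destroy(self.handle)
            self.handle = None


sim = SimulatorSketch()
handles = [sim.execute_kernel() for _ in range(100)]
print(len(created), len(destroyed))  # one create, no destroys so far
```

In this model, 100 executions on one device cost a single create, whereas the previous behavior corresponds to 100 create/destroy pairs.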

Benchmark

| Call | Before | After (speedup) |
|---|---|---|
| `get_state` | 2.53 ms | 0.16 ms (~15×) |
| `observe` (shots=0) | 2.60 ms | 0.25 ms (~10×) |
| `sample` (1k shots) | 2.53 ms | 0.53 ms (~4.7×) |
| `run` (100 shots) | 201 ms | 10 ms (~20×) |

These timings are warm: the first execution of each call is excluded.

Benchmark kernels

import time, cudaq
from cudaq import spin
NQ, THETA, LAYERS = 5, 0.5, 1

# for observe and get_state
@cudaq.kernel
def ansatz(n: int, theta: float, layers: int):
    q = cudaq.qvector(n)
    for _ in range(layers):
        h(q[0])
        for i in range(n - 1):
            x.ctrl(q[i], q[i + 1])
        ry(theta, q[0])

# for sample
@cudaq.kernel
def ansatz_measured(n: int, theta: float, layers: int):
    q = cudaq.qvector(n)
    for _ in range(layers):
        h(q[0])
        for i in range(n - 1):
            x.ctrl(q[i], q[i + 1])
        ry(theta, q[0])
    mz(q)

# for run
@cudaq.kernel
def ansatz_returning(n: int, theta: float, layers: int) -> bool:
    q = cudaq.qvector(n)
    for _ in range(layers):
        h(q[0])
        for i in range(n - 1):
            x.ctrl(q[i], q[i + 1])
        ry(theta, q[0])
    return mz(q[0])
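
The timing harness itself is not shown in this description. A minimal sketch of a warm measurement, assuming one discarded warm-up call and a median over repeated calls (the helper name `time_warm` and its defaults are hypothetical), could look like this:

```python
import statistics
import time


def time_warm(fn, repeats=20, warmup=1):
    """Return the median wall-clock time of fn() in milliseconds, after warm-up."""
    for _ in range(warmup):
        fn()  # discarded: absorbs one-time costs such as first-call setup
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1e3)
    return statistics.median(samples)


# Stand-in workload so the sketch runs without a GPU.
calls = []
elapsed_ms = time_warm(lambda: calls.append(sum(range(1000))), repeats=5)
print(f"{elapsed_ms:.3f} ms median over {len(calls) - 1} timed calls")
```

With CUDA-Q installed, `fn` could wrap one of the calls from the table, for example `lambda: cudaq.get_state(ansatz, NQ, THETA, LAYERS)` or `lambda: cudaq.sample(ansatz_measured, NQ, THETA, LAYERS, shots_count=1000)`.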

The cuStateVec handle create+destroy round trip costs ~1–2 ms (warm), and before this change it was paid on every launch. `run` benefits most because it re-executes the kernel once per shot, so a single call with N shots paid for N handle create/destroy pairs.
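
As a rough consistency check on the `run` row of the table, the time saved per shot can be computed from the reported figures. This assumes the whole before/after difference comes from handle creation and destruction:

```python
# Figures copied from the benchmark table for `run` (100 shots).
before_ms, after_ms, shots = 201.0, 10.0, 100

# Time removed per shot by no longer creating and destroying the handle.
saved_per_shot_ms = (before_ms - after_ms) / shots
print(f"{saved_per_shot_ms:.2f} ms saved per shot")
```

The result, about 1.9 ms per shot, falls within the ~1–2 ms round-trip cost quoted above.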

Signed-off-by: ikkoham <ikkoham@users.noreply.github.com>

copy-pr-bot · 2026-06-16T07:53:28Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

ikkoham · 2026-06-16T08:05:47Z

/ok to test 7588733

Command Bot: Processing...

github-actions · 2026-06-16T08:44:51Z

CI Summary (`push`) — ✅ passed

Run #27603401948 · ✅ 6 · ⏩ 7 · ❌ 0 · ⛔ 0

Top-level jobs (13)

| Job | Result |
|---|---|
| `binaries` | ⏩ skipped |
| `build_and_test` | ✅ success |
| `config_devdeps` | ✅ success |
| `config_source_build` | ⏩ skipped |
| `config_wheeldeps` | ✅ success |
| `devdeps` | ✅ success |
| `docker_image` | ⏩ skipped |
| `gen_code_coverage` | ⏩ skipped |
| `metadata` | ✅ success |
| `python_metapackages` | ⏩ skipped |
| `python_wheels` | ⏩ skipped |
| `source_build` | ⏩ skipped |
| `wheeldeps` | ✅ success |

⏩ Skipped jobs (7) — intentionally skipped on PR builds; run on merge_group / workflow_dispatch

Job
`binaries`
`config_source_build`
`docker_image`
`gen_code_coverage`
`python_metapackages`
`python_wheels`
`source_build`

All sub-jobs (42) — every matrix leg, with links

| Job | Status | Link |
|---|---|---|
| Build and test (amd64, gcc12, openmpi) / Dev environment (Debug) | ✅ success | view |
| Build and test (amd64, gcc12, openmpi) / Dev environment (Python) | ✅ success | view |
| Build and test (amd64, llvm, openmpi) / Dev environment (Debug) | ✅ success | view |
| Build and test (amd64, llvm, openmpi) / Dev environment (Python) | ✅ success | view |
| Build and test (arm64, llvm, openmpi) / Dev environment (Debug) | ✅ success | view |
| Build and test (arm64, llvm, openmpi) / Dev environment (Python) | ✅ success | view |
| CI Summary | ❔ in_progress | view |
| Configure build (devdeps) | ✅ success | view |
| Configure build (source_build) | ⏩ skipped | view |
| Configure build (wheeldeps) | ✅ success | view |
| Create CUDA Quantum installer | ⏩ skipped | view |
| Create Docker images | ⏩ skipped | view |
| Create Python metapackages | ⏩ skipped | view |
| Create Python wheels | ⏩ skipped | view |
| Gen code coverage | ⏩ skipped | view |
| Load dependencies (amd64, gcc12) / Caching | ✅ success | view |
| Load dependencies (amd64, gcc12) / Finalize | ✅ success | view |
| Load dependencies (amd64, gcc12) / Metadata | ✅ success | view |
| Load dependencies (amd64, llvm) / Caching | ✅ success | view |
| Load dependencies (amd64, llvm) / Finalize | ✅ success | view |
| Load dependencies (amd64, llvm) / Metadata | ✅ success | view |
| Load dependencies (arm64, gcc12) / Caching | ✅ success | view |
| Load dependencies (arm64, gcc12) / Finalize | ✅ success | view |
| Load dependencies (arm64, gcc12) / Metadata | ✅ success | view |
| Load dependencies (arm64, llvm) / Caching | ✅ success | view |
| Load dependencies (arm64, llvm) / Finalize | ✅ success | view |
| Load dependencies (arm64, llvm) / Metadata | ✅ success | view |
| Load source build cache | ⏩ skipped | view |
| Load wheel dependencies (amd64, 12.6) / Caching | ✅ success | view |
| Load wheel dependencies (amd64, 12.6) / Finalize | ✅ success | view |
| Load wheel dependencies (amd64, 12.6) / Metadata | ✅ success | view |
| Load wheel dependencies (amd64, 13.0) / Caching | ✅ success | view |
| Load wheel dependencies (amd64, 13.0) / Finalize | ✅ success | view |
| Load wheel dependencies (amd64, 13.0) / Metadata | ✅ success | view |
| Load wheel dependencies (arm64, 12.6) / Caching | ✅ success | view |
| Load wheel dependencies (arm64, 12.6) / Finalize | ✅ success | view |
| Load wheel dependencies (arm64, 12.6) / Metadata | ✅ success | view |
| Load wheel dependencies (arm64, 13.0) / Caching | ✅ success | view |
| Load wheel dependencies (arm64, 13.0) / Finalize | ✅ success | view |
| Load wheel dependencies (arm64, 13.0) / Metadata | ✅ success | view |
| Prepare cache clean-up | ❔ in_progress | view |
| Retrieve PR info | ✅ success | view |

✅ Required checks (6/6) — declared in .github/required-checks.yml for push

| Required check | Status | Link |
|---|---|---|
| Build and test (amd64, llvm, openmpi) / Dev environment (Debug) | ✅ success | view |
| Build and test (amd64, llvm, openmpi) / Dev environment (Python) | ✅ success | view |
| Build and test (arm64, llvm, openmpi) / Dev environment (Debug) | ✅ success | view |
| Build and test (arm64, llvm, openmpi) / Dev environment (Python) | ✅ success | view |
| Build and test (amd64, gcc12, openmpi) / Dev environment (Debug) | ✅ success | view |
| Build and test (amd64, gcc12, openmpi) / Dev environment (Python) | ✅ success | view |

1tnguyen

LGTM 👍 Thanks @ikkoham

[nvqir] Reuse the cuStateVec handle across kernel executions

7588733

Signed-off-by: ikkoham <ikkoham@users.noreply.github.com>

ikkoham added the performance label Jun 16, 2026

1tnguyen approved these changes Jun 17, 2026

ikkoham wants to merge 1 commit into NVIDIA:main from ikkoham:perf/cusv-handle-reuse

2 participants