Required prerequisites
Describe the bug
On the iqm target with emulate=True, sampling a 22-qubit kernel that applies two specific weight-4 exp_pauli terms takes over 4 hours, while each term alone, and seven other two-term circuits of the same shape, take about 2 minutes. The cost is shot-independent (100 and 5000 shots grind identically). The same kernel and arguments run in seconds on the nvidia and qpp-cpu targets.
perf on the live process shows 98.4% of cycles in qpp::applyCTRL (libnvqir-dm.so), and the flat ~8.7 GB RSS is consistent with a dense complex<double> operator over ~14-15 qubits (2^29 entries x 16 B, about 8.6 GB). The lowering therefore appears to fuse this particular pair into one very large dense block that is then applied elementwise, instead of the usual small-gate decomposition. That would also explain the shot-independence: the state is prepared once, and sampling from it is cheap.
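The memory estimate can be checked with a few lines of plain Python. This is only a back-of-the-envelope sketch: it assumes the 2^29 figure counts matrix entries and that each entry is a 16-byte complex<double>; `dense_operator_bytes` is a hypothetical helper, not part of CUDA-Q.

```python
# Back-of-the-envelope check of the RSS estimate (no cudaq needed).
BYTES_PER_ENTRY = 16  # assumed: complex<double> = two 8-byte doubles


def dense_operator_bytes(n_qubits: int) -> int:
    """Bytes needed for a dense 2^n x 2^n complex<double> matrix."""
    dim = 2 ** n_qubits
    return dim * dim * BYTES_PER_ENTRY


# Size implied by 2^29 entries, as quoted in the report.
observed_bytes = 2 ** 29 * BYTES_PER_ENTRY
print(f"2^29 entries: {observed_bytes / 1e9:.2f} GB")

# Dense operator sizes for the neighbouring qubit counts.
for n in (14, 15):
    print(f"{n}-qubit dense operator: {dense_operator_bytes(n) / 1e9:.2f} GB")
```

The 2^29-entry figure falls between the 14-qubit and 15-qubit dense operator sizes, which is why the estimate is stated as "~14-15 qubits" rather than a single number.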
Steps to reproduce the bug
import cudaq

# needs a valid IQM_TOKEN + network for the architecture fetch, even in emulate mode
cudaq.set_target("iqm", url="https://cocos.resonance.meetiqm.com/emerald", emulate=True)

@cudaq.kernel
def kernel(n_qubits: int, n_electrons: int, coeffs: list[float], words: list[cudaq.pauli_word]):
    q = cudaq.qvector(n_qubits)
    for i in range(n_electrons):
        x(q[i])
    for i in range(len(coeffs)):
        exp_pauli(coeffs[i], q, words[i])

words = [cudaq.pauli_word(w) for w in [
    "IIIIIIIIIIIIIIIIIIIIII",
    "IIIIXIIIIXIIIIYIIIIIIX",  # ~2 min alone
    "XIIIIIIIIIIIXIIIYIXIII",  # ~2 min alone; the pair grinds > 4 h
]]
coeffs = [1.0, 0.000178, 0.000815]

cudaq.sample(kernel, 22, 14, coeffs, words, shots_count=100)
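For triage, it may help to see which positions each Pauli word touches. The sketch below is plain Python and does not call CUDA-Q; `pauli_support` is a hypothetical helper, and it assumes that character i of a word corresponds to q[i], which is not verified here.

```python
# Print weight, support, and index span of each Pauli word (no cudaq needed).
def pauli_support(word: str) -> list[int]:
    """Return the string positions whose Pauli is not the identity."""
    return [i for i, p in enumerate(word) if p != "I"]


raw_words = [
    "IIIIIIIIIIIIIIIIIIIIII",
    "IIIIXIIIIXIIIIYIIIIIIX",
    "XIIIIIIIIIIIXIIIYIXIII",
]

for w in raw_words:
    support = pauli_support(w)
    # Span = number of positions between the first and last non-identity Pauli.
    span = support[-1] - support[0] + 1 if support else 0
    print(f"{w}: weight={len(support)} support={support} span={span}")

# Positions shared by the two non-trivial terms (empty set if disjoint).
shared = set(pauli_support(raw_words[1])) & set(pauli_support(raw_words[2]))
print("shared positions:", sorted(shared))
```

Under that indexing assumption, both non-trivial words have weight 4 and act on disjoint positions, so the slowdown does not come from the two terms overlapping on the same qubits.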
Expected behavior
The pair should execute in roughly the same ~2 minutes as its sibling pairs, or degrade gracefully.
Is this a regression? If it is, put the last known working version (or commit) here.
Unknown (observed on 0.14.0; other versions not tested).
Environment
- CUDA-Q version: cuda-quantum-cu12 0.14.0 (pip)
- Python version: 3.11
- C++ compiler: n/a (pip wheel)
- Operating system: Rocky Linux 8, CPU-only node
Suggestions
Can share timing logs and further reproducer variants if useful.