iqm target (emulate): specific pair of exp_pauli terms takes >4 h at 22 qubits (dense multi-qubit block in qpp::applyCTRL); each term alone ~2 min #4730

@KarimElgammal

Required prerequisites

  • Consult the security policy. If reporting a security vulnerability, do not report the bug using this form. Use the process described in the policy to report the issue.
  • Make sure you've read the documentation. Your issue may be addressed there.
  • Search the issue tracker to verify that this hasn't already been reported. +1 or comment there if it has.
  • If possible, make a PR with a failing test to give us a starting point to work on!

Describe the bug

On the iqm target with emulate=True, sampling a 22-qubit kernel that applies two specific weight-4 exp_pauli terms takes over 4 hours, while each term alone, and seven other two-term circuits of the same shape, take about 2 minutes. The cost is shot-independent (100 and 5000 shots grind identically). The same kernel and arguments run in seconds on the nvidia and qpp-cpu targets.

perf on the live process shows 98.4% of cycles in qpp::applyCTRL (libnvqir-dm.so), and the flat ~8.7 GB RSS matches a dense complex<double> operator with about 2^29 entries (2^29 x 16 B ≈ 8.6 GB), i.e. a matrix over ~14-15 qubits (4^14 = 2^28 and 4^15 = 2^30 entries). So the lowering appears to fuse this particular pair into one very large dense block that is then applied elementwise, instead of the usual small-gate decomposition. That would also explain the shot-independence: the state is prepared once, and sampling from it is cheap.
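The size estimate above can be checked with a few lines of arithmetic. This is only a back-of-envelope sketch: it assumes the whole RSS is one dense 2^k x 2^k matrix of 16-byte complex values and ignores the state itself and any allocator overhead. The helper name `dense_operator_bytes` is made up for illustration and is not part of CUDA-Q or Quantum++.

```python
import math

# Assumption: complex<double> entries, 16 bytes each.
BYTES_PER_COMPLEX128 = 16


def dense_operator_bytes(k: int) -> int:
    """Bytes needed for a full 2^k x 2^k complex matrix (4^k entries)."""
    dim = 2 ** k
    return dim * dim * BYTES_PER_COMPLEX128


# Compare candidate block sizes with the observed ~8.7 GB RSS.
observed_rss_bytes = 8.7e9
for k in (13, 14, 15):
    print(f"{k} qubits: {dense_operator_bytes(k) / 1e9:.2f} GB")

# Number of 16-byte entries implied by the RSS, as a power of two.
implied_log2_entries = math.log2(observed_rss_bytes / BYTES_PER_COMPLEX128)
print(f"entries implied by RSS: about 2^{implied_log2_entries:.1f}")
```

The observed footprint falls between the 14-qubit and 15-qubit sizes, which is why the estimate is given as a range and not as a single block width.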

Steps to reproduce the bug

import cudaq
# needs a valid IQM_TOKEN + network for the architecture fetch, even in emulate mode
cudaq.set_target("iqm", url="https://cocos.resonance.meetiqm.com/emerald", emulate=True)

@cudaq.kernel
def kernel(n_qubits: int, n_electrons: int, coeffs: list[float], words: list[cudaq.pauli_word]):
    q = cudaq.qvector(n_qubits)
    for i in range(n_electrons):
        x(q[i])
    for i in range(len(coeffs)):
        exp_pauli(coeffs[i], q, words[i])

words = [cudaq.pauli_word(w) for w in [
    "IIIIIIIIIIIIIIIIIIIIII",
    "IIIIXIIIIXIIIIYIIIIIIX",   # ~2 min alone
    "XIIIIIIIIIIIXIIIYIXIII",   # ~2 min alone; the pair grinds > 4 h
]]
coeffs = [1.0, 0.000178, 0.000815]
cudaq.sample(kernel, 22, 14, coeffs, words, shots_count=100)
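For comparing the slow pair against the sibling pairs, it may help to tabulate which qubits each word acts on. The following is a plain-Python sketch that does not call CUDA-Q; it assumes character i of the string corresponds to q[i], and the helper name `support` is hypothetical. It does not establish the cause of the slowdown, it only makes the two words easier to compare.

```python
def support(word: str) -> list[int]:
    """Indices of the non-identity Paulis in a Pauli word string."""
    return [i for i, p in enumerate(word) if p != "I"]


w1 = "IIIIXIIIIXIIIIYIIIIIIX"
w2 = "XIIIIIIIIIIIXIIIYIXIII"

s1, s2 = support(w1), support(w2)
union = sorted(set(s1) | set(s2))
overlap = sorted(set(s1) & set(s2))
# Contiguous index range covered by the pair, end points included.
span = union[-1] - union[0] + 1

print("term 1 support:", s1)
print("term 2 support:", s2)
print("union size:", len(union), "overlap:", overlap, "span:", span)
```

Running the same helper over the seven fast pairs would show whether the slow pair differs from them in union size, overlap, or span.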

Expected behavior

The pair should execute in roughly the same ~2 minutes as its sibling pairs, or degrade gracefully.

Is this a regression? If it is, put the last known working version (or commit) here.

Unknown (observed on 0.14.0; other versions not tested).

Environment

  • CUDA-Q version: cuda-quantum-cu12 0.14.0 (pip)
  • Python version: 3.11
  • C++ compiler: n/a (pip wheel)
  • Operating system: Rocky Linux 8, CPU-only node

Suggestions

Can share timing logs and further reproducer variants if useful.
