iqm target (emulate): specific pair of exp_pauli terms takes >4 h at 22 qubits (dense multi-qubit block in qpp::applyCTRL); each term alone ~2 min #4730

### Required prerequisites

- [x] Consult the [security policy](https://github.com/NVIDIA/cuda-quantum/security/policy). If reporting a security vulnerability, do not report the bug using this form. Use the process described in the policy to report the issue.
- [x] Make sure you've read the [documentation](https://nvidia.github.io/cuda-quantum/latest). Your issue may be addressed there.
- [x] Search the [issue tracker](https://github.com/NVIDIA/cuda-quantum/issues) to verify that this hasn't already been reported. +1 or comment there if it has.
- [ ] If possible, make a PR with a failing test to give us a starting point to work on!

### Describe the bug

On the `iqm` target with `emulate=True`, sampling a 22-qubit kernel that applies two specific weight-4 `exp_pauli` terms takes **over 4 hours**. Each term on its own takes about 2 minutes, as do seven other two-term circuits of the same shape. The cost is independent of the shot count: runs with 100 and 5000 shots stall in the same way. The same kernel and arguments run in seconds on the `nvidia` and `qpp-cpu` targets.

`perf` on the live process shows 98.4% of cycles in `qpp::applyCTRL` (`libnvqir-dm.so`). The flat ~8.7 GB RSS is consistent with a dense `complex<double>` array of about 2^29 elements (2^29 x 16 B ≈ 8.6 GB), i.e. an operator spanning roughly 14-15 qubits. So the lowering appears to fuse this particular pair into one very large dense block that is then applied elementwise, instead of the usual small-gate decomposition. The shot-independence follows from this: the state is prepared once, and sampling from it is cheap.
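As a rough cross-check of the memory estimate above, here is a minimal sketch of the arithmetic. It assumes 16 bytes per `complex<double>` element and a dense 2^k x 2^k operator on k qubits; the function names are illustrative only and are not part of CUDA-Q or qpp.

```python
BYTES_PER_ELEMENT = 16  # assumption: one complex<double> = 2 x 8-byte doubles


def dense_operator_bytes(k_qubits: int) -> int:
    """Bytes needed for a dense 2^k x 2^k complex<double> matrix."""
    dim = 2 ** k_qubits
    return dim * dim * BYTES_PER_ELEMENT


def state_vector_bytes(n_qubits: int) -> int:
    """Bytes needed for a 2^n complex<double> state vector."""
    return (2 ** n_qubits) * BYTES_PER_ELEMENT


if __name__ == "__main__":
    for k in (14, 15):
        print(f"dense operator on {k} qubits: {dense_operator_bytes(k) / 1e9:.2f} GB")
    print(f"2^29 elements: {(2 ** 29) * BYTES_PER_ELEMENT / 1e9:.2f} GB")
    print(f"22-qubit state vector: {state_vector_bytes(22) / 1e6:.1f} MB")
```

A dense operator on 14 qubits comes to about 4.3 GB and one on 15 qubits to about 17.2 GB, so the observed RSS sits between the two, while a 22-qubit state vector alone is only about 67 MB. This is why the RSS points at a large dense operator rather than at the state itself.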

### Steps to reproduce the bug

```python
import cudaq
# needs a valid IQM_TOKEN + network for the architecture fetch, even in emulate mode
cudaq.set_target("iqm", url="https://cocos.resonance.meetiqm.com/emerald", emulate=True)

@cudaq.kernel
def kernel(n_qubits: int, n_electrons: int, coeffs: list[float], words: list[cudaq.pauli_word]):
    q = cudaq.qvector(n_qubits)
    for i in range(n_electrons):
        x(q[i])
    for i in range(len(coeffs)):
        exp_pauli(coeffs[i], q, words[i])

words = [cudaq.pauli_word(w) for w in [
    "IIIIIIIIIIIIIIIIIIIIII",
    "IIIIXIIIIXIIIIYIIIIIIX",   # ~2 min alone
    "XIIIIIIIIIIIXIIIYIXIII",   # ~2 min alone; the pair grinds > 4 h
]]
coeffs = [1.0, 0.000178, 0.000815]
cudaq.sample(kernel, 22, 14, coeffs, words, shots_count=100)
```
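For comparing the single-term and two-term cases, a small timing harness along the following lines may help. This is only a sketch: `VARIANTS` and `time_variants` are hypothetical names, the `run` callable is supplied by the caller, and keeping the identity word in the single-term variants is an assumption about how the "alone" runs were set up.

```python
import time
from typing import Callable, Dict, List, Tuple

IDENTITY = "I" * 22
TERM_A = "IIIIXIIIIXIIIIYIIIIIIX"
TERM_B = "XIIIIIIIIIIIXIIIYIXIII"

# label -> (coeffs, words); coefficients are taken from the reproducer above
VARIANTS: Dict[str, Tuple[List[float], List[str]]] = {
    "A alone": ([1.0, 0.000178], [IDENTITY, TERM_A]),
    "B alone": ([1.0, 0.000815], [IDENTITY, TERM_B]),
    "A + B": ([1.0, 0.000178, 0.000815], [IDENTITY, TERM_A, TERM_B]),
}


def time_variants(
    run: Callable[[List[float], List[str]], object],
    variants: Dict[str, Tuple[List[float], List[str]]] = VARIANTS,
) -> Dict[str, float]:
    """Call run(coeffs, words) once per variant; return wall-clock seconds."""
    timings: Dict[str, float] = {}
    for label, (coeffs, words) in variants.items():
        start = time.perf_counter()
        run(coeffs, words)
        timings[label] = time.perf_counter() - start
    return timings
```

With the kernel above, `run` could be something like `lambda c, w: cudaq.sample(kernel, 22, 14, c, [cudaq.pauli_word(x) for x in w], shots_count=100)`. Note that the "A + B" variant is the one that runs for hours on the `iqm` emulator, so it is worth trying the harness on `qpp-cpu` first.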

### Expected behavior

The pair should execute in roughly the same ~2 minutes as its sibling pairs, or degrade gracefully.

### Is this a regression? If it is, put the last known working version (or commit) here.

Unknown (observed on 0.14.0; other versions not tested).

### Environment

- **CUDA-Q version**: cuda-quantum-cu12 0.14.0 (pip)
- **Python version**: 3.11
- **C++ compiler**: n/a (pip wheel)
- **Operating system**: Rocky Linux 8, CPU-only node

### Suggestions

I can share timing logs and further reproducer variants if useful.

