Tensornet target fails to initialize in Docker image when using remote-mqpu #2537

bmhowe23 · 2025-01-24T19:00:19Z

Required prerequisites

Consult the security policy. If reporting a security vulnerability, do not report the bug using this form. Use the process described in the policy to report the issue.
Make sure you've read the documentation. Your issue may be addressed there.
Search the issue tracker to verify that this hasn't already been reported. +1 or comment there if it has.
If possible, make a PR with a failing test to give us a starting point to work on!

Describe the bug

With the current Docker image, the following test fails. The source file is taken directly from the GitHub repo: https://github.com/NVIDIA/cuda-quantum/blob/main/docs/sphinx/snippets/python/using/cudaq/platform/sample_async_remote.py

cudaq@63a0dd312d4c:~$ wget https://raw.githubusercontent.com/NVIDIA/cuda-quantum/refs/heads/main/docs/sphinx/snippets/python/using/cudaq/platform/sample_async_remote.py
<SNIP>
cudaq@63a0dd312d4c:~$ python3 sample_async_remote.py
Number of virtual QPUs: 2
Sampling jobs launched for asynchronous processing.
cuTensorNet error CUTENSORNET_STATUS_DISTRIBUTED_FAILURE in line 92
[63a0dd312d4c:00095] *** Process received signal ***
[63a0dd312d4c:00095] Signal: Aborted (6)
[63a0dd312d4c:00095] Signal code:  (-6)
[63a0dd312d4c:00095] [ 0] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7fe4765ce520]
[63a0dd312d4c:00095] [ 1] /usr/lib/x86_64-linux-gnu/libc.so.6(pthread_kill+0x12c)[0x7fe4766229fc]
[63a0dd312d4c:00095] [ 2] /usr/lib/x86_64-linux-gnu/libc.so.6(raise+0x16)[0x7fe4765ce476]
[63a0dd312d4c:00095] [ 3] /usr/lib/x86_64-linux-gnu/libc.so.6(abort+0xd3)[0x7fe4765b47f3]
[63a0dd312d4c:00095] [ 4] /opt/nvidia/cudaq/lib/libnvqir-tensornet.so(_Z19initCuTensornetCommPv+0x18a)[0x7fe46c03eeca]
[63a0dd312d4c:00095] [ 5] /opt/nvidia/cudaq/lib/libnvqir-tensornet.so(getCircuitSimulator_tensornet+0x16d)[0x7fe46c08d1ad]
[63a0dd312d4c:00095] [ 6] /opt/nvidia/cudaq/lib/librest-remote-platform-server.so(_ZN5cudaq23getUniquePluginInstanceIN5nvqir16CircuitSimulatorEEEPT_St17basic_string_viewIcSt11char_traitsIcEEPKc+0xe0)[0x7fe47c29f1b0]
<SNIP>

Note that _Z19initCuTensornetCommPv (the mangled name of initCuTensornetComm(void*)) is in the section of code that is only enabled when MPI should be initialized.

The following command is a workaround but not a true fix.

sudo mv /opt/nvidia/cudaq/lib/plugins/libcudaq-comm-plugin.so /opt/nvidia/cudaq/lib/plugins/libcudaq-comm-plugin.so.bak

Steps to reproduce the bug

See above

Expected behavior

See above

Is this a regression? If it is, put the last known working version (or commit) here.

Not a regression

Environment

CUDA-Q version: 888fbfb
Python version:
C++ compiler:
Operating system:

Suggestions

No response


1tnguyen · 2025-01-28T06:04:38Z

The issue is that the CUTENSORNET_COMM_LIB environment variable in the Docker container is set to an invalid value.

This file should be built in cuda-quantum/docker/build/devdeps.ext.Dockerfile (lines 186 to 187 in bd49915):

    RUN cd "$CUQUANTUM_INSTALL_PREFIX/distributed_interfaces/" && source activate_mpi_cutn.sh
    ENV CUTENSORNET_COMM_LIB="$CUQUANTUM_INSTALL_PREFIX/distributed_interfaces/libcutensornet_distributed_interface_mpi.so"

I suspect that the change to cudaq.Dockerfile in #2391 has unintentionally skipped the migration of this libcutensornet_distributed_interface_mpi.so file to the destination image.

We could either restore the file migration or unset CUTENSORNET_COMM_LIB so that the built-in comm wrapper is used.
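
To confirm the diagnosis inside a container, a quick check along these lines may help. It is only a sketch: the helper name check_comm_lib is hypothetical and not part of CUDA-Q. It only reports whether CUTENSORNET_COMM_LIB is unset, points to an existing file, or points to a missing file.

```python
import os
from pathlib import Path


def check_comm_lib(env=None):
    """Classify CUTENSORNET_COMM_LIB as 'unset', 'ok', or 'missing'.

    `env` defaults to os.environ; a plain dict can be passed for testing.
    This helper is hypothetical and only inspects the filesystem.
    """
    env = os.environ if env is None else env
    value = env.get("CUTENSORNET_COMM_LIB")
    if not value:
        # Variable is absent or empty.
        return ("unset", None)
    if Path(value).is_file():
        # Variable points to a file that exists in this image.
        return ("ok", value)
    # Variable is set, but the file it names is not present.
    return ("missing", value)


if __name__ == "__main__":
    status, path = check_comm_lib()
    print(f"CUTENSORNET_COMM_LIB: {status} ({path})")
```

If this reports "missing" in the published image, that would be consistent with the .so file not having been copied to the destination image.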

bmhowe23 · 2025-01-28T15:41:20Z

Thanks for hunting this down, @1tnguyen. For a fix PR, it would be nice to add the example given above (which already exists in our released Docker image) to our publishing pipeline tests to help prevent it in the future.
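
As a rough illustration of what such a validation step could look like, the sketch below runs a snippet in a fresh interpreter and treats any non-zero exit status (including termination by a signal such as the abort above) as a failure. The function name run_snippet and the timeout value are assumptions for illustration; the actual pipeline check may be structured differently.

```python
import subprocess
import sys


def run_snippet(path, timeout=600):
    """Run a Python snippet in a separate process.

    Returns (ok, output). `ok` is False for any non-zero exit status;
    on POSIX, a process killed by a signal has a negative return code.
    Hypothetical helper, not part of the CUDA-Q test infrastructure.
    """
    proc = subprocess.run(
        [sys.executable, path],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return proc.returncode == 0, proc.stdout + proc.stderr


if __name__ == "__main__":
    # Usage: python3 this_file.py sample_async_remote.py
    ok, output = run_snippet(sys.argv[1])
    print(output)
    sys.exit(0 if ok else 1)
```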

bmhowe23 changed the title ~~Tensornet target fails to initialize in Docker image~~ Tensornet target fails to initialize in Docker image when using remote-mqpu Jan 24, 2025

1tnguyen self-assigned this Jan 24, 2025

1tnguyen linked a pull request Feb 3, 2025 that will close this issue

Remove CUTENSORNET_COMM_LIB default activation and add snippet validation to Publishing container validation #2565

Open
