Tensornet target fails to initialize in Docker image when using remote-mqpu #2537

Open
3 of 4 tasks
bmhowe23 opened this issue Jan 24, 2025 · 2 comments · May be fixed by #2565

@bmhowe23
Collaborator

bmhowe23 commented Jan 24, 2025

Required prerequisites

  • Consult the security policy. If reporting a security vulnerability, do not report the bug using this form. Use the process described in the policy to report the issue.
  • Make sure you've read the documentation. Your issue may be addressed there.
  • Search the issue tracker to verify that this hasn't already been reported. +1 or comment there if it has.
  • If possible, make a PR with a failing test to give us a starting point to work on!

Describe the bug

With the current Docker image, the following test fails. The source file is taken unmodified from the GitHub repo: https://github.com/NVIDIA/cuda-quantum/blob/main/docs/sphinx/snippets/python/using/cudaq/platform/sample_async_remote.py

cudaq@63a0dd312d4c:~$ wget https://raw.githubusercontent.com/NVIDIA/cuda-quantum/refs/heads/main/docs/sphinx/snippets/python/using/cudaq/platform/sample_async_remote.py
<SNIP>
cudaq@63a0dd312d4c:~$ python3 sample_async_remote.py
Number of virtual QPUs: 2
Sampling jobs launched for asynchronous processing.
cuTensorNet error CUTENSORNET_STATUS_DISTRIBUTED_FAILURE in line 92
[63a0dd312d4c:00095] *** Process received signal ***
[63a0dd312d4c:00095] Signal: Aborted (6)
[63a0dd312d4c:00095] Signal code:  (-6)
[63a0dd312d4c:00095] [ 0] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7fe4765ce520]
[63a0dd312d4c:00095] [ 1] /usr/lib/x86_64-linux-gnu/libc.so.6(pthread_kill+0x12c)[0x7fe4766229fc]
[63a0dd312d4c:00095] [ 2] /usr/lib/x86_64-linux-gnu/libc.so.6(raise+0x16)[0x7fe4765ce476]
[63a0dd312d4c:00095] [ 3] /usr/lib/x86_64-linux-gnu/libc.so.6(abort+0xd3)[0x7fe4765b47f3]
[63a0dd312d4c:00095] [ 4] /opt/nvidia/cudaq/lib/libnvqir-tensornet.so(_Z19initCuTensornetCommPv+0x18a)[0x7fe46c03eeca]
[63a0dd312d4c:00095] [ 5] /opt/nvidia/cudaq/lib/libnvqir-tensornet.so(getCircuitSimulator_tensornet+0x16d)[0x7fe46c08d1ad]
[63a0dd312d4c:00095] [ 6] /opt/nvidia/cudaq/lib/librest-remote-platform-server.so(_ZN5cudaq23getUniquePluginInstanceIN5nvqir16CircuitSimulatorEEEPT_St17basic_string_viewIcSt11char_traitsIcEEPKc+0xe0)[0x7fe47c29f1b0]
<SNIP>

Note that _Z19initCuTensornetCommPv (which demangles to initCuTensornetComm(void*)) is in a section of code that is only enabled when MPI should be initialized.

The following command is a workaround but not a true fix.

sudo mv /opt/nvidia/cudaq/lib/plugins/libcudaq-comm-plugin.so /opt/nvidia/cudaq/lib/plugins/libcudaq-comm-plugin.so.bak

Steps to reproduce the bug

See above

Expected behavior

See above

Is this a regression? If it is, put the last known working version (or commit) here.

Not a regression

Environment

  • CUDA-Q version: 888fbfb
  • Python version:
  • C++ compiler:
  • Operating system:

Suggestions

No response

@bmhowe23 bmhowe23 changed the title Tensornet target fails to initialize in Docker image Tensornet target fails to initialize in Docker image when using remote-mqpu Jan 24, 2025
@1tnguyen 1tnguyen self-assigned this Jan 24, 2025
@1tnguyen
Collaborator

The issue is that the CUTENSORNET_COMM_LIB environment variable in the Docker container is set to an invalid value.

The library it points to should be built by the following Dockerfile steps:

RUN cd "$CUQUANTUM_INSTALL_PREFIX/distributed_interfaces/" && source activate_mpi_cutn.sh
ENV CUTENSORNET_COMM_LIB="$CUQUANTUM_INSTALL_PREFIX/distributed_interfaces/libcutensornet_distributed_interface_mpi.so"

I suspect that the change to cudaq.Dockerfile in #2391 has unintentionally skipped the migration of this libcutensornet_distributed_interface_mpi.so file to the destination image.
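
One way to confirm this diagnosis from inside a container is to check whether CUTENSORNET_COMM_LIB names a file that actually exists. The following is a minimal sketch, not part of CUDA-Q; the helper name `describe_comm_lib` and its three status strings are hypothetical and chosen only for illustration.

```python
import os
from pathlib import Path


def describe_comm_lib(env=None):
    """Classify CUTENSORNET_COMM_LIB as 'unset', 'ok', or 'missing'.

    `env` is any mapping of environment variables; it defaults to the
    current process environment. This helper is illustrative only.
    """
    env = os.environ if env is None else env
    value = env.get("CUTENSORNET_COMM_LIB")
    if value is None:
        return "unset"
    # The variable is expected to hold the path of a shared library.
    if Path(value).is_file():
        return "ok"
    return "missing"
```

Calling `describe_comm_lib()` in the affected image would be expected to report "missing" if the suspicion above is right, since the variable would then point at a library that was never copied into the final image.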

We could either restore the file migration or unset CUTENSORNET_COMM_LIB, in which case the built-in comm wrapper is used.
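
For local experimentation, the second option can be approximated without rebuilding the image by launching the script with the variable removed from the child environment. This is only a sketch: the helper name `env_with_comm_lib_unset` is hypothetical, and whether removing the variable at this level is sufficient for the remote-mqpu platform is an assumption that the thread does not confirm.

```python
import os


def env_with_comm_lib_unset(env=None):
    """Return a copy of `env` with CUTENSORNET_COMM_LIB removed.

    `env` defaults to the current process environment. The original
    mapping is left untouched. Illustrative helper, not a CUDA-Q API.
    """
    cleaned = dict(os.environ if env is None else env)
    cleaned.pop("CUTENSORNET_COMM_LIB", None)
    return cleaned
```

The result can be passed as the `env` argument of `subprocess.run`, for example when starting `python3 sample_async_remote.py`. A real fix would still belong in the Docker image itself.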

@bmhowe23
Collaborator Author

Thanks for hunting this down, @1tnguyen. For a fix PR, it would be nice to add the example given above (which already exists in our released Docker image) to our publishing pipeline tests to help prevent this failure from recurring.
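
A pipeline check along these lines could be sketched as follows. This is a hedged illustration, not the project's actual test harness: the function name `run_and_check`, the `FAILURE_MARKERS` tuple, and the default timeout are assumptions, and the marker strings are simply taken from the log in the report above.

```python
import subprocess
import sys

# Strings copied from the failing log in this issue; treat them as
# illustrative markers, not an exhaustive list of failure signatures.
FAILURE_MARKERS = (
    "CUTENSORNET_STATUS_DISTRIBUTED_FAILURE",
    "Process received signal",
)


def run_and_check(cmd, timeout=600):
    """Run `cmd` and return (passed, combined_output).

    The run passes only if the process exits with status 0 and none of
    the known failure markers appear in its stdout or stderr.
    """
    proc = subprocess.run(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge stderr into stdout
        text=True,
        timeout=timeout,
    )
    output = proc.stdout
    passed = proc.returncode == 0 and not any(
        marker in output for marker in FAILURE_MARKERS
    )
    return passed, output
```

Inside the published image, a test step could call `run_and_check([sys.executable, "sample_async_remote.py"])` and fail the pipeline when `passed` is false. An abort such as the one in the report would be caught by the non-zero exit status alone; the marker scan is a second line of defense.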
