This repository has been archived by the owner on Oct 19, 2024. It is now read-only.
System information and environment
OS Platform and Distribution (e.g., Linux Ubuntu 16.04, docker): Linux Ubuntu 18.04
Python version: 3.9
CUDA version: 11.1
NCCL version: 2.8.4
cupy version: 11.1
GPU model and memory: RTX 2080, 11264 MiB
Alpa version: 0.2.3
TensorFlow version:
JAX version: 0.3.22
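For reference, the version details above can be collected with a small stdlib-only script (the package names queried below are assumptions; adjust them to what is installed in the environment):

```python
import platform
from importlib import metadata

def pkg_version(name):
    """Return the installed version of a distribution, or a placeholder."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return "(not installed)"

print("Python version:", platform.python_version())
# Distribution names are assumptions; e.g. cupy may be installed as
# a CUDA-specific wheel such as cupy-cuda111.
for pkg in ("alpa", "jax", "cupy", "tensorflow"):
    print(f"{pkg} version:", pkg_version(pkg))
```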
To Reproduce
Steps to reproduce the behavior:
1. Run python3 -m alpa.test_install
2. Observe the error below
Screenshots
If applicable, add screenshots to help explain your problem.
2023-06-17 22:59:20,085 INFO worker.py:1342 -- Connecting to existing Ray cluster at address: 155.69.142.146:6379...
2023-06-17 22:59:20,120 INFO worker.py:1528 -- Connected to Ray cluster.
(raylet) [2023-06-17 22:59:27,687 E 25332 25478] (raylet) file_system_monitor.cc:105: /tmp/ray/session_2023-06-17_22-09-42_273283_25013 is over 95% full, available space: 21533958144; capacity: 730542596096. Object creation will fail if spilling is required.
Exception ignored in: <function PipeshardDriverExecutable.__del__ at 0x7fe295cbc940>
Traceback (most recent call last):
File "/home/gaowei/miniconda3/envs/alpa/lib/python3.8/site-packages/alpa/pipeline_parallel/pipeshard_executable.py", line 434, in __del__
2023-06-17 22:59:29,665 ERROR worker.py:400 -- Unhandled error (suppress with 'RAY_IGNORE_UNHANDLED_ERRORS=1'): ray::MeshHostWorker.init_p2p_communicator() (pid=16323, ip=155.69.142.146, repr=<alpa.device_mesh.MeshHostWorker object at 0x7fbdcf679430>)
File "/home/gaowei/miniconda3/envs/alpa/lib/python3.8/site-packages/alpa/device_mesh.py", line 391, in init_p2p_communicator
g.create_p2p_communicator(my_gpu_idx, peer_rank, peer_gpu_idx, nccl_uid)
File "/home/gaowei/miniconda3/envs/alpa/lib/python3.8/site-packages/alpa/collective/collective_group/nccl_collective_group.py", line 662, in create_p2p_communicator
self._get_nccl_p2p_communicator(comm_key, my_gpu_idx, peer_rank,
File "/home/gaowei/miniconda3/envs/alpa/lib/python3.8/site-packages/alpa/collective/collective_group/nccl_collective_group.py", line 532, in _get_nccl_p2p_communicator
comm = nccl_util.create_nccl_communicator(2, nccl_uid, my_p2p_rank)
File "/home/gaowei/miniconda3/envs/alpa/lib/python3.8/site-packages/alpa/collective/collective_group/nccl_util.py", line 115, in create_nccl_communicator
comm = NcclCommunicator(world_size, nccl_unique_id, rank)
File "cupy_backends/cuda/libs/nccl.pyx", line 283, in cupy_backends.cuda.libs.nccl.NcclCommunicator.__init__
File "cupy_backends/cuda/libs/nccl.pyx", line 129, in cupy_backends.cuda.libs.nccl.check_status
cupy_backends.cuda.libs.nccl.NcclError: NCCL_ERROR_UNHANDLED_CUDA_ERROR: unhandled cuda error
mesh.delete_remote_executable(self.exec_uuid)
AttributeError: 'PipeshardDriverExecutable' object has no attribute 'exec_uuid'
======================================================================
ERROR: test_2_pipeline_parallel (__main__.InstallationTest)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/home/gaowei/miniconda3/envs/alpa/lib/python3.8/site-packages/alpa/test_install.py", line 65, in <module>
runner.run(suite())
File "/home/gaowei/miniconda3/envs/alpa/lib/python3.8/unittest/runner.py", line 176, in run
test(result)
File "/home/gaowei/miniconda3/envs/alpa/lib/python3.8/unittest/suite.py", line 84, in __call__
return self.run(*args, **kwds)
File "/home/gaowei/miniconda3/envs/alpa/lib/python3.8/unittest/suite.py", line 122, in run
test(result)
File "/home/gaowei/miniconda3/envs/alpa/lib/python3.8/unittest/case.py", line 736, in __call__
return self.run(*args, **kwds)
File "/home/gaowei/miniconda3/envs/alpa/lib/python3.8/unittest/case.py", line 676, in run
self._callTestMethod(testMethod)
File "/home/gaowei/miniconda3/envs/alpa/lib/python3.8/unittest/case.py", line 633, in _callTestMethod
method()
File "/home/gaowei/miniconda3/envs/alpa/lib/python3.8/site-packages/alpa/test_install.py", line 49, in test_2_pipeline_parallel
actual_output = p_train_step(state, batch)
File "/home/gaowei/miniconda3/envs/alpa/lib/python3.8/site-packages/jax/_src/traceback_util.py", line 162, in reraise_with_filtered_traceback
return fun(*args, **kwargs)
File "/home/gaowei/miniconda3/envs/alpa/lib/python3.8/site-packages/alpa/api.py", line 121, in __call__
self._decode_args_and_get_executable(*args))
File "/home/gaowei/miniconda3/envs/alpa/lib/python3.8/site-packages/alpa/api.py", line 191, in _decode_args_and_get_executable
executable = _compile_parallel_executable(f, in_tree, out_tree_hashable,
File "/home/gaowei/miniconda3/envs/alpa/lib/python3.8/site-packages/jax/linear_util.py", line 309, in memoized_fun
ans = call(fun, *args)
File "/home/gaowei/miniconda3/envs/alpa/lib/python3.8/site-packages/alpa/api.py", line 223, in _compile_parallel_executable
return method.compile_executable(fun, in_tree, out_tree_thunk,
File "/home/gaowei/miniconda3/envs/alpa/lib/python3.8/site-packages/alpa/parallel_method.py", line 240, in compile_executable
return compile_pipeshard_executable(
File "/home/gaowei/miniconda3/envs/alpa/lib/python3.8/site-packages/alpa/pipeline_parallel/compile_executable.py", line 118, in compile_pipeshard_executable
executable = PipeshardDriverExecutable(
File "/home/gaowei/miniconda3/envs/alpa/lib/python3.8/site-packages/alpa/pipeline_parallel/pipeshard_executable.py", line 105, in __init__
task.create_resharding_communicators()
File "/home/gaowei/miniconda3/envs/alpa/lib/python3.8/site-packages/alpa/pipeline_parallel/cross_mesh_resharding.py", line 292, in create_resharding_communicators
ray.get(task_dones)
File "/home/gaowei/miniconda3/envs/alpa/lib/python3.8/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
return func(*args, **kwargs)
File "/home/gaowei/miniconda3/envs/alpa/lib/python3.8/site-packages/ray/_private/worker.py", line 2289, in get
raise value.as_instanceof_cause()
jax._src.traceback_util.UnfilteredStackTrace: ray.exceptions.RayTaskError(NcclError): ray::MeshHostWorker.init_p2p_communicator() (pid=16322, ip=155.69.142.146, repr=<alpa.device_mesh.MeshHostWorker object at 0x7f9ab240f460>)
File "/home/gaowei/miniconda3/envs/alpa/lib/python3.8/site-packages/alpa/device_mesh.py", line 391, in init_p2p_communicator
g.create_p2p_communicator(my_gpu_idx, peer_rank, peer_gpu_idx, nccl_uid)
File "/home/gaowei/miniconda3/envs/alpa/lib/python3.8/site-packages/alpa/collective/collective_group/nccl_collective_group.py", line 662, in create_p2p_communicator
self._get_nccl_p2p_communicator(comm_key, my_gpu_idx, peer_rank,
File "/home/gaowei/miniconda3/envs/alpa/lib/python3.8/site-packages/alpa/collective/collective_group/nccl_collective_group.py", line 532, in _get_nccl_p2p_communicator
comm = nccl_util.create_nccl_communicator(2, nccl_uid, my_p2p_rank)
File "/home/gaowei/miniconda3/envs/alpa/lib/python3.8/site-packages/alpa/collective/collective_group/nccl_util.py", line 115, in create_nccl_communicator
comm = NcclCommunicator(world_size, nccl_unique_id, rank)
File "cupy_backends/cuda/libs/nccl.pyx", line 283, in cupy_backends.cuda.libs.nccl.NcclCommunicator.__init__
File "cupy_backends/cuda/libs/nccl.pyx", line 129, in cupy_backends.cuda.libs.nccl.check_status
cupy_backends.cuda.libs.nccl.NcclError: NCCL_ERROR_UNHANDLED_CUDA_ERROR: unhandled cuda error
The stack trace below excludes JAX-internal frames.
The preceding is the original exception that occurred, unmodified.
--------------------
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/gaowei/miniconda3/envs/alpa/lib/python3.8/site-packages/alpa/test_install.py", line 49, in test_2_pipeline_parallel
actual_output = p_train_step(state, batch)
File "/home/gaowei/miniconda3/envs/alpa/lib/python3.8/site-packages/alpa/pipeline_parallel/compile_executable.py", line 118, in compile_pipeshard_executable
executable = PipeshardDriverExecutable(
File "/home/gaowei/miniconda3/envs/alpa/lib/python3.8/site-packages/alpa/pipeline_parallel/cross_mesh_resharding.py", line 292, in create_resharding_communicators
ray.get(task_dones)
File "/home/gaowei/miniconda3/envs/alpa/lib/python3.8/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
return func(*args, **kwargs)
File "/home/gaowei/miniconda3/envs/alpa/lib/python3.8/site-packages/ray/_private/worker.py", line 2289, in get
raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(NcclError): ray::MeshHostWorker.init_p2p_communicator() (pid=16322, ip=155.69.142.146, repr=<alpa.device_mesh.MeshHostWorker object at 0x7f9ab240f460>)
File "/home/gaowei/miniconda3/envs/alpa/lib/python3.8/site-packages/alpa/device_mesh.py", line 391, in init_p2p_communicator
g.create_p2p_communicator(my_gpu_idx, peer_rank, peer_gpu_idx, nccl_uid)
File "/home/gaowei/miniconda3/envs/alpa/lib/python3.8/site-packages/alpa/collective/collective_group/nccl_collective_group.py", line 662, in create_p2p_communicator
self._get_nccl_p2p_communicator(comm_key, my_gpu_idx, peer_rank,
File "/home/gaowei/miniconda3/envs/alpa/lib/python3.8/site-packages/alpa/collective/collective_group/nccl_collective_group.py", line 532, in _get_nccl_p2p_communicator
comm = nccl_util.create_nccl_communicator(2, nccl_uid, my_p2p_rank)
File "/home/gaowei/miniconda3/envs/alpa/lib/python3.8/site-packages/alpa/collective/collective_group/nccl_util.py", line 115, in create_nccl_communicator
comm = NcclCommunicator(world_size, nccl_unique_id, rank)
File "cupy_backends/cuda/libs/nccl.pyx", line 283, in cupy_backends.cuda.libs.nccl.NcclCommunicator.__init__
File "cupy_backends/cuda/libs/nccl.pyx", line 129, in cupy_backends.cuda.libs.nccl.check_status
cupy_backends.cuda.libs.nccl.NcclError: NCCL_ERROR_UNHANDLED_CUDA_ERROR: unhandled cuda error
----------------------------------------------------------------------
Ran 2 tests in 20.923s
FAILED (errors=1)
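Besides the primary NCCL failure, the log also shows a secondary AttributeError: PipeshardDriverExecutable.__init__ raised before exec_uuid was assigned, so __del__ later touched an attribute that never existed. This is a general Python pitfall, sketched below with hypothetical names (Resource stands in for the executable class, and the guarded __del__ is one possible mitigation, not Alpa's actual code):

```python
class Resource:
    def __init__(self):
        self.connect_or_fail()   # raises before exec_uuid is assigned
        self.exec_uuid = 42

    def connect_or_fail(self):
        # Stands in for the NCCL communicator setup that failed.
        raise RuntimeError("init failed")

    def __del__(self):
        # __del__ runs even when __init__ raised partway through,
        # so exec_uuid may not exist; getattr avoids the AttributeError.
        uuid = getattr(self, "exec_uuid", None)
        if uuid is not None:
            print(f"releasing {uuid}")

try:
    Resource()
except RuntimeError as e:
    caught = str(e)
```

With the getattr guard, finalizing the partially-constructed object is silent; without it, the interpreter reports "Exception ignored in ... __del__", exactly as seen in the log above.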
Code snippet to reproduce the problem
Additional information
Add any other context about the problem here or include any logs that would be helpful to diagnose the problem.