Output when num_samplers > 1:

Client[7] in group[0] is exiting...
[rank0]: Traceback (most recent call last):
[rank0]: File "/home/ubuntu/workspace/test/dgl/examples/distributed/graphsage/node_classification.py", line 467, in <module>
[rank0]: main(args)
[rank0]: File "/home/ubuntu/workspace/test/dgl/examples/distributed/graphsage/node_classification.py", line 408, in main
[rank0]: epoch_time, test_acc = run(args, device, data)
[rank0]: File "/home/ubuntu/workspace/test/dgl/examples/distributed/graphsage/node_classification.py", line 233, in run
[rank0]: model = th.nn.parallel.DistributedDataParallel(model)
[rank0]: File "/opt/conda/envs/test/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 873, in __init__
[rank0]: optimize_ddp = torch._dynamo.config._get_optimize_ddp_mode()
[rank0]: File "/opt/conda/envs/test/lib/python3.10/site-packages/torch/__init__.py", line 2003, in __getattr__
[rank0]: return importlib.import_module(f".{name}", __name__)
[rank0]: File "/opt/conda/envs/test/lib/python3.10/importlib/__init__.py", line 126, in import_module
[rank0]: return _bootstrap._gcd_import(name[level:], package, level)
[rank0]: File "<frozen importlib._bootstrap>", line 1050, in _gcd_import
[rank0]: File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
[rank0]: File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
[rank0]: File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
[rank0]: File "<frozen importlib._bootstrap_external>", line 883, in exec_module
[rank0]: File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
[rank0]: File "/opt/conda/envs/test/lib/python3.10/site-packages/torch/_dynamo/__init__.py", line 64, in <module>
[rank0]: torch.manual_seed = disable(torch.manual_seed)
[rank0]: File "/opt/conda/envs/test/lib/python3.10/site-packages/torch/_dynamo/decorators.py", line 50, in disable
[rank0]: Traceback (most recent call last):
[rank0]: File "/opt/conda/envs/test/lib/python3.10/multiprocessing/queues.py", line 244, in _feed
[rank0]: obj = _ForkingPickler.dumps(obj)
[rank0]: File "/opt/conda/envs/test/lib/python3.10/multiprocessing/reduction.py", line 51, in dumps
[rank0]: cls(buf, protocol).dump(obj)
[rank0]: File "/opt/conda/envs/test/lib/python3.10/site-packages/torch/multiprocessing/reductions.py", line 295, in reduce_tensor
[rank0]: from torch.nested._internal.nested_tensor import NestedTensor
[rank0]: File "/opt/conda/envs/test/lib/python3.10/site-packages/torch/nested/_internal/nested_tensor.py", line 416, in <module>
[rank0]: _nt_view_dummy = NestedTensor(
[rank0]: File "/opt/conda/envs/test/lib/python3.10/site-packages/torch/nested/_internal/nested_tensor.py", line 112, in __init__
[rank0]: torch._dynamo.mark_dynamic(self, self._ragged_idx)
[rank0]: File "/opt/conda/envs/test/lib/python3.10/site-packages/torch/__init__.py", line 2003, in __getattr__
[rank0]: return importlib.import_module(f".{name}", __name__)
[rank0]: File "/opt/conda/envs/test/lib/python3.10/importlib/__init__.py", line 126, in import_module
[rank0]: return _bootstrap._gcd_import(name[level:], package, level)
[rank0]: File "/opt/conda/envs/test/lib/python3.10/site-packages/torch/_dynamo/__init__.py", line 64, in <module>
[rank0]: torch.manual_seed = disable(torch.manual_seed)
[rank0]: File "/opt/conda/envs/test/lib/python3.10/site-packages/torch/_dynamo/decorators.py", line 50, in disable
[rank0]: return DisableContext()(fn)
[rank0]: File "/opt/conda/envs/test/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 410, in __call__
[rank0]: (filename is None or trace_rules.check(fn))
[rank0]: File "/opt/conda/envs/test/lib/python3.10/site-packages/torch/_dynamo/trace_rules.py", line 3378, in check
[rank0]: return check_verbose(obj, is_inlined_call).skipped
[rank0]: File "/opt/conda/envs/test/lib/python3.10/site-packages/torch/_dynamo/trace_rules.py", line 3361, in check_verbose
[rank0]: rule = torch._dynamo.trace_rules.lookup_inner(
[rank0]: AttributeError: partially initialized module 'torch._dynamo' has no attribute 'trace_rules' (most likely due to a circular import)
[rank0]: return DisableContext()(fn)
[rank0]: File "/opt/conda/envs/test/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 410, in __call__
[rank0]: (filename is None or trace_rules.check(fn))
[rank0]: File "/opt/conda/envs/test/lib/python3.10/site-packages/torch/_dynamo/trace_rules.py", line 3378, in check
[rank0]: return check_verbose(obj, is_inlined_call).skipped
[rank0]: File "/opt/conda/envs/test/lib/python3.10/site-packages/torch/_dynamo/trace_rules.py", line 3361, in check_verbose
[rank0]: rule = torch._dynamo.trace_rules.lookup_inner(
[rank0]: File "/opt/conda/envs/test/lib/python3.10/site-packages/torch/_dynamo/trace_rules.py", line 3442, in lookup_inner
[rank0]: rule = get_torch_obj_rule_map().get(obj, None)
[rank0]: File "/opt/conda/envs/test/lib/python3.10/site-packages/torch/_dynamo/trace_rules.py", line 2782, in get_torch_obj_rule_map
[rank0]: obj = load_object(k)
[rank0]: File "/opt/conda/envs/test/lib/python3.10/site-packages/torch/_dynamo/trace_rules.py", line 2811, in load_object
[rank0]: val = _load_obj_from_str(x[0])
[rank0]: File "/opt/conda/envs/test/lib/python3.10/site-packages/torch/_dynamo/trace_rules.py", line 2795, in _load_obj_from_str
[rank0]: return getattr(importlib.import_module(module), obj_name)
[rank0]: File "/opt/conda/envs/test/lib/python3.10/importlib/__init__.py", line 126, in import_module
[rank0]: return _bootstrap._gcd_import(name[level:], package, level)
[rank0]: File "<frozen importlib._bootstrap>", line 1050, in _gcd_import
[rank0]: File "<frozen importlib._bootstrap>", line 1024, in _find_and_load
[rank0]: File "<frozen importlib._bootstrap>", line 171, in __enter__
[rank0]: File "<frozen importlib._bootstrap>", line 116, in acquire
[rank0]: _frozen_importlib._DeadlockError: deadlock detected by _ModuleLock('torch.nested._internal.nested_tensor') at 139822324158512
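The lazy attribute import the traceback passes through (torch/__init__.py's __getattr__ calling importlib.import_module for torch._dynamo) is the PEP 562 module-__getattr__ pattern. Below is a minimal, torch-free sketch of that pattern using toy names (pkg, pkg.sub — these are made up for illustration). When two threads hit such a lazy import at once — here, the main thread inside DistributedDataParallel.__init__ and the multiprocessing queue feeder thread pickling a tensor — one of them can trip Python's per-module import lock, which is what the _DeadlockError above reports.

```python
import importlib
import sys
import types

# Build a tiny in-memory package "pkg" with a lazily imported submodule.
pkg = types.ModuleType("pkg")
pkg.__path__ = []  # mark it as a package

def lazy_getattr(name):
    # Mirrors torch.__getattr__: import "pkg.<name>" on first attribute access.
    return importlib.import_module(f".{name}", "pkg")

pkg.__getattr__ = lazy_getattr  # PEP 562 module-level __getattr__
sys.modules["pkg"] = pkg

sub = types.ModuleType("pkg.sub")
sub.answer = 42
sys.modules["pkg.sub"] = sub  # pre-registered, so import_module finds it

# Attribute access on the package resolves the submodule lazily.
print(pkg.sub.answer)  # → 42
```

Single-threaded, this is harmless; the failure mode in the traceback needs a second thread entering the same lazy import while the first holds the module lock.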
Expected behavior
If num_samplers = 0, launch.py runs normally, as follows:
ip-172-31-8-126: Initializing PyTorch process group.
ip-172-31-8-126: Initializing PyTorch process group.
ip-172-31-8-78: Initializing PyTorch process group.
ip-172-31-8-78: Initializing PyTorch process group.
ip-172-31-8-126: Initializing DistGraph.
ip-172-31-8-126: Initializing DistGraph.
ip-172-31-8-78: Initializing DistGraph.
ip-172-31-8-78: Initializing DistGraph.
Rank of ip-172-31-8-126: 1
Rank of ip-172-31-8-126: 0
Rank of ip-172-31-8-78: 3
Rank of ip-172-31-8-78: 2
part 1, train: 49154 (local: 47319), val: 9831 (local: 9138), test: 553273 (local: 536468)
part 3, train: 49153 (local: 49153), val: 9830 (local: 9830), test: 553272 (local: 553272)
part 2, train: 49154 (local: 49154), val: 9831 (local: 9831), test: 553273 (local: 553273)
part 0, train: 49154 (local: 47266), val: 9831 (local: 9111), test: 553273 (local: 536743)
Number of classes: 47
Number of classes: 47
Number of classes: 47
Number of classes: 47
Part 2 | Epoch 00001 | Step 00019 | Loss 2.9661 | Train Acc 0.2500 | Speed (samples/sec) 6013.4149 | GPU 0.0 MB | Mean step time 0.172 s
Part 3 | Epoch 00001 | Step 00019 | Loss 2.9177 | Train Acc 0.2730 | Speed (samples/sec) 6048.2241 | GPU 0.0 MB | Mean step time 0.171 s
Part 1 | Epoch 00001 | Step 00019 | Loss 3.3244 | Train Acc 0.1600 | Speed (samples/sec) 7101.2867 | GPU 0.0 MB | Mean step time 0.144 s
Part 0 | Epoch 00001 | Step 00019 | Loss 3.4020 | Train Acc 0.1820 | Speed (samples/sec) 7192.1353 | GPU 0.0 MB | Mean step time 0.145 s
Part 0 | Epoch 00001 | Step 00039 | Loss 2.5745 | Train Acc 0.3300 | Speed (samples/sec) 7365.2213 | GPU 0.0 MB | Mean step time 0.137 s
Part 2 | Epoch 00001 | Step 00039 | Loss 2.0382 | Train Acc 0.4800 | Speed (samples/sec) 6019.3000 | GPU 0.0 MB | Mean step time 0.168 s
Part 3 | Epoch 00001 | Step 00039 | Loss 1.9999 | Train Acc 0.4910 | Speed (samples/sec) 6766.9660 | GPU 0.0 MB | Mean step time 0.150 s
Part 1 | Epoch 00001 | Step 00039 | Loss 2.5951 | Train Acc 0.3280 | Speed (samples/sec) 7600.3897 | GPU 0.0 MB | Mean step time 0.133 s
Part 1, Epoch Time(s): 10.8078, sample+data_copy: 3.9922, forward: 0.6804, backward: 1.3248, update: 0.0594, #seeds: 49154, #inputs: 8293615
Part 2, Epoch Time(s): 10.8265, sample+data_copy: 2.5530, forward: 0.4508, backward: 4.4623, update: 0.0752, #seeds: 49154, #inputs: 8172341
Part 3, Epoch Time(s): 10.8258, sample+data_copy: 2.9187, forward: 0.5210, backward: 3.9289, update: 0.0421, #seeds: 49153, #inputs: 8155047
Environment
DGL Version (e.g., 1.0): 2.4
Backend Library & Version (e.g., PyTorch 0.4.1, MXNet/Gluon 1.3): torch 2.3.1+cu121
🐛 Bug
When we run dgl/tools/launch.py, it returns Failures if num_samplers = 1 and raises _frozen_importlib._DeadlockError when num_samplers > 1.
To Reproduce
Steps to reproduce the behavior:
1. When num_samplers = 1, run:
python3 dgl/tools/launch.py \
  --workspace dgl/examples/distributed/graphsage/ \
  --num_trainers 2 \
  --num_samplers 1 \
  --num_servers 1 \
  --part_config data/ogbn-products.json \
  --ip_config ip_config.txt \
  --num_omp_threads 16 \
  "python node_classification.py --graph_name ogbn-products --ip_config ip_config.txt --num_epochs 30 --batch_size 1000"
output:
Traceback (most recent call last):
2. When num_samplers > 1:
output: the traceback ending in _frozen_importlib._DeadlockError shown at the top of this report.
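Since the failure occurs while torch._dynamo and torch.nested._internal.nested_tensor are being imported lazily from two threads at once, one speculative mitigation (an assumption on our part, not verified on this setup) is to finish those imports eagerly in each process, before the model is wrapped in DistributedDataParallel and before sampler workers start feeding queues. A small sketch, with the module names taken from the traceback:

```python
import importlib

def preimport(modname):
    """Eagerly import a module so later concurrent, lazy attribute
    access cannot race Python's per-module import lock."""
    try:
        return importlib.import_module(modname)
    except ImportError:
        # Module unavailable in this build; nothing to pre-warm.
        return None

# Hypothetical placement: near the top of node_classification.py,
# before DistributedDataParallel and before sampler processes exist.
for mod in ("torch._dynamo", "torch.nested._internal.nested_tensor"):
    preimport(mod)
```

This only sidesteps the race by moving both imports to a single-threaded point in startup; it does not fix the underlying circular-import issue in torch.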
Environment
How you installed DGL (conda, pip, source): pip