
DistDGL v2.4 Training Error when num_samplers>0 #7753

Open
CfromBU opened this issue Aug 28, 2024 · 4 comments

CfromBU commented Aug 28, 2024

🐛 Bug

When we run dgl/tools/launch.py, training fails with torch.distributed.elastic's ChildFailedError when num_samplers = 1, and crashes with a _frozen_importlib._DeadlockError when num_samplers > 1.

To Reproduce

Steps to reproduce the behavior:

1. When num_samplers = 1:

run python3 dgl/tools/launch.py \
--workspace dgl/examples/distributed/graphsage/ \
--num_trainers 2 \
--num_samplers 1 \
--num_servers 1 \
--part_config data/ogbn-products.json \
--ip_config ip_config.txt \
--num_omp_threads 16 \
"python node_classification.py --graph_name ogbn-products --ip_config ip_config.txt --num_epochs 30 --batch_size 1000"

output:
Traceback (most recent call last):

File "/opt/conda/envs/test/lib/python3.10/runpy.py", line 196, in _run_module_as_main
  return _run_code(code, main_globals, None,
File "/opt/conda/envs/test/lib/python3.10/runpy.py", line 86, in _run_code
  exec(code, run_globals)
File "/opt/conda/envs/test/lib/python3.10/site-packages/torch/distributed/run.py", line 883, in <module>
  main()
File "/opt/conda/envs/test/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
  return f(*args, **kwargs)
File "/opt/conda/envs/test/lib/python3.10/site-packages/torch/distributed/run.py", line 879, in main
  run(args)
File "/opt/conda/envs/test/lib/python3.10/site-packages/torch/distributed/run.py", line 870, in run
  elastic_launch(
File "/opt/conda/envs/test/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
  return launch_agent(self._config, self._entrypoint, list(args))
File "/opt/conda/envs/test/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
  raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
node_classification.py FAILED
------------------------------------------------------------
Failures:
[1]:
time      : 2024-08-28_07:31:12
host      : ip-172-31-8-78.us-west-2.compute.internal
rank      : 3 (local_rank: 1)
exitcode  : 1 (pid: 17020)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time      : 2024-08-28_07:31:12
host      : ip-172-31-8-78.us-west-2.compute.internal
rank      : 2 (local_rank: 0)
exitcode  : 1 (pid: 17019)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
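
The elastic report above only points at the generic error-propagation docs because error_file is <N/A>. A minimal sketch of how the failing rank's real traceback could be surfaced, using PyTorch's documented @record decorator; the main() skeleton below is a placeholder, not the actual node_classification.py code:

```python
# Sketch only: wrapping the entry point with torch.distributed.elastic's
# @record makes the child process write its real traceback to an error file,
# instead of the "error_file: <N/A>" shown in the report above.
from torch.distributed.elastic.multiprocessing.errors import record


@record
def main():
    # the existing node_classification.py training code would go here
    pass


if __name__ == "__main__":
    main()
```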

2. When num_samplers > 1:

run python3 dgl/tools/launch.py \
--workspace dgl/examples/distributed/graphsage/ \
--num_trainers 2 \
--num_samplers 2 \
--num_servers 1 \
--part_config data/ogbn-products.json \
--ip_config ip_config.txt \
--num_omp_threads 16 \
"python node_classification.py --graph_name ogbn-products --ip_config ip_config.txt --num_epochs 30 --batch_size 1000"

output:

Client[7] in group[0] is exiting...
[rank0]: Traceback (most recent call last):
[rank0]:   File "/home/ubuntu/workspace/test/dgl/examples/distributed/graphsage/node_classification.py", line 467, in <module>
[rank0]:     main(args)
[rank0]:   File "/home/ubuntu/workspace/test/dgl/examples/distributed/graphsage/node_classification.py", line 408, in main
[rank0]:     epoch_time, test_acc = run(args, device, data)
[rank0]:   File "/home/ubuntu/workspace/test/dgl/examples/distributed/graphsage/node_classification.py", line 233, in run
[rank0]:     model = th.nn.parallel.DistributedDataParallel(model)
[rank0]:   File "/opt/conda/envs/test/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 873, in __init__
[rank0]:     optimize_ddp = torch._dynamo.config._get_optimize_ddp_mode()
[rank0]:   File "/opt/conda/envs/test/lib/python3.10/site-packages/torch/__init__.py", line 2003, in __getattr__
[rank0]:     return importlib.import_module(f".{name}", __name__)
[rank0]:   File "/opt/conda/envs/test/lib/python3.10/importlib/__init__.py", line 126, in import_module
[rank0]:     return _bootstrap._gcd_import(name[level:], package, level)
[rank0]:   File "<frozen importlib._bootstrap>", line 1050, in _gcd_import
[rank0]:   File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
[rank0]:   File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
[rank0]:   File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
[rank0]:   File "<frozen importlib._bootstrap_external>", line 883, in exec_module
[rank0]:   File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
[rank0]:   File "/opt/conda/envs/test/lib/python3.10/site-packages/torch/_dynamo/__init__.py", line 64, in <module>
[rank0]:     torch.manual_seed = disable(torch.manual_seed)
[rank0]:   File "/opt/conda/envs/test/lib/python3.10/site-packages/torch/_dynamo/decorators.py", line 50, in disable
[rank0]: Traceback (most recent call last):
[rank0]:   File "/opt/conda/envs/test/lib/python3.10/multiprocessing/queues.py", line 244, in _feed
[rank0]:     obj = _ForkingPickler.dumps(obj)
[rank0]:   File "/opt/conda/envs/test/lib/python3.10/multiprocessing/reduction.py", line 51, in dumps
[rank0]:     cls(buf, protocol).dump(obj)
[rank0]:   File "/opt/conda/envs/test/lib/python3.10/site-packages/torch/multiprocessing/reductions.py", line 295, in reduce_tensor
[rank0]:     from torch.nested._internal.nested_tensor import NestedTensor
[rank0]:   File "/opt/conda/envs/test/lib/python3.10/site-packages/torch/nested/_internal/nested_tensor.py", line 416, in <module>
[rank0]:     _nt_view_dummy = NestedTensor(
[rank0]:   File "/opt/conda/envs/test/lib/python3.10/site-packages/torch/nested/_internal/nested_tensor.py", line 112, in __init__
[rank0]:     torch._dynamo.mark_dynamic(self, self._ragged_idx)
[rank0]:   File "/opt/conda/envs/test/lib/python3.10/site-packages/torch/__init__.py", line 2003, in __getattr__
[rank0]:     return importlib.import_module(f".{name}", __name__)
[rank0]:   File "/opt/conda/envs/test/lib/python3.10/importlib/__init__.py", line 126, in import_module
[rank0]:     return _bootstrap._gcd_import(name[level:], package, level)
[rank0]:   File "/opt/conda/envs/test/lib/python3.10/site-packages/torch/_dynamo/__init__.py", line 64, in <module>
[rank0]:     torch.manual_seed = disable(torch.manual_seed)
[rank0]:   File "/opt/conda/envs/test/lib/python3.10/site-packages/torch/_dynamo/decorators.py", line 50, in disable
[rank0]:     return DisableContext()(fn)
[rank0]:   File "/opt/conda/envs/test/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 410, in __call__
[rank0]:     (filename is None or trace_rules.check(fn))
[rank0]:   File "/opt/conda/envs/test/lib/python3.10/site-packages/torch/_dynamo/trace_rules.py", line 3378, in check
[rank0]:     return check_verbose(obj, is_inlined_call).skipped
[rank0]:   File "/opt/conda/envs/test/lib/python3.10/site-packages/torch/_dynamo/trace_rules.py", line 3361, in check_verbose
[rank0]:     rule = torch._dynamo.trace_rules.lookup_inner(
[rank0]: AttributeError: partially initialized module 'torch._dynamo' has no attribute 'trace_rules' (most likely due to a circular import)
[rank0]:     return DisableContext()(fn)
[rank0]:   File "/opt/conda/envs/test/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 410, in __call__
[rank0]:     (filename is None or trace_rules.check(fn))
[rank0]:   File "/opt/conda/envs/test/lib/python3.10/site-packages/torch/_dynamo/trace_rules.py", line 3378, in check
[rank0]:     return check_verbose(obj, is_inlined_call).skipped
[rank0]:   File "/opt/conda/envs/test/lib/python3.10/site-packages/torch/_dynamo/trace_rules.py", line 3361, in check_verbose
[rank0]:     rule = torch._dynamo.trace_rules.lookup_inner(
[rank0]:   File "/opt/conda/envs/test/lib/python3.10/site-packages/torch/_dynamo/trace_rules.py", line 3442, in lookup_inner
[rank0]:     rule = get_torch_obj_rule_map().get(obj, None)
[rank0]:   File "/opt/conda/envs/test/lib/python3.10/site-packages/torch/_dynamo/trace_rules.py", line 2782, in get_torch_obj_rule_map
[rank0]:     obj = load_object(k)
[rank0]:   File "/opt/conda/envs/test/lib/python3.10/site-packages/torch/_dynamo/trace_rules.py", line 2811, in load_object
[rank0]:     val = _load_obj_from_str(x[0])
[rank0]:   File "/opt/conda/envs/test/lib/python3.10/site-packages/torch/_dynamo/trace_rules.py", line 2795, in _load_obj_from_str
[rank0]:     return getattr(importlib.import_module(module), obj_name)
[rank0]:   File "/opt/conda/envs/test/lib/python3.10/importlib/__init__.py", line 126, in import_module
[rank0]:     return _bootstrap._gcd_import(name[level:], package, level)
[rank0]:   File "<frozen importlib._bootstrap>", line 1050, in _gcd_import
[rank0]:   File "<frozen importlib._bootstrap>", line 1024, in _find_and_load
[rank0]:   File "<frozen importlib._bootstrap>", line 171, in __enter__
[rank0]:   File "<frozen importlib._bootstrap>", line 116, in acquire
[rank0]: _frozen_importlib._DeadlockError: deadlock detected by _ModuleLock('torch.nested._internal.nested_tensor') at 139822324158512
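
From the trace, the main thread lazily imports torch._dynamo (triggered inside DistributedDataParallel.__init__) while a multiprocessing queue feeder thread lazily imports torch.nested._internal.nested_tensor (triggered by pickling a tensor), and the two imports deadlock on each other's module locks. A hedged mitigation sketch, not a verified fix: finish both imports eagerly on the main thread at the top of the training script, before the DistDataLoader with sampler workers starts any feeder threads.

```python
# Mitigation sketch (untested assumption, not from the issue): eagerly import
# the two modules the traceback shows being imported lazily from different
# threads, so their initialization cannot race once sampler workers and
# queue feeder threads start pickling tensors.
import torch
import torch._dynamo  # noqa: F401  -- finish torch._dynamo init on the main thread
import torch.nested._internal.nested_tensor  # noqa: F401  -- private module named in the traceback
```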

Expected behavior

If num_samplers = 0, launch.py runs normally and produces output like the following:

ip-172-31-8-126: Initializing PyTorch process group.
ip-172-31-8-126: Initializing PyTorch process group.
ip-172-31-8-78: Initializing PyTorch process group.
ip-172-31-8-78: Initializing PyTorch process group.
ip-172-31-8-126: Initializing DistGraph.
ip-172-31-8-126: Initializing DistGraph.
ip-172-31-8-78: Initializing DistGraph.
ip-172-31-8-78: Initializing DistGraph.
Rank of ip-172-31-8-126: 1
Rank of ip-172-31-8-126: 0
Rank of ip-172-31-8-78: 3
Rank of ip-172-31-8-78: 2
part 1, train: 49154 (local: 47319), val: 9831 (local: 9138), test: 553273 (local: 536468)
part 3, train: 49153 (local: 49153), val: 9830 (local: 9830), test: 553272 (local: 553272)
part 2, train: 49154 (local: 49154), val: 9831 (local: 9831), test: 553273 (local: 553273)
part 0, train: 49154 (local: 47266), val: 9831 (local: 9111), test: 553273 (local: 536743)
Number of classes: 47
Number of classes: 47
Number of classes: 47
Number of classes: 47
Part 2 | Epoch 00001 | Step 00019 | Loss 2.9661 | Train Acc 0.2500 | Speed (samples/sec) 6013.4149 | GPU 0.0 MB | Mean step time 0.172 s
Part 3 | Epoch 00001 | Step 00019 | Loss 2.9177 | Train Acc 0.2730 | Speed (samples/sec) 6048.2241 | GPU 0.0 MB | Mean step time 0.171 s
Part 1 | Epoch 00001 | Step 00019 | Loss 3.3244 | Train Acc 0.1600 | Speed (samples/sec) 7101.2867 | GPU 0.0 MB | Mean step time 0.144 s
Part 0 | Epoch 00001 | Step 00019 | Loss 3.4020 | Train Acc 0.1820 | Speed (samples/sec) 7192.1353 | GPU 0.0 MB | Mean step time 0.145 s
Part 0 | Epoch 00001 | Step 00039 | Loss 2.5745 | Train Acc 0.3300 | Speed (samples/sec) 7365.2213 | GPU 0.0 MB | Mean step time 0.137 s
Part 2 | Epoch 00001 | Step 00039 | Loss 2.0382 | Train Acc 0.4800 | Speed (samples/sec) 6019.3000 | GPU 0.0 MB | Mean step time 0.168 s
Part 3 | Epoch 00001 | Step 00039 | Loss 1.9999 | Train Acc 0.4910 | Speed (samples/sec) 6766.9660 | GPU 0.0 MB | Mean step time 0.150 s
Part 1 | Epoch 00001 | Step 00039 | Loss 2.5951 | Train Acc 0.3280 | Speed (samples/sec) 7600.3897 | GPU 0.0 MB | Mean step time 0.133 s
Part 1, Epoch Time(s): 10.8078, sample+data_copy: 3.9922, forward: 0.6804, backward: 1.3248, update: 0.0594, #seeds: 49154, #inputs: 8293615
Part 2, Epoch Time(s): 10.8265, sample+data_copy: 2.5530, forward: 0.4508, backward: 4.4623, update: 0.0752, #seeds: 49154, #inputs: 8172341
Part 3, Epoch Time(s): 10.8258, sample+data_copy: 2.9187, forward: 0.5210, backward: 3.9289, update: 0.0421, #seeds: 49153, #inputs: 8155047

Environment

  • DGL Version (e.g., 1.0): 2.4
  • Backend Library & Version (e.g., PyTorch 0.4.1, MXNet/Gluon 1.3): torch 2.3.1+cu121
  • OS (e.g., Linux): Ubuntu 20.04
  • How you installed DGL (conda, pip, source): pip
  • Build command you used (if compiling from source): pip install --pre dgl -f https://data.dgl.ai/wheels-test/torch-2.3/cu121/repo.html
  • Python version: 3.10.14
  • CUDA/cuDNN version (if applicable): 12.2
  • GPU models and configuration (e.g. V100): Tesla T4
  • Any other relevant information:
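
A quick way to confirm the versions listed above (a small sketch added here, not part of the original report; all attributes used are standard DGL/PyTorch ones):

```python
# Print the library and CUDA versions relevant to this report.
import dgl
import torch

print("DGL:", dgl.__version__)            # reported: 2.4
print("PyTorch:", torch.__version__)      # reported: 2.3.1+cu121
print("CUDA (torch build):", torch.version.cuda)
print("GPU:", torch.cuda.get_device_name(0) if torch.cuda.is_available() else "none")
```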

CfromBU commented Aug 28, 2024

DGL v2.1.0 runs normally, but DGL v2.2.0 returns this error.


sg0 commented Aug 28, 2024

More context here: https://discuss.dgl.ai/t/distdgl-v2-4-built-from-source-training-error/4474/12


CfromBU commented Aug 30, 2024

GraphSAGE + ogbn-products + num_samplers = 2:

dgl v2.3 + torch v2.1: works well.
dgl v2.3 + torch v2.2: not tested.
dgl v2.3 + torch v2.3: crashed with this issue.
dgl v2.4 + torch v2.4: works well.

This issue has been automatically marked as stale due to lack of activity. It will be closed if no further activity occurs. Thank you.
