what devices are supported? #155

Open
Guodanding opened this issue Oct 8, 2024 · 14 comments

Guodanding commented Oct 8, 2024

Hello, I have tried to use megablocks on a V100 with PyTorch 2.4.0+cu121, but I get an error saying "cannot support bf16". If I use megablocks in fp32, I get the error "group gemm must use bf16". So I changed my environment to an A6000, and then got this error:

[rank1]:   File "/usr/local/lib/python3.11/dist-packages/megablocks/ops/sort.py", line 34, in forward
[rank1]:     ops.sort(x, end_bit, x_out, iota_out)
[rank1]: RuntimeError: no kernel image is available for execution on the device

So I am wondering: what devices are supported? Only A100/A800 and H100/H800?

mvpatel2000 (Contributor)

We have only tested on A100/H100 at this time. It is likely possible to get it working on other hardware with some tweaks.

Guodanding (Author)

OK, thanks! I will check it.

Guodanding reopened this Oct 9, 2024
Guodanding (Author)

After I changed my environment to an A100, the issue still exists. I have no idea what to do.

[rank0]:   File "/usr/local/lib/python3.11/dist-packages/megablocks/layers/moe.py", line 468, in forward
[rank0]:     out = self.experts(x, scores, expert_weights, top_experts)
[rank0]:           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/usr/local/lib/python3.11/dist-packages/megablocks/layers/moe.py", line 429, in forward
[rank0]:     x, tokens_per_expert = self.forward_fn(x, expert_weights, top_experts)
[rank0]:                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/usr/local/lib/python3.11/dist-packages/megablocks/layers/moe.py", line 262, in parallel_forward_once
[rank0]:     indices, bin_ids, bins, tokens_per_expert = (self.indices_and_bins(top_experts))
[rank0]:                                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/usr/local/lib/python3.11/dist-packages/megablocks/layers/moe.py", line 161, in indices_and_bins
[rank0]:     output = ops.sort(top_expert, self.sort_end_bit)
[rank0]:              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/usr/local/lib/python3.11/dist-packages/torch/autograd/function.py", line 574, in apply
[rank0]:     return super().apply(*args, **kwargs)  # type: ignore[misc]
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/usr/local/lib/python3.11/dist-packages/megablocks/ops/sort.py", line 34, in forward
[rank0]:     ops.sort(x, end_bit, x_out, iota_out)
[rank0]: RuntimeError: no kernel image is available for execution on the device

This is my environment (nvidia-smi):

root@I1cedf606ff00701a02:~/workspace/DynamicMoE# nvidia-smi
Wed Oct  9 11:16:58 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.86.10              Driver Version: 535.86.10    CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100-PCIE-40GB          On  | 00000000:40:00.0 Off |                    0 |
| N/A   34C    P0              36W / 250W |      4MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

pytorch:

Name: torch
Version: 2.4.0+cu121
Summary: Tensors and Dynamic neural networks in Python with strong GPU acceleration
Home-page: https://pytorch.org/
Author: PyTorch Team
Author-email: [email protected]
License: BSD-3
Location: /usr/local/lib/python3.11/dist-packages
Requires: filelock, fsspec, jinja2, networkx, nvidia-cublas-cu12, nvidia-cuda-cupti-cu12, nvidia-cuda-nvrtc-cu12, nvidia-cuda-runtime-cu12, nvidia-cudnn-cu12, nvidia-cufft-cu12, nvidia-curand-cu12, nvidia-cusolver-cu12, nvidia-cusparse-cu12, nvidia-nccl-cu12, nvidia-nvtx-cu12, sympy, triton, typing-extensions
Required-by: megablocks, stanford-stk, torchaudio, torchvision

And I have installed the grouped GEMM utils:

pip install git+https://github.com/tgale96/grouped_gemm@main

Guodanding (Author)

After I reinstalled megablocks via

pip install megablocks[all]

instead of installing from source (which is what I had done before), the issue disappeared.
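
(Side note for anyone hitting the same error: "no kernel image is available for execution on the device" usually means the compiled CUDA extension was not built for your GPU's compute capability, which would explain why the prebuilt wheel works while a source build does not. A quick diagnostic sketch:)

import torch

# V100 reports (7, 0), A100 (8, 0), A6000 (8, 6), H100 (9, 0);
# the installed kernels must cover this architecture.
print(torch.cuda.get_device_capability(0))

When building from source, exporting TORCH_CUDA_ARCH_LIST (e.g. "8.0;8.6") before running pip install is the usual convention for PyTorch CUDA extensions, assuming the megablocks/grouped_gemm setup scripts honor it.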

mvpatel2000 (Contributor)

Hm... maybe something is wrong with the grouped GEMM wheels...? @tgale96, any ideas here?

Guodanding (Author)

OK!
And what is the difference between the "grouped" impl and the "sparse" impl? Which one gives better throughput?

mvpatel2000 (Contributor)

They are different implementations of a grouped matmul. Grouped is a series of iterated kernel calls with standard matmuls, whereas sparse is the block-sparse strategy described in the MegaBlocks paper: https://arxiv.org/abs/2211.15841
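
To make the distinction concrete, here is a minimal PyTorch sketch with made-up toy shapes (not the actual kernels): the grouped path is conceptually one standard matmul per expert group, which the grouped GEMM library fuses into a single kernel launch, while the sparse path expresses the same computation as one block-sparse matmul.

import torch

# Hypothetical toy shapes: 3 experts, hidden size 8, FFN size 16.
num_experts, hidden, ffn = 3, 8, 16
w = torch.randn(num_experts, hidden, ffn)        # per-expert weights
tokens_per_expert = [4, 1, 2]                    # result of routing
x = torch.randn(sum(tokens_per_expert), hidden)  # tokens already sorted by expert

# "Grouped" intuition: iterate standard matmuls, one per expert group.
# (grouped_gemm performs these in a single fused kernel launch.)
outs, start = [], 0
for e, n in enumerate(tokens_per_expert):
    outs.append(x[start:start + n] @ w[e])
    start += n
y = torch.cat(outs)  # shape [7, 16]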

Guodanding (Author) commented Oct 12, 2024

But why is the grouped one recommended over the sparse one for Hopper GPUs?

Installing megablocks[gg] enables dMoE computation with grouped GEMM. 
This feature is enabled by setting the mlp_impl argument to grouped. 
This is currently our recommended path for Hopper-generation GPUs.

mvpatel2000 (Contributor)

We have observed better performance on H100s with grouped GEMM. This may not be true with the latest versions of Triton.
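
(For anyone comparing the two paths on their own hardware: switching is a one-line config change. A minimal single-GPU sketch, reusing the Arguments/dMoE API from the repro script below and assuming megablocks[gg] is installed for the grouped path:)

import torch
from megablocks import Arguments, dMoE

for impl in ('sparse', 'grouped'):  # 'grouped' requires the grouped_gemm extra
    args = Arguments()
    args.bf16 = True
    args.fp16 = False
    args.mlp_impl = impl
    args.device = 'cuda'
    model = dMoE(args).to('cuda')
    x = torch.rand((2, 4, 1024)).to(torch.bfloat16).to('cuda')
    out = model(x)  # time this call to compare throughput between the two impls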

Guodanding (Author)

Thanks!

Guodanding (Author) commented Oct 13, 2024

There is another issue.
I have tried a dMoE forward pass, which works on my 1xA100 device but hangs (without crashing) on my 2xA6000 device. Both are launched by

torchrun --standalone --nnodes=1 --nproc-per-node=[1 or 2] [my python script]

And this is my Python script:

from megablocks import dMoE
from megablocks import Arguments

import os

import torch
import torch.distributed as dist

rank = int(os.environ["RANK"])
world_size = int(os.environ["WORLD_SIZE"])
print(f"Running example on {rank=} in a world with {world_size=}")
dist.init_process_group("nccl")
torch.cuda.set_device(rank)
num_node_devices = torch.cuda.device_count()
rank_lists = list(range(0, num_node_devices))
ep_groups = dist.new_group(rank_lists)
device = f'cuda:{rank}'

args = Arguments()
args.moe_expert_model_parallelism = True
args.expert_parallel_group  = ep_groups
args.device = device
args.bf16 = True
args.fp16 = False
args.mlp_impl = 'grouped'

model = dMoE(args)
model.to(device)

print("build up model success!")

x = torch.rand((2, 4, 1024), dtype=torch.float32).to(torch.bfloat16)
x = x.to(device)

torch.manual_seed(42)
torch.cuda.manual_seed(42)
logits = model(x)

if rank == 0:
    print("didn't crash yay!")

I tried to debug, and found that the process stops here:

# megablocks/layers/moe.py --> class ParallelMLP --> func parallel_forward_once --> lines 288

        # This view updates the shape of the tensor from [sl, bs, hs] to
        # [sl * bs, hs] prior to the permutation.
        x = x.view(-1, x.shape[-1])
        print("pass!")
        output = ops.gather(x, indices, bin_ids, bins, self.top_k)
        print("break!")
        assert output is not None
        x = output

The output is:

Running example on rank=1 in a world with world_size=2
Running example on rank=0 in a world with world_size=2
build up model success!
build up model success!
pass!
pass!

Is there anything wrong with ops.gather? Or is more information needed?

Guodanding (Author)

OK, I have tried it on a 2xA100 device and the issue disappears. It seems that some tweaks are needed to adapt it to devices other than A100/H100.

mvpatel2000 (Contributor)

Hm... I'm not super sure what happens if you are on one GPU -- it might be some error with torch dist initialization that is buried. I unfortunately do not have other GPUs to test on, but if you manage to narrow it down I'm happy to help fix it.
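
(One way to narrow it down, purely as a diagnostic sketch unrelated to megablocks itself: run a bare NCCL collective under the same torchrun launch on the 2xA6000 box. If this also hangs, the problem is in the distributed/P2P setup rather than in ops.gather.)

import os
import torch
import torch.distributed as dist

# Launch the same way as the repro:
# torchrun --standalone --nnodes=1 --nproc-per-node=2 this_script.py
rank = int(os.environ["RANK"])
dist.init_process_group("nccl")
torch.cuda.set_device(rank)

# If this all_reduce also hangs, the issue is the NCCL / P2P setup on the
# A6000 machine (e.g. try NCCL_P2P_DISABLE=1), not megablocks.
t = torch.ones(1, device=f"cuda:{rank}")
dist.all_reduce(t)
print(f"rank {rank}: all_reduce ok, value={t.item()}")
dist.destroy_process_group()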

Guodanding (Author)

OK, I will try it.
