Multi node multi gpu distributed load #6927

Open
rastinrastinii opened this issue Jan 6, 2025 · 2 comments
Labels
enhancement New feature or request

Comments

@rastinrastinii

Hi,
It seems we can load a model on one node and then distribute it across nodes for training or inference. But
imagine we have 2 nodes, each with 2 GPUs, and each GPU has 24 GB of VRAM.
I want to load a model like Gemma 2 27B, which a single node cannot hold. The load needs to be distributed across the nodes from start to finish, without ever loading the complete model on one node. (Assume that no node can use offloading to hold the complete model either, as would be the case with Llama 3.1 405B.)

Is there a way to do this?
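
For a rough sense of the numbers in this setup, here is a back-of-the-envelope sketch. It assumes bf16 weights (2 bytes per parameter), a nominal 27e9 parameter count, and an even split of the weights across GPUs. It counts only the weights and ignores activations, KV cache, and framework overhead, so real headroom will be smaller. The function names are made up for illustration.

```python
# Weights-only estimate in decimal GB (1e9 bytes).
# Assumptions (not from the thread): bf16 at 2 bytes per parameter,
# nominal parameter count, weights split evenly across GPUs.

BYTES_PER_PARAM_BF16 = 2
GB = 1e9


def weight_gb(num_params: float, bytes_per_param: int = BYTES_PER_PARAM_BF16) -> float:
    """Size of the raw weights in GB."""
    return num_params * bytes_per_param / GB


def fits(num_params: float, gpus: int, vram_gb_per_gpu: float) -> bool:
    """True if evenly sharded weights fit in the VRAM of `gpus` GPUs."""
    return weight_gb(num_params) / gpus <= vram_gb_per_gpu


NODES, GPUS_PER_NODE, VRAM_GB = 2, 2, 24
params_27b = 27e9

total_gb = weight_gb(params_27b)
fits_one_node = fits(params_27b, GPUS_PER_NODE, VRAM_GB)
fits_cluster = fits(params_27b, NODES * GPUS_PER_NODE, VRAM_GB)

print(f"weights: {total_gb:.1f} GB")
print(f"fits on one node (2 GPUs): {fits_one_node}")
print(f"fits across both nodes (4 GPUs): {fits_cluster}")
```

Under these assumptions the weights come to 54 GB: 27 GB per GPU on a single node (over the 24 GB limit), but 13.5 GB per GPU when sharded across all four GPUs.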

@rastinrastinii rastinrastinii added the enhancement New feature or request label Jan 6, 2025
@tjruwase
Contributor

tjruwase commented Jan 6, 2025

It sounds like you need ZeRO stage 3 model initialization. The following links could be useful:

  1. https://deepspeed.readthedocs.io/en/latest/zero3.html#constructing-massive-models
  2. HF: https://huggingface.co/docs/transformers/main/deepspeed#deepspeed
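
As a rough illustration of the pattern the second link describes, the sketch below follows the non-Trainer DeepSpeed integration from the Hugging Face docs: an `HfDeepSpeedConfig` object is created before `from_pretrained`, so that a ZeRO stage 3 config is detected and parameters are partitioned across ranks as the model is constructed. The config values, the helper names `make_zero3_config` and `load_partitioned`, and the choice of `AutoModelForCausalLM` are assumptions for illustration and are not confirmed anywhere in this thread. Treat it as a starting point, not a verified recipe for the 2-node setup above.

```python
def make_zero3_config(micro_batch_size: int = 1) -> dict:
    """Minimal ZeRO stage 3 config (illustrative values, not tuned)."""
    return {
        "bf16": {"enabled": True},
        "zero_optimization": {"stage": 3},
        "train_micro_batch_size_per_gpu": micro_batch_size,
    }


def load_partitioned(model_name: str, ds_config: dict):
    """Load a Hugging Face model with ZeRO-3 partitioning active (hypothetical helper).

    Heavy imports are kept inside the function so the module can be
    imported on a machine without DeepSpeed installed.
    """
    import deepspeed
    from transformers import AutoModelForCausalLM
    # Older transformers releases expose this under transformers.deepspeed.
    from transformers.integrations import HfDeepSpeedConfig

    # Must be created BEFORE from_pretrained and kept alive, so that
    # transformers detects ZeRO-3 and partitions weights during loading.
    dschf = HfDeepSpeedConfig(ds_config)

    model = AutoModelForCausalLM.from_pretrained(model_name)

    # No optimizer is passed: the engine is used for inference only.
    engine = deepspeed.initialize(model=model, config_params=ds_config)[0]
    engine.module.eval()
    return dschf, engine


ds_config = make_zero3_config()
```

Every rank on every node would run the same script and call `load_partitioned(...)` with the model id, for example when started through the `deepspeed` launcher with a hostfile. How much CPU RAM each rank needs while reading the checkpoint files is worth checking separately.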

@rastinrastinii
Author

rastinrastinii commented Jan 7, 2025

Thanks for your help. But the problem is this: in the following code, it seems that the call to pipeline, which runs before deepspeed.init_inference, loads the model completely on one node. Is that true? If so, since I cannot load the model completely on one node, how can I use this?
https://www.deepspeed.ai/tutorials/inference-tutorial/#initializing-for-inference

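# create the model
import os

import torch
from transformers import pipeline
from transformers.models.t5.modeling_t5 import T5Block

import deepspeed

# Rank and world size, read from the launcher's environment variables
# (assumed here; the snippet as posted left them undefined)
local_rank = int(os.getenv("LOCAL_RANK", "0"))
world_size = int(os.getenv("WORLD_SIZE", "1"))

pipe = pipeline("text2text-generation", model="google/t5-v1_1-small", device=local_rank)
# Initialize the DeepSpeed-Inference engine
pipe.model = deepspeed.init_inference(
    pipe.model,
    tensor_parallel={"tp_size": world_size},
    dtype=torch.float,
    injection_policy={T5Block: ('SelfAttention.o', 'EncDecAttention.o', 'DenseReluDense.wo')}
)
output = pipe('Input String')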
