DeepSpeed Installation Fails During Docker Build (NVML Initialization Issue) #6945

Closed · asdfry opened this issue Jan 13, 2025 · 8 comments

asdfry commented Jan 13, 2025

Hello,
I encountered an issue while building a Docker image for deep learning model training, specifically when attempting to install DeepSpeed.

Issue
When building the Docker image, the DeepSpeed installation fails with a warning that NVML initialization is not possible.
However, if I create a container from the same image and install DeepSpeed inside the container, the installation works without any issues.

Environment
Base Image: nvcr.io/nvidia/pytorch:23.01-py3
DeepSpeed Version: 0.16.2

Build Log
docker_build.log

Additional Context
The problem does not occur with the newer base image nvcr.io/nvidia/pytorch:24.05-py3.

Thank you.

@loadams loadams self-assigned this Jan 13, 2025

loadams commented Jan 13, 2025

Hi @asdfry - The errors appear to come from gcc; perhaps the gcc versions differ between environments and are causing the issue?

error: command 'x86_64-linux-gnu-gcc' failed with exit status 1

Also, some of the warnings clouding the output come from py-cpuinfo not being installed; could you add that package and share the log again?

loadams commented Jan 21, 2025

Hi @asdfry - following up on this, could you share the full dockerfile that you're using so we can repro?

asdfry commented Jan 21, 2025

Hello, thank you for continuing to follow up on this.
I apologize for forgetting about this issue as I’ve been occupied with other tasks.
I'm sharing the Dockerfile and requirements.txt that reproduce the error below.

# Dockerfile
FROM nvcr.io/nvidia/pytorch:23.01-py3

SHELL ["/bin/bash", "-c"]

USER root

WORKDIR /root

ENV DEBIAN_FRONTEND=noninteractive

# Set env for torch (compute capability)
ENV TORCH_CUDA_ARCH_LIST=9.0

# Install packages
RUN apt update && \
    curl -s https://packagecloud.io/install/repositories/github/git-lfs/script.deb.sh | bash && \
    apt install -y git-lfs pdsh openssh-server net-tools tmux tree libaio-dev iputils-ping iproute2 libnvidia-compute-535

# Set for installation
ENV mlnx_image=MLNX_OFED_LINUX-23.10-3.2.2.0-ubuntu20.04-x86_64
ENV hpcx_image=hpcx-v2.18.1-gcc-mlnx_ofed-ubuntu20.04-cuda12-x86_64

# Install mlnx ofed
RUN wget http://www.mellanox.com/downloads/ofed/MLNX_OFED-23.10-3.2.2.0/$mlnx_image.tgz && \
    tar -xvf $mlnx_image.tgz && \
    rm $mlnx_image.tgz && \
    ./$mlnx_image/mlnxofedinstall --user-space-only --without-fw-update -q

# Install hpc-x
RUN wget http://www.mellanox.com/downloads/hpc/hpc-x/v2.18.1/$hpcx_image.tbz && \
    tar -xvf $hpcx_image.tbz && \
    rm $hpcx_image.tbz
ENV HPCX_HOME=/root/$hpcx_image

# Install python & pip and Install libraries
ENV DS_BUILD_CPU_ADAM=1
COPY requirements.txt requirements.txt
RUN curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py && \
    python3 get-pip.py && \
    pip install --no-cache-dir -r requirements.txt

# Copy files required for training
COPY source .
COPY configs configs

# requirements.txt
transformers==4.46.3
datasets==3.1.0
accelerate==1.0.1
nvitop==1.3.2
loguru==0.7.2
google-cloud-firestore==2.15.0
google-cloud-storage==2.14.0
jsonlines==4.0.0
peft==0.13.2
deepspeed==0.16.2

loadams commented Jan 22, 2025

Hi @asdfry - thanks for the repro case. I'm able to repro it. Where we've used these docker base images, we started with 23.03; that version seems to work, as do the newer versions.

It looks like a gcc error, but I'm not sure what the specific issue is. I was able to repro with an even smaller dockerfile, with the Mellanox OFED and HPC-X steps removed too.

FROM nvcr.io/nvidia/pytorch:23.01-py3

SHELL ["/bin/bash", "-c"]

USER root

WORKDIR /root

ENV DEBIAN_FRONTEND=noninteractive

# Install packages
RUN apt update

# Install python & pip and Install libraries
ENV DS_BUILD_CPU_ADAM=1

COPY requirements.txt requirements.txt
RUN curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py && \
    python3 get-pip.py && \
    pip install --no-cache-dir -r requirements.txt

Are you able to use a newer version of the pytorch nvcr container to resolve the problem?

asdfry commented Jan 23, 2025

As I mentioned in my original report, I confirmed that updating the base image version resolves the DeepSpeed installation issue.
However, the image I need must work with NVIDIA driver 525, so I was using that specific base image.
Interestingly, the issue doesn't occur when I install DeepSpeed inside a running container, rather than during the image build.

$ docker run -it --rm nvcr.io/nvidia/pytorch:23.01-py3 /bin/bash

=============
== PyTorch ==
=============

NVIDIA Release 23.01 (build 52269074)
PyTorch Version 1.14.0a0+44dac51

Container image Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

Copyright (c) 2014-2023 Facebook Inc.
Copyright (c) 2011-2014 Idiap Research Institute (Ronan Collobert)
Copyright (c) 2012-2014 Deepmind Technologies    (Koray Kavukcuoglu)
Copyright (c) 2011-2012 NEC Laboratories America (Koray Kavukcuoglu)
Copyright (c) 2011-2013 NYU                      (Clement Farabet)
Copyright (c) 2006-2010 NEC Laboratories America (Ronan Collobert, Leon Bottou, Iain Melvin, Jason Weston)
Copyright (c) 2006      Idiap Research Institute (Samy Bengio)
Copyright (c) 2001-2004 Idiap Research Institute (Ronan Collobert, Samy Bengio, Johnny Mariethoz)
Copyright (c) 2015      Google Inc.
Copyright (c) 2015      Yangqing Jia
Copyright (c) 2013-2016 The Caffe contributors
All rights reserved.

Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES.  All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

WARNING: The NVIDIA Driver was not detected.  GPU functionality will not be available.
   Use the NVIDIA Container Toolkit to start this container with GPU support; see
   https://docs.nvidia.com/datacenter/cloud-native/ .

NOTE: The SHMEM allocation limit is set to the default of 64MB.  This may be
   insufficient for PyTorch.  NVIDIA recommends the use of the following flags:
   docker run --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 ...

root@84ec7a9ea1c9:/workspace# pip install deepspeed
Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
Collecting deepspeed
  Downloading deepspeed-0.16.3.tar.gz (1.4 MB)
     |████████████████████████████████| 1.4 MB 10.5 MB/s 
Collecting einops
  Downloading einops-0.8.0-py3-none-any.whl (43 kB)
     |████████████████████████████████| 43 kB 15.0 MB/s 
Collecting hjson
  Downloading hjson-3.1.0-py3-none-any.whl (54 kB)
     |████████████████████████████████| 54 kB 14.5 MB/s 
Requirement already satisfied: msgpack in /usr/local/lib/python3.8/dist-packages (from deepspeed) (1.0.4)
Collecting ninja
  Downloading ninja-1.11.1.3-py3-none-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (422 kB)
     |████████████████████████████████| 422 kB 11.4 MB/s 
Requirement already satisfied: numpy in /usr/local/lib/python3.8/dist-packages (from deepspeed) (1.22.2)
Requirement already satisfied: packaging>=20.0 in /usr/local/lib/python3.8/dist-packages (from deepspeed) (22.0)
Requirement already satisfied: psutil in /usr/local/lib/python3.8/dist-packages (from deepspeed) (5.9.4)
Collecting py-cpuinfo
  Downloading py_cpuinfo-9.0.0-py3-none-any.whl (22 kB)
Collecting pydantic>=2.0.0
  Downloading pydantic-2.10.5-py3-none-any.whl (431 kB)
     |████████████████████████████████| 431 kB 11.0 MB/s 
Requirement already satisfied: torch in /usr/local/lib/python3.8/dist-packages (from deepspeed) (1.14.0a0+44dac51)
Requirement already satisfied: tqdm in /usr/local/lib/python3.8/dist-packages (from deepspeed) (4.64.1)
Collecting annotated-types>=0.6.0
  Downloading annotated_types-0.7.0-py3-none-any.whl (13 kB)
Collecting typing-extensions>=4.12.2
  Downloading typing_extensions-4.12.2-py3-none-any.whl (37 kB)
Collecting pydantic-core==2.27.2
  Downloading pydantic_core-2.27.2-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.0 MB)
     |████████████████████████████████| 2.0 MB 11.1 MB/s 
Requirement already satisfied: sympy in /usr/local/lib/python3.8/dist-packages (from torch->deepspeed) (1.11.1)
Requirement already satisfied: networkx in /usr/local/lib/python3.8/dist-packages (from torch->deepspeed) (2.6.3)
Requirement already satisfied: mpmath>=0.19 in /usr/local/lib/python3.8/dist-packages (from sympy->torch->deepspeed) (1.2.1)
Building wheels for collected packages: deepspeed
  Building wheel for deepspeed (setup.py) ... done
  Created wheel for deepspeed: filename=deepspeed-0.16.3-py3-none-any.whl size=1549958 sha256=057c552cf5b248514aa2293002f22da03d8d6e651a73141ac8ebf19d2c59c77e
  Stored in directory: /tmp/pip-ephem-wheel-cache-au9njj9c/wheels/72/85/51/65020b7f481c0b9e013a823b05be2d297ab81a1627d4cb8666
Successfully built deepspeed
Installing collected packages: typing-extensions, pydantic-core, annotated-types, pydantic, py-cpuinfo, ninja, hjson, einops, deepspeed
  Attempting uninstall: typing-extensions
    Found existing installation: typing-extensions 4.4.0
    Uninstalling typing-extensions-4.4.0:
      Successfully uninstalled typing-extensions-4.4.0
  Attempting uninstall: pydantic
    Found existing installation: pydantic 1.10.4
    Uninstalling pydantic-1.10.4:
      Successfully uninstalled pydantic-1.10.4
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
thinc 8.1.7 requires pydantic!=1.8,!=1.8.1,<1.11.0,>=1.7.4, but you have pydantic 2.10.5 which is incompatible.
spacy 3.5.0 requires pydantic!=1.8,!=1.8.1,<1.11.0,>=1.7.4, but you have pydantic 2.10.5 which is incompatible.
confection 0.0.4 requires pydantic!=1.8,!=1.8.1,<1.11.0,>=1.7.4, but you have pydantic 2.10.5 which is incompatible.
Successfully installed annotated-types-0.7.0 deepspeed-0.16.3 einops-0.8.0 hjson-3.1.0 ninja-1.11.1.3 py-cpuinfo-9.0.0 pydantic-2.10.5 pydantic-core-2.27.2 typing-extensions-4.12.2
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
WARNING: You are using pip version 21.2.4; however, version 24.3.1 is available.
You should consider upgrading via the '/usr/bin/python -m pip install --upgrade pip' command.

root@84ec7a9ea1c9:/workspace# ds_report
[2025-01-23 05:17:37,999] [WARNING] [real_accelerator.py:181:get_accelerator] Setting accelerator to CPU. If you have GPU or other accelerator, we were unable to detect it.
[2025-01-23 05:17:38,008] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cpu (auto detect)
df: /root/.triton/autotune: No such file or directory
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
deepspeed_not_implemented  [NO] ....... [OKAY]
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io: please install the libaio-dev package with apt
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
deepspeed_ccl_comm ..... [NO] ....... [OKAY]
deepspeed_shm_comm ..... [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/usr/local/lib/python3.8/dist-packages/torch']
torch version .................... 1.14.0a0+44dac51
deepspeed install path ........... ['/usr/local/lib/python3.8/dist-packages/deepspeed']
deepspeed info ................... 0.16.3, unknown, unknown
deepspeed wheel compiled w. ...... torch 1.14 
shared memory (/dev/shm) size .... 64.00 MB
  [WARNING] /dev/shm size might be too small, if running in docker increase to at least --shm-size='1gb' 

loadams commented Jan 23, 2025

Hi @asdfry - thanks, I saw you acknowledged that updating the version worked, but I didn't know about the hard requirement on the NVIDIA driver version. I believe the reason this works inside the container is that you are not pre-compiling the cpu_adam op there, whereas the dockerfile builds it ahead of time via DS_BUILD_CPU_ADAM=1. If I run:

DS_BUILD_CPU_ADAM=1 pip install deepspeed

I am able to repro the failure in the docker image. I'll take a look at what else could be wrong.

loadams commented Jan 24, 2025

@asdfry -

I believe the issue is that cpu_adam needs to be compiled with --std=c++17, but in this case it is being compiled with --std=c++14. This comes from checks within DeepSpeed (applied to other ops, though we should enforce the same here) that, for backwards compatibility, select --std=c++14 when the user is on torch < 2.1. This docker container ships torch 1.14, so it hits that condition, compiles with c++14, and fails with these errors.
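
To illustrate the kind of version gate being described, here is a minimal hypothetical sketch (not DeepSpeed's actual builder code; the helper name cxx_std_flag is made up for this example):

from packaging import version
import torch

def cxx_std_flag() -> str:
    # Hypothetical sketch of the gate described above: choose the C++ standard
    # from the installed torch version, falling back to c++14 for torch < 2.1.
    torch_version = version.parse(torch.__version__.split("+")[0])  # drop local suffix like "+44dac51"
    if torch_version < version.parse("2.1"):
        return "-std=c++14"  # backwards-compatibility path; too old for cpu_adam
    return "-std=c++17"  # the standard cpu_adam actually needs

# On the 23.01 NGC image (torch 1.14.0a0+44dac51) this prints "-std=c++14",
# which matches the failing pre-build of cpu_adam during the docker build.
print(cxx_std_flag())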

I was able to use this docker image, update torch to 2.2 (for example) and build successfully. I believe that should unblock you unless you are also bound by the torch version?
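
For reference, a rough sketch of that workaround as a standalone Dockerfile (assumptions: torch 2.2.2 from the cu121 wheel index and deepspeed 0.16.2 are illustrative pins, not versions verified in this thread; check that the wheel matches your CUDA and driver stack):

# Hedged sketch of the workaround above, not a verified drop-in for the original Dockerfile:
# upgrade torch past 2.1 before pre-building cpu_adam so the op is compiled with c++17.
FROM nvcr.io/nvidia/pytorch:23.01-py3
ENV DS_BUILD_CPU_ADAM=1
# torch 2.2.2/cu121 is only an example build; pick one compatible with your CUDA/driver setup
RUN pip install --no-cache-dir torch==2.2.2 --index-url https://download.pytorch.org/whl/cu121 && \
    pip install --no-cache-dir deepspeed==0.16.2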

asdfry commented Jan 24, 2025

Hello,

Thank you so much for taking the time to provide such detailed answers to my question.
Thanks to your help, I was able to resolve the issue and learn a lot in the process.

Wishing you a wonderful day!

@asdfry asdfry closed this as completed Jan 24, 2025