- The DigitalOcean DCGM Exporter is assumed to run on DigitalOcean Droplets.
- NVIDIA drivers must be installed. Verify that the `nvidia-smi` binary is available and can discover GPUs and NVSwitches.
  - When using the default OS base image for GPU Droplets (currently named `AI/ML ready`), the NVIDIA drivers are already preinstalled.
# output for a droplet with a single H100 GPU
root@myGPUDroplet:~# nvidia-smi
Tue Feb 11 21:20:03 2025
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.216.01 Driver Version: 535.216.01 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA H100 80GB HBM3 On | 00000000:00:09.0 Off | 0 |
| N/A 29C P0 73W / 700W | 0MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
- NVIDIA Data Center GPU Manager (DCGM) must be installed.
  - Installing DCGM on a DigitalOcean Droplet usually takes only the following commands:
$ sudo apt install datacenter-gpu-manager
$ sudo systemctl --now enable nvidia-dcgm
# output for a droplet with a single H100 GPU
root@myGPUDroplet:~# dcgmi discovery -l
1 GPU found.
+--------+----------------------------------------------------------------------+
| GPU ID | Device Information |
+--------+----------------------------------------------------------------------+
| 0 | Name: NVIDIA H100 80GB HBM3 |
| | PCI Bus ID: 00000000:00:09.0 |
| | Device UUID: GPU-abc |
+--------+----------------------------------------------------------------------+
0 NvSwitches found.
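
The prerequisites above can be checked in one pass. The following is a minimal sketch, not part of the package: the function names `have_cmd`, `gpu_count_from_discovery`, and `check_prereqs` are hypothetical, and the parser assumes the `N GPU found.` summary line shown in the `dcgmi discovery -l` output above.

```shell
# Sketch of a prerequisite check. All function names here are hypothetical.

# Return 0 if the named command is on PATH, 1 otherwise.
have_cmd() {
  command -v "$1" >/dev/null 2>&1
}

# Read `dcgmi discovery -l` output on stdin and print the GPU count taken
# from the "N GPU found." / "N GPUs found." line; print 0 if no such line exists.
gpu_count_from_discovery() {
  awk '/GPUs? found\./ { print $1; found = 1; exit } END { if (!found) print 0 }'
}

# Check that both tools exist, list the GPUs the driver sees, and confirm
# that DCGM reports at least one GPU.
check_prereqs() {
  missing=0
  for cmd in nvidia-smi dcgmi; do
    if ! have_cmd "$cmd"; then
      echo "missing: $cmd" >&2
      missing=1
    fi
  done
  [ "$missing" -eq 0 ] || return 1

  nvidia-smi -L || return 1   # prints one line per GPU visible to the driver
  gpus=$(dcgmi discovery -l | gpu_count_from_discovery)
  echo "DCGM reports ${gpus} GPU(s)"
  [ "$gpus" -gt 0 ]
}
```

Running `check_prereqs` on the Droplet returns a non-zero status if either tool is missing or DCGM finds no GPU.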
Please see build.md.
Download the `apt` package in `.deb` format for your OS version:
- from the GitHub release page: https://github.com/digitalocean/do-dcgm-exporter/releases
- or from GitHub Pages (see below)
OS_RELEASE_VERSION=jammy # focal, jammy, noble
DO_DCGM_EXPORTER_VERSION=0.0.1
wget https://digitalocean.github.io/do-dcgm-exporter/ubuntu/pool/${OS_RELEASE_VERSION}/do-dcgm-exporter_${DO_DCGM_EXPORTER_VERSION}_amd64-${OS_RELEASE_VERSION}.deb
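
To avoid hand-editing the two variables, the URL can be built by a small helper. This is a sketch only: `do_dcgm_exporter_deb_url` and `detect_codename` are hypothetical names, the URL layout is copied from the `wget` command above, and the accepted codenames are the three listed in its comment (`focal`, `jammy`, `noble`).

```shell
# Hypothetical helper: print the .deb download URL for a release codename
# and package version, rejecting codenames not listed in this document.
do_dcgm_exporter_deb_url() {
  codename=$1
  version=$2
  case "$codename" in
    focal|jammy|noble) ;;
    *) echo "unsupported Ubuntu release: $codename" >&2; return 1 ;;
  esac
  printf 'https://digitalocean.github.io/do-dcgm-exporter/ubuntu/pool/%s/do-dcgm-exporter_%s_amd64-%s.deb\n' \
    "$codename" "$version" "$codename"
}

# Print the codename of the running system as recorded in /etc/os-release.
# The subshell keeps the sourced variables out of the caller's environment.
detect_codename() {
  ( . /etc/os-release && printf '%s\n' "$VERSION_CODENAME" )
}
```

For example, `wget "$(do_dcgm_exporter_deb_url "$(detect_codename)" 0.0.1)"` downloads the package that matches the running Ubuntu release.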
Next, install the package from the local filesystem. The installation:
- sets up `/etc/apt/sources.list.d/do-dcgm-exporter.list` for future package upgrades
- creates `/etc/systemd/system/do-dcgm-exporter.service`
$ sudo apt install ./do-dcgm-exporter_${DO_DCGM_EXPORTER_VERSION}_amd64-${OS_RELEASE_VERSION}.deb
Then start the `do-dcgm-exporter.service` service:
$ sudo systemctl daemon-reload
$ sudo systemctl start do-dcgm-exporter.service
Check whether the service is running:
$ sudo systemctl status do-dcgm-exporter
● do-dcgm-exporter.service - DigitalOcean DCGM Exporter
Loaded: loaded (/etc/systemd/system/do-dcgm-exporter.service; disabled; vendor preset: enabled)
Active: active (running) since Fri 2025-02-07 22:12:52 UTC; 47s ago
Main PID: 596393 (do-dcgm-exporte)
Tasks: 15 (limit: 289778)
Memory: 32.8M
CPU: 37ms
CGroup: /system.slice/do-dcgm-exporter.service
└─596393 /opt/digitalocean/bin/do-dcgm-exporter
Feb 07 22:12:52 my-droplet do-dcgm-exporter[596393]: time="2025-02-07T22:12:52Z" level=info msg="Initializing system entities of type: NvSwitch"
Feb 07 22:12:52 my-droplet do-dcgm-exporter[596393]: time="2025-02-07T22:12:52Z" level=info msg="Not collecting NvSwitch metrics: no switches to monitor"
Feb 07 22:12:52 my-droplet do-dcgm-exporter[596393]: time="2025-02-07T22:12:52Z" level=info msg="Initializing system entities of type: NvLink"
Feb 07 22:12:52 my-droplet do-dcgm-exporter[596393]: time="2025-02-07T22:12:52Z" level=info msg="Not collecting NvLink metrics: no switches to monitor"
Feb 07 22:12:52 my-droplet do-dcgm-exporter[596393]: time="2025-02-07T22:12:52Z" level=info msg="Not collecting CPU metrics: no fields to watch for devi>
Feb 07 22:12:52 my-droplet do-dcgm-exporter[596393]: time="2025-02-07T22:12:52Z" level=info msg="Not collecting CPU Core metrics: no fields to watch for>
Feb 07 22:12:52 my-droplet do-dcgm-exporter[596393]: time="2025-02-07T22:12:52Z" level=info msg="Pipeline starting"
Feb 07 22:12:52 my-droplet do-dcgm-exporter[596393]: time="2025-02-07T22:12:52Z" level=info msg="Starting webserver"
Feb 07 22:12:52 my-droplet do-dcgm-exporter[596393]: time="2025-02-07T22:12:52Z" level=info msg="Listening on" address="[::]:9401"
Feb 07 22:12:52 my-droplet do-dcgm-exporter[596393]: time="2025-02-07T22:12:52Z" level=info msg="TLS is disabled." address="[::]:9401" http2=false
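
The log above shows the exporter listening on port 9401, so a scrape of that port is a further check beyond `systemctl status`. The sketch below rests on two assumptions this document does not confirm: that metrics are served at the `/metrics` path, and that GPU metric names start with `DCGM_`, as they do in the upstream NVIDIA dcgm-exporter. `count_samples` and `check_exporter` are hypothetical names.

```shell
# Count sample lines (not "# HELP" / "# TYPE" comments) whose metric name
# starts with the given prefix. Reads Prometheus text format on stdin.
count_samples() {
  prefix=$1
  awk -v p="$prefix" '!/^#/ && index($0, p) == 1 { n++ } END { print n + 0 }'
}

# Scrape the exporter and succeed only if at least one DCGM_ sample is returned.
# Assumption: the endpoint path is /metrics; the port comes from the log above.
check_exporter() {
  url=${1:-http://localhost:9401/metrics}
  samples=$(curl -fsS "$url" | count_samples DCGM_)
  echo "$samples DCGM samples from $url"
  [ "$samples" -gt 0 ]
}
```

Calling `check_exporter` on the Droplet returns a non-zero status when the endpoint is unreachable or returns no matching samples. Separately, the `disabled` field in the `Loaded:` line above means the unit does not start at boot; `sudo systemctl enable do-dcgm-exporter.service` changes that if desired.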