Tips for Cloud GPU Newbies
I burnt a lot of my compute credits for no good reason. So read this and learn from my mistakes.
Runpod Infrastructure
- Network volumes are datacenter-locked. Your volume must be in the same datacenter as your pod. You can't attach a volume from US-TX-3 to a pod in EUR-IS-1. Check your pod's datacenter first, then create the volume there.
- Don't download data on an 8xH100 pod ($21.52/hr). I spent ~$5 downloading and training a tokenizer on a GPU pod doing zero GPU work. Prep data on a cheap pod, save to a network volume, then attach the volume to your expensive pod.
- Container disk is ephemeral. When you terminate a pod, everything on container disk is gone. Save anything you need to `/runpod-volume/` or download it before terminating.
- GPU availability changes constantly. Don't plan around a specific datacenter having specific GPUs. Check availability right before deploying.
- Blackwell GPUs (RTX Pro 6000, RTX 5090) have issues with torch.compile. RTX Pro 6000 hung indefinitely on compilation that takes 3 minutes on H100. Stick to Hopper (H100) or Ada Lovelace (RTX 4090) for reliable torch.compile behavior.
- HuggingFace downloads can silently produce 0-byte files. I only caught this when training crashed on a "corrupt" val shard. Always validate: check file sizes and verify the header bytes after download.
- 30GB network volume is enough. SP-1024 data is 75 training shards × ~200MB = ~15GB, plus a ~124MB val shard and a tiny tokenizer model. Even with SP-4096 or SP-8192 shards alongside SP-1024, 30GB fits comfortably. You don't need the full 195 shards: in 10 minutes of training you'll see ~7B tokens, which is ~70 shards. The rest are never touched.
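The download-validation advice above can be sketched as a small check. This is a minimal sketch, not the competition's actual script: the function name is mine, and the `magic` header bytes depend on your shard format, so substitute the real magic for your files.

```python
import os

def validate_download(path, min_bytes=1, magic=None):
    """Reject empty or truncated downloads before training touches them.

    HuggingFace downloads can silently produce 0-byte files; catching that
    here is much cheaper than a crash mid-run on an 8xH100 pod.
    """
    size = os.path.getsize(path)
    if size < min_bytes:
        raise ValueError(f"{path}: only {size} bytes (expected >= {min_bytes})")
    if magic is not None:
        with open(path, "rb") as f:
            head = f.read(len(magic))
        if head != magic:
            raise ValueError(f"{path}: header {head!r} != expected {magic!r}")
    return size
```

Run it over every shard right after downloading, before you ever attach the volume to an expensive pod.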
Data & Tokenizers
- SP-4096 and SP-8192 pre-tokenized data is publicly available on `sproos/parameter-golf-tokenizers`. Don't spend an hour training a tokenizer from scratch like I did.
- Tokenizer training is CPU-only work, no GPU needed. But the raw `docs_selected.jsonl` is ~45GB, so you need sufficient disk space.
- The val shard is on HuggingFace at `datasets/datasets/fineweb10B_sp1024/` (note the doubled `datasets/`). The download script handles this, but if you're downloading manually, watch the path.
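If you do download manually, the doubled-path gotcha is easier to avoid by building the arguments in one place. A sketch, with loud assumptions: the host repo id and the shard filename below are illustrative guesses, so list the repo's files first and substitute the real names.

```python
def val_shard_spec(shard_name):
    """Build hf_hub_download kwargs for a val shard (names are assumptions)."""
    # The path *inside* the repo itself starts with "datasets/", which is why
    # manually built URLs end up with the doubled "datasets/datasets/..." segment.
    return {
        "repo_id": "sproos/parameter-golf-tokenizers",  # assumed host repo
        "repo_type": "dataset",
        "filename": f"datasets/fineweb10B_sp1024/{shard_name}",
    }

# Usage (network call; needs huggingface_hub installed, shard name is hypothetical):
#   from huggingface_hub import hf_hub_download
#   path = hf_hub_download(**val_shard_spec("fineweb_val_000000.bin"))
```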
Workflow
- Prepare everything locally before touching Runpod. Write your training script, fix syntax errors, validate imports. Every minute on a GPU pod costs money.
- Test on 1 GPU first (2-3 minutes). Validates the architecture compiles, loss drops, and quantization works. Don't burn an 8xH100 run on untested code.
- The step-0 validation eval blocks for ~35 minutes on 1 GPU. Skip it for POC runs by setting `VAL_LOSS_EVERY` to a high number, or add a flag to skip the initial eval.
- torch.compile takes 3-4 minutes on H100. The baseline's warmup phase (20 steps) triggers compilation, then resets the model weights and timer. The 600-second wallclock only counts actual training steps. Compile time is "free": it typically adds 3-4 minutes before training starts, but doesn't reduce your training budget.
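The skip-the-initial-eval flag mentioned above could look like the sketch below. `VAL_LOSS_EVERY` is the doc's own knob; the `SKIP_FIRST_EVAL` flag name and the default interval of 250 are my assumptions, not the baseline's actual values.

```python
import os

def should_run_val(step, val_every, skip_first):
    """True when the validation eval should run at this training step."""
    if step == 0:
        return not skip_first  # the step-0 eval that blocks ~35 min on 1 GPU
    return step % val_every == 0

# Wire the knobs to env vars, e.g. SKIP_FIRST_EVAL=1 python train.py
VAL_EVERY = int(os.environ.get("VAL_LOSS_EVERY", "250"))       # default is illustrative
SKIP_FIRST = os.environ.get("SKIP_FIRST_EVAL", "0") == "1"     # hypothetical flag
```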
SSH & Remote Debugging
- Runpod's SSH gateway prints
Error: Your SSH client doesn't support PTYeven when commands succeed. The command executes fine, the error is from their proxy, not your client. UseSSH over exposed TCP(direct connection) instead of the gateway SSH to avoid this entirely. - Use the direct TCP SSH, not the gateway SSH. Gateway:
ssh pod-id@ssh.runpod.io(broken output). Direct:ssh root@<ip> -p <port>(works perfectly, also supports SCP/SFTP for file transfer). pip installfails on Runpod containers with PEP 668. Usepip install --break-system-packagesor create a venv. The error message about "externally-managed-environment" is misleading, you're root in a container, it's fine.- Always install
sentencepieceandhuggingface_hubbefore running. They're not in the default Runpod PyTorch image. Add them to your setup command. torchrunon 1 GPU still works it just setsworld_size=1. Good for POC testing. No need for the--standalone --nproc_per_node=8flags on single GPU, plainpythonworks.- Add a
USE_COMPILE=0env var toggle to your script. torch.compile is essential for competition speed but takes 3-4 minutes and blocks debugging. A one-line check likeif int(os.environ.get("USE_COMPILE", "1")): model = torch.compile(model)saves you time during POC testing on cheap GPUs. - Watch for Windows line endings (CRLF). If you develop on Windows and deploy to Linux, git may convert line endings. Add
* text=autoto.gitattributesor ensure your editor uses LF.
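The `USE_COMPILE` one-liner can be wrapped as a tiny helper; a sketch following the doc's env-var convention, with the function name being my own choice:

```python
import os

def maybe_compile(model):
    """Apply torch.compile only when USE_COMPILE != 0 (default: on)."""
    if int(os.environ.get("USE_COMPILE", "1")):
        import torch  # imported lazily so the eager path never touches torch here
        return torch.compile(model)  # pays the ~3-4 min compile warmup on H100
    return model  # eager mode: slower steps, but instant startup for debugging
```

Run POC tests on cheap GPUs with `USE_COMPILE=0`, then drop the variable for the real timed run.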