
[Tips] for newbies like myself #82

@jordankzf

Description


Tips for Cloud GPU Newbies

I burnt a lot of my compute credits for no good reason. So read this and learn from my mistakes.

Runpod Infrastructure

  1. Network volumes are datacenter-locked. Your volume must be in the same datacenter as your pod. You can't attach a volume from US-TX-3 to a pod in EUR-IS-1. Check your pod's datacenter first, then create the volume there.
  2. Don't download data on an 8xH100 pod ($21.52/hr). I spent ~$5 downloading and training a tokenizer on a GPU pod doing zero GPU work. Prep data on a cheap pod, save to a network volume, then attach the volume to your expensive pod.
  3. Container disk is ephemeral. When you terminate a pod, everything on container disk is gone. Save anything you need to /runpod-volume/ or download it before terminating.
  4. GPU availability changes constantly. Don't plan around a specific datacenter having specific GPUs. Check availability right before deploying.
  5. Blackwell GPUs (RTX Pro 6000, RTX 5090) have issues with torch.compile. The RTX Pro 6000 hung indefinitely on a compilation that takes ~3 minutes on an H100. Stick to Hopper (H100) or Ada Lovelace (RTX 4090) for reliable torch.compile behavior.
  6. HuggingFace downloads can silently produce 0-byte files. I only caught this when training crashed on a "corrupt" val shard. Always validate: check file sizes and verify the header bytes after download.
  7. A 30GB network volume is enough. SP-1024 data is 75 training shards × ~200MB ≈ 15GB, plus a ~124MB val shard and a tiny tokenizer model. Even with SP-4096 or SP-8192 shards alongside SP-1024, 30GB fits comfortably. You don't need the full 195 shards: in 10 minutes of training you'll see ~7B tokens, which is ~70 shards; the rest are never touched.
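A minimal sketch of the post-download check from tip 6. The magic value and the idea of a 4-byte header are placeholders — verify against whatever header format your shards actually use:

```python
import os
import struct

def validate_shard(path: str, expected_magic: int) -> None:
    """Fail fast on 0-byte or header-corrupt downloads.

    `expected_magic` is a hypothetical 4-byte little-endian header value;
    substitute the real header check for your shard format.
    """
    size = os.path.getsize(path)
    if size == 0:
        raise ValueError(f"{path} is 0 bytes - re-download it")
    with open(path, "rb") as f:
        header = f.read(4)
    if len(header) < 4:
        raise ValueError(f"{path} is truncated ({size} bytes)")
    (magic,) = struct.unpack("<I", header)
    if magic != expected_magic:
        raise ValueError(f"{path}: bad magic {magic:#x}, expected {expected_magic:#x}")
```

Run this over every shard right after downloading, before you spin up the expensive pod.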

Data & Tokenizers

  1. SP-4096 and SP-8192 pre-tokenized data is publicly available on sproos/parameter-golf-tokenizers. Don't spend an hour training a tokenizer from scratch like I did.
  2. Tokenizer training is CPU-only work; no GPU needed. But the raw docs_selected.jsonl is ~45GB, so you need sufficient disk space.
  3. The val shard is on HuggingFace at datasets/datasets/fineweb10B_sp1024/ (note the double datasets/). The download script handles this, but if you're downloading manually, watch the path.
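If you do fetch the val shard by hand, a tiny sketch that keeps the doubled prefix intact, assuming the in-repo path quoted above (the filename argument is a placeholder, not the real shard name):

```python
from posixpath import join

# In-repo directory for the val shard; "datasets/" really does appear twice.
VAL_DIR = "datasets/datasets/fineweb10B_sp1024"

def val_shard_path(filename: str) -> str:
    """Build the in-repo path for a manually downloaded file."""
    return join(VAL_DIR, filename)
```

The point is only that the second datasets/ is part of the path, not a typo to "fix".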

Workflow

  1. Prepare everything locally before touching Runpod. Write your training script, fix syntax errors, validate imports. Every minute on a GPU pod costs money.
  2. Test on 1 GPU first (2-3 minutes). Validates the architecture compiles, loss drops, and quantization works. Don't burn an 8xH100 run on untested code.
  3. The step-0 validation eval blocks for ~35 minutes on 1 GPU. Skip it for POC runs by setting VAL_LOSS_EVERY to a high number, or add a flag to skip the initial eval.
  4. torch.compile takes 3-4 minutes on H100. The baseline's warmup phase (20 steps) triggers compilation, then resets the model weights and timer. The 600-second wallclock only counts actual training steps. Compile time is "free": it typically adds 3-4 minutes before training starts, but doesn't reduce your training budget.
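One hedged way to implement tip 3. The function and both env var names besides VAL_LOSS_EVERY are my own, not from the baseline:

```python
import os

def should_run_val(step: int) -> bool:
    """Decide whether to run the validation eval at this step.

    VAL_LOSS_EVERY sets the eval cadence; SKIP_INITIAL_EVAL=1 skips the
    ~35-minute step-0 eval during POC runs (SKIP_INITIAL_EVAL is a
    hypothetical flag name).
    """
    every = int(os.environ.get("VAL_LOSS_EVERY", "250"))
    skip_initial = os.environ.get("SKIP_INITIAL_EVAL", "0") == "1"
    if step == 0 and skip_initial:
        return False
    return step % every == 0
```

Setting VAL_LOSS_EVERY to a huge number achieves the same thing more bluntly, but an explicit skip flag keeps the eval cadence sane for real runs.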

SSH & Remote Debugging

  1. Runpod's SSH gateway prints Error: Your SSH client doesn't support PTY even when commands succeed. The command executes fine; the error comes from their proxy, not your client. Use SSH over exposed TCP (direct connection) instead of the gateway SSH to avoid this entirely.
  2. Use the direct TCP SSH, not the gateway SSH. Gateway: ssh pod-id@ssh.runpod.io (broken output). Direct: ssh root@<ip> -p <port> (works perfectly, also supports SCP/SFTP for file transfer).
  3. pip install fails on Runpod containers with PEP 668. Use pip install --break-system-packages or create a venv. The "externally-managed-environment" error is misleading here: you're root in a disposable container, so overriding it is fine.
  4. Always install sentencepiece and huggingface_hub before running. They're not in the default Runpod PyTorch image. Add them to your setup command.
  5. torchrun on 1 GPU still works; it just sets world_size=1. Good for POC testing. No need for the --standalone --nproc_per_node=8 flags on a single GPU; plain python works.
  6. Add a USE_COMPILE=0 env var toggle to your script. torch.compile is essential for competition speed but takes 3-4 minutes and blocks debugging. A one-line check like if int(os.environ.get("USE_COMPILE", "1")): model = torch.compile(model) saves you time during POC testing on cheap GPUs.
  7. Watch for Windows line endings (CRLF). If you develop on Windows and deploy to Linux, git may convert line endings. Add * text=auto to .gitattributes or ensure your editor uses LF.
