
[Tips] for newbies like myself #82

@jordankzf

Description


Tips for Cloud GPU Newbies

I burnt a lot of my compute credits for no good reason. So read this and learn from my mistakes.

Runpod Infrastructure

  1. Network volumes are datacenter-locked. Your volume must be in the same datacenter as your pod. You can't attach a volume from US-TX-3 to a pod in EUR-IS-1. Check your pod's datacenter first, then create the volume there.
  2. Don't download data on an 8xH100 pod ($21.52/hr). I spent ~$5 downloading and training a tokenizer on a GPU pod doing zero GPU work. Prep data on a cheap pod, save to a network volume, then attach the volume to your expensive pod.
  3. Container disk is ephemeral. When you terminate a pod, everything on container disk is gone. Save anything you need to /runpod-volume/ or download it before terminating.
  4. GPU availability changes constantly. Don't plan around a specific datacenter having specific GPUs. Check availability right before deploying.
  5. Blackwell GPUs (RTX Pro 6000, RTX 5090) have issues with torch.compile. The RTX Pro 6000 hung indefinitely on a compilation that takes ~3 minutes on an H100. Stick to Hopper (H100) or Ada Lovelace (RTX 4090) for reliable torch.compile behavior.
  6. HuggingFace downloads can silently produce 0-byte files. I only caught this when training crashed on a "corrupt" val shard. Always validate: check file sizes and verify the header bytes after download.
  7. A 30GB network volume is enough. SP-1024 data is 75 training shards × ~200MB ≈ 15GB, plus a ~124MB val shard and a tiny tokenizer model. Even with SP-4096 or SP-8192 shards alongside SP-1024, 30GB fits comfortably. You don't need the full 195 shards: in 10 minutes of training you'll see ~7B tokens, which is ~70 shards; the rest are never touched.
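A minimal sketch of the post-download check from tip 6. The magic value and the idea of a 4-byte header are placeholders — verify against whatever header format your shards actually use:

```python
import os
import struct

def validate_shard(path: str, expected_magic: int) -> None:
    """Fail fast on 0-byte or header-corrupt downloads.

    `expected_magic` is a hypothetical 4-byte little-endian header value;
    substitute the real header check for your shard format.
    """
    size = os.path.getsize(path)
    if size == 0:
        raise ValueError(f"{path} is 0 bytes - re-download it")
    with open(path, "rb") as f:
        header = f.read(4)
    if len(header) < 4:
        raise ValueError(f"{path} is truncated ({size} bytes)")
    (magic,) = struct.unpack("<I", header)
    if magic != expected_magic:
        raise ValueError(f"{path}: bad magic {magic:#x}, expected {expected_magic:#x}")
```

Run this over every shard right after downloading, before you spin up the expensive pod.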

Data & Tokenizers

  1. SP-4096 and SP-8192 pre-tokenized data is publicly available on sproos/parameter-golf-tokenizers. Don't spend an hour training a tokenizer from scratch like I did.
  2. Tokenizer training is CPU-only work; no GPU needed. But the raw docs_selected.jsonl is ~45GB, so you need sufficient disk space.
  3. The val shard is on HuggingFace at datasets/datasets/fineweb10B_sp1024/ (note the double datasets/). The download script handles this, but if you're downloading manually, watch the path.
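If you do fetch the val shard by hand, a tiny sketch that keeps the doubled prefix intact, assuming the in-repo path quoted above (the filename argument is a placeholder, not the real shard name):

```python
from posixpath import join

# In-repo directory for the val shard; "datasets/" really does appear twice.
VAL_DIR = "datasets/datasets/fineweb10B_sp1024"

def val_shard_path(filename: str) -> str:
    """Build the in-repo path for a manually downloaded file."""
    return join(VAL_DIR, filename)
```

The point is only that the second datasets/ is part of the path, not a typo to "fix".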

Workflow

  1. Prepare everything locally before touching Runpod. Write your training script, fix syntax errors, validate imports. Every minute on a GPU pod costs money.
  2. Test on 1 GPU first (2-3 minutes). Validates the architecture compiles, loss drops, and quantization works. Don't burn an 8xH100 run on untested code.
  3. The step-0 validation eval blocks for ~35 minutes on 1 GPU. Skip it for POC runs by setting VAL_LOSS_EVERY to a high number, or add a flag to skip the initial eval.
  4. torch.compile takes 3-4 minutes on H100. The baseline's warmup phase (20 steps) triggers compilation, then resets the model weights and timer. The 600-second wallclock only counts actual training steps. Compile time is "free": it typically adds 3-4 minutes before training starts, but doesn't reduce your training budget.
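One hedged way to implement tip 3. The function and both env var names besides VAL_LOSS_EVERY are my own, not from the baseline:

```python
import os

def should_run_val(step: int) -> bool:
    """Decide whether to run the validation eval at this step.

    VAL_LOSS_EVERY sets the eval cadence; SKIP_INITIAL_EVAL=1 skips the
    ~35-minute step-0 eval during POC runs (SKIP_INITIAL_EVAL is a
    hypothetical flag name).
    """
    every = int(os.environ.get("VAL_LOSS_EVERY", "250"))
    skip_initial = os.environ.get("SKIP_INITIAL_EVAL", "0") == "1"
    if step == 0 and skip_initial:
        return False
    return step % every == 0
```

Setting VAL_LOSS_EVERY to a huge number achieves the same thing more bluntly, but an explicit skip flag keeps the eval cadence sane for real runs.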

SSH & Remote Debugging

  1. Runpod's SSH gateway prints Error: Your SSH client doesn't support PTY even when commands succeed. The command executes fine; the error comes from their proxy, not your client. Use SSH over exposed TCP (direct connection) instead of the gateway SSH to avoid this entirely.
  2. Use the direct TCP SSH, not the gateway SSH. Gateway: ssh pod-id@ssh.runpod.io (broken output). Direct: ssh root@<ip> -p <port> (works perfectly, also supports SCP/SFTP for file transfer).
  3. pip install fails on Runpod containers with PEP 668. Use pip install --break-system-packages or create a venv. The "externally-managed-environment" error is misleading here: you're root in a disposable container, so overriding it is fine.
  4. Always install sentencepiece and huggingface_hub before running. They're not in the default Runpod PyTorch image. Add them to your setup command.
  5. torchrun on 1 GPU still works; it just sets world_size=1. Good for POC testing. No need for the --standalone --nproc_per_node=8 flags on a single GPU; plain python works.
  6. Add a USE_COMPILE=0 env var toggle to your script. torch.compile is essential for competition speed but takes 3-4 minutes and blocks debugging. A one-line check like if int(os.environ.get("USE_COMPILE", "1")): model = torch.compile(model) saves you time during POC testing on cheap GPUs.
  7. Watch for Windows line endings (CRLF). If you develop on Windows and deploy to Linux, git may convert line endings. Add * text=auto to .gitattributes or ensure your editor uses LF.
