Where Capsule is, where it's going, and the gotchas that need cleaning up before it's not a toy.
Walk this checklist top-to-bottom on a running capsule and every item is a green light:
- Boot — `make image && make qemu`; `capsuled` runs as PID 1, mounts /perm, brings up eth0, supervises containerd, listens on :50000 over TLS 1.3.
- Auth (mTLS + adopt + EdDSA JWT) — `capsuled` generates `/perm/tls/server.{crt,key}` on first boot; `capsulectl adopt <host>` pins the server fingerprint TOFU and enrolls the operator's pubkey via the boot-time claim window (zero keys + 30-min timer). Every subsequent RPC requires a per-call EdDSA JWT; replay-protected by a JTI cache. `auth/doc.go` is the source of truth.
- Containers — `capsulectl apply -f` → `containerd` pulls + runs. Host networking works. Bridge networking via CNI works with port mappings (container port ↔ host port via CNI portmap plugin).
- MicroVMs — Firecracker-backed, smolvm-style: shared rootfs + per-VM OCI payload ext4 + vsock agent. `capsulectl exec -t alpine-vm -- /bin/sh` gives you a real interactive shell inside runc inside the VM. `--serial` streams kernel+firecracker output for debugging.
- MicroVM port mapping — iptables DNAT tagged with `capsule-vm:<workload>` comment; teardown finds + removes. Tested with nginx on :8080 reachable via curl.
- MicroVM NAT / DNS — MASQUERADE rule for `172.20.0.0/16`; capsule-guest injects a default `/etc/resolv.conf` (`1.1.1.1`, `8.8.8.8`) into the OCI payload if absent.
- Volumes, unified — thin LVs in VG `capsule` at `/dev/capsule/vol-<name>`, ext4 formatted. Containers mount directly; VMs attach as virtio-blk. Same block device, one mounter at a time.
- Storage substrate — `/perm` is a plain LV in the capsule VG; user volumes are thin LVs in the sibling `thinpool`. VG is initialized on first boot if the PERM partition is blank. (VM payload disks are still flatten-to-ext4 today; unifying them into the thin pool is tracked in Next §2.)
- Lifecycle — `workload stop / start / restart`; `desired_state=STOPPED` persists across capsule reboots.
- Logs + Exec — container workloads via containerd; MicroVM workloads via the vsock agent (`capsule.v1.GuestAgent`). Same `capsulectl` commands, kind-transparent.
- `workload cp` — scp-style file/directory copy in and out of containers and MicroVMs (kubectl-style tar-pipe through Exec; see "Gotchas" below for the rewrite tracked).
- Persistence — SQLite at `/perm/state.db`. Workloads + volumes survive capsule reboots.
- Reconciler spec-drift detection — `apply -f` on a workload whose spec changed (image, command, env, mounts, etc.) recreates the workload instead of silently keeping the old container/VM. The drift check lives in `core/reconciler/`.
- Image push (side-loading) — `capsulectl image push <docker-save.tar | oci-tar>` streams an image into containerd's local cache. Lets workloads run without a reachable registry (air-gapped, private images, etc.). `capsulectl image list` shows the cache.
- Per-capsule hostname — `/perm/capsule/hostname` if set; otherwise derived from the primary MAC. Survives A/B slot flips because it lives on `/perm`, not the rootfs.
- A/B OS updates — `capsulectl capsule update push <bundle.tar>` streams a new rootfs+kernel+initramfs to the inactive slot, GRUB flips active, capsule reboots; tentative-commit on first successful boot, auto-rollback to the known-good slot on health failure. `update confirm` locks it in.
- Breakglass / debug — `capsulectl capsule debug` deploys a privileged container with host PID + bind-mounted `/dev` + `/sys` + `/perm` + `/sbin`/`/usr/sbin`/`/usr/bin`, drops you into an interactive shell, cleans up on exit (or `--keep`).
- System introspection — `capsulectl capsule info` (hostname, kernel, uptime, CPU, memory, boot disk, thinpool fill %), `capsulectl capsule logs -f` (tails `capsuled`'s own slog over gRPC, no SSH needed).
- Real hardware — boots end-to-end on Beelink (Intel Jasper Lake) with UEFI, AHCI/SATA M.2 SSDs, and the linux-lts 6.18 kernel. Same image boots under QEMU.
- mDNS discovery + short IDs — every capsule announces itself as `_capsule._tcp.local.` with `capsule_id`, stable 16-bit short ID (capsule-a3f2), adopted flag, and version. `capsulectl discover` browses the LAN and prints a table with TLS fingerprints fetched directly per row. HDMI banner shows the short ID next to the IP.
- One-pass install — boots in installer mode automatically when running on removable media (USB stick / SD card) with at least one viable internal target (non-removable, large enough, no existing Capsule MBR). In installer mode, capsuled exposes only `InstallService` (unauthenticated, gated by physical USB possession — same trust model as the claim window) and skips containerd, the reconciler, scheduler, and every workload path. `capsulectl install <short-id>` resolves the installer via mDNS, captures + confirms the fingerprint, generates an operator keypair, and drives `InstallService.Install`. The installer writes a fresh MBR (`b1a570ff` BLASTOFF signature), copies the boot + slot_a + slot_b partitions byte-for-byte from the USB to the target, initializes LVM on PERM, and seeds `/perm/firstboot.json` with a pre-generated capsule_id + TLS keypair + the operator's pubkey. On first boot of the target disk, capsuled ingests the bundle, skips the claim window, and comes up already adopted as the context the CLI already saved.
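The Auth entry above says per-call JWTs are replay-protected by a JTI cache. A minimal sketch of that idea follows, assuming the cache only needs to remember a token ID until the token's own `exp`. The type and method names are hypothetical and are not taken from `auth/`; signature verification is assumed to have happened before `Admit` is called.

```go
package main

import (
	"fmt"
	"sync"
)

// jtiCache is a hypothetical replay guard, not Capsule's actual type.
// It remembers each token ID (jti) until that token's own expiry.
type jtiCache struct {
	mu   sync.Mutex
	seen map[string]int64 // jti -> exp (Unix seconds, as in the JWT exp claim)
}

func newJTICache() *jtiCache {
	return &jtiCache{seen: make(map[string]int64)}
}

// Admit reports whether an already-verified token with this jti and exp may
// be accepted at time now (Unix seconds).
func (c *jtiCache) Admit(jti string, exp, now int64) bool {
	c.mu.Lock()
	defer c.mu.Unlock()

	// Forget expired entries: the expiry check below already rejects those
	// tokens, so the cache only has to cover the validity window.
	for id, e := range c.seen {
		if now >= e {
			delete(c.seen, id)
		}
	}
	if now >= exp {
		return false // expired token
	}
	if _, dup := c.seen[jti]; dup {
		return false // same jti seen before: replay
	}
	c.seen[jti] = exp
	return true
}

func main() {
	c := newJTICache()
	const now, exp = int64(1_700_000_000), int64(1_700_000_030)
	fmt.Println(c.Admit("a1", exp, now))    // first use
	fmt.Println(c.Admit("a1", exp, now+1))  // replay inside the validity window
	fmt.Println(c.Admit("a1", exp, now+60)) // presented after expiry
}
```

Bounding the cache by token lifetime keeps memory proportional to the request rate times the validity window, which is why short-lived per-call tokens pair well with this kind of guard.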
Designs we've committed to on paper but not yet started building. Each is its own doc under docs/; read before touching the relevant code area.
- docs/encrypted-volumes.md — Per-volume LUKS2 with a TPM-sealed node master, recovery codes printed once at create/init. The full failure-mode matrix is in the doc.
- docs/external-disks.md — Pools as a first-class concept: attach / adopt / detach secondary and external (USB) disks, with optional whole-pool or per-volume encryption. Companion to encrypted-volumes.
- docs/pci-devices.md — Operator-registered device passthrough for containers (GPUs, FPGAs, USB-serial). v1 is containers-only; microVM passthrough waits on a non-Firecracker backend (see Next #7).
- docs/fabric.md — A WireGuard mesh between capsules with declarative per-workload allow-list policy. Stable cross-host workload IPs in `100.64.0.0/10`. Default-deny. Operator-driven enrollment, no central control plane.
- docs/edge.md — Exposing fabric workloads to the public internet via a capsule-managed Caddy. Direct DNS or behind a cloud LB / CDN. Builds on the fabric proposal.
- docs/live-migration.md — Operator-orchestrated migration between capsules for containers and microVMs: CRIU paths, optional backend snapshots, and LVM-thin volume transfer/cutover.
Today runtime/microvm/firecracker/driver.go launches the firecracker binary directly as a subprocess of capsuled (which runs as PID 1, root, all caps). We get KVM + Firecracker's built-in seccomp for free, and that's genuinely strong — but Firecracker ships a companion jailer binary (already installed in the image at /usr/bin/jailer) that we're not using. Wiring it up gives defense-in-depth if anyone ever finds a KVM escape:
- chroot per VM
- network / PID / mount namespaces around each VMM
- cap drops (VMM stops being root)
- cgroup memory/CPU limits on the VMM process
- extra seccomp policy layer
For single-tenant homelab the current setup is fine. This matters most if/when we run untrusted, multi-tenant, or compliance-adjacent workloads. Not a rewrite: swap fc.VMCommandBuilder{}.WithBin(...) for the jailer path and configure fc.Config.JailerCfg. ~1 day of work.
Expose resources: { cpuMillis, memoryMib, pidsMax } in MicroVMSpec. Translate to linux.resources in the OCI config capsule-guest writes. runc already enforces. Kubernetes-style limits without the indirection.
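A rough sketch of that translation, assuming the field names from the text and the standard CFS period of 100 ms. The struct types are local stand-ins covering only the `linux.resources` keys used here (`cpu.quota`, `cpu.period`, `memory.limit`, `pids.limit`), and treating zero as "no limit" is an assumption rather than a decided behavior.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Resources mirrors the proposed MicroVMSpec field; names follow the text.
type Resources struct {
	CPUMillis int64 `json:"cpuMillis"`
	MemoryMiB int64 `json:"memoryMib"`
	PidsMax   int64 `json:"pidsMax"`
}

// Local stand-ins for the subset of OCI linux.resources this sketch fills in.
type ociCPU struct {
	Quota  int64  `json:"quota"`  // microseconds of CPU time per period
	Period uint64 `json:"period"` // period length in microseconds
}
type ociMemory struct {
	Limit int64 `json:"limit"` // bytes
}
type ociPids struct {
	Limit int64 `json:"limit"`
}
type ociResources struct {
	CPU    *ociCPU    `json:"cpu,omitempty"`
	Memory *ociMemory `json:"memory,omitempty"`
	Pids   *ociPids   `json:"pids,omitempty"`
}

const cfsPeriodUs = 100_000 // 100 ms

// toOCI converts the operator-facing limits into linux.resources.
// Assumption: a zero field means "no limit" and is omitted.
func toOCI(r Resources) ociResources {
	var out ociResources
	if r.CPUMillis > 0 {
		// 1000 millis == one full CPU == quota equal to the period.
		out.CPU = &ociCPU{Quota: r.CPUMillis * cfsPeriodUs / 1000, Period: cfsPeriodUs}
	}
	if r.MemoryMiB > 0 {
		out.Memory = &ociMemory{Limit: r.MemoryMiB * 1024 * 1024}
	}
	if r.PidsMax > 0 {
		out.Pids = &ociPids{Limit: r.PidsMax}
	}
	return out
}

func main() {
	b, err := json.MarshalIndent(toOCI(Resources{CPUMillis: 500, MemoryMiB: 256, PidsMax: 128}), "", "  ")
	if err != nil {
		fmt.Println("marshal:", err)
		return
	}
	fmt.Println(string(b))
}
```

With these inputs, half a CPU becomes a quota of 50000 µs per 100000 µs period and 256 MiB becomes 268435456 bytes.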
Natural follow-up to the LVM thin migration — the user-volume half landed and was verified end-to-end on the VPS; the VM-payload half was explicitly deferred because of a containerd/LVM ownership clash documented below. Pick this up when you're next in the storage code; ~2 days of work on top of the groundwork already in place.
Today user volumes live in the capsule VG's thin pool, but VM payload disks still go through the pre-LVM flatten-to-ext4 path (~30 s on alpine; no block-level CoW between identical VMs). The right architecture (fly.io pattern) is to put VM payloads in the same thin pool via containerd's devmapper snapshotter so 10 identical VMs share image blocks until they write.
Blocker: containerd's devmapper snapshotter wants to own a thin pool's device-id allocation. LVM-managed pools don't expose the internal -tpool dm device usefully, and the LVM-visible pool LV rejects the thin-pool messages containerd sends. Fix is to create the thin pool via dmsetup directly (data + metadata LVs backing it, still LVM-managed, but the pool target itself is dmsetup-constructed). Then pool_name in containerd config matches a dm device containerd fully owns.
Scope: rework boot/boot_linux.go:initializeCapsuleVG to create raw thinmeta/thindata LVs and then dmsetup create capsule-thinpool over them. Re-enable devmapper as default snapshotter in image/etc/containerd-config.toml. Revert runtime/microvm/firecracker/image.go:preparePayloadDisk to the snapshot-prepare path (git history has the version — the commit that this note first appeared in also contains the snapshot-based implementation that was reverted). Re-verify end-to-end on the VPS with two identical alpine VMs: the second should boot under 5 s and lvs should show the thin pool dedup'ing extents.
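The dmsetup step in that scope could look roughly like the sketch below, which only builds the thin-pool table line and the `dmsetup create` argv without running anything. The device paths under `/dev/capsule/`, the sizes, and the helper names are illustrative assumptions; the table layout and block-size limits follow the kernel's thin-provisioning target documentation.

```go
package main

import (
	"fmt"
	"strings"
)

// thinPoolTable builds a device-mapper table line for a thin-pool target:
//
//	0 <length> thin-pool <metadata dev> <data dev> <data block size> <low water mark>
//
// Length and block size are in 512-byte sectors; the low water mark is in blocks.
func thinPoolTable(lengthSectors uint64, metaDev, dataDev string, blockSectors, lowWaterBlocks uint64) (string, error) {
	// The kernel accepts block sizes from 64 KiB (128 sectors) to 1 GiB
	// (2097152 sectors), in multiples of 64 KiB.
	if blockSectors < 128 || blockSectors > 2097152 || blockSectors%128 != 0 {
		return "", fmt.Errorf("data block size %d sectors: want a multiple of 128 in [128, 2097152]", blockSectors)
	}
	if lengthSectors == 0 || metaDev == "" || dataDev == "" {
		return "", fmt.Errorf("length and both device paths are required")
	}
	return fmt.Sprintf("0 %d thin-pool %s %s %d %d",
		lengthSectors, metaDev, dataDev, blockSectors, lowWaterBlocks), nil
}

// dmsetupCreateArgs returns the argv for creating the pool device.
func dmsetupCreateArgs(name, table string) []string {
	return []string{"dmsetup", "create", name, "--table", table}
}

func main() {
	// Hypothetical 10 GiB data LV with 64 KiB blocks.
	table, err := thinPoolTable(20971520, "/dev/capsule/thinmeta", "/dev/capsule/thindata", 128, 32768)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(strings.Join(dmsetupCreateArgs("capsule-thinpool", table), " "))
}
```

Keeping table construction separate from execution makes the boot path easy to unit-test before it ever touches a real disk.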
Now that every user volume is a thin LV in the capsule VG's thin pool, snapshot/backup/migration is mostly plumbing over existing LVM primitives. Rough order:
- Phase B — Local snapshots. `capsulectl volume snapshot <vol> [--name v1]` → `lvcreate -s` (instant, thin, shares extents with source). `capsulectl volume snapshots list <vol>` and `volume restore <vol> <snap>` (creates a new LV from the snapshot). Retention rotation as a scheduled job. Same semantics as fly.io's default daily snapshots + 5-day retention.
- Phase C — Offline export/import. `capsulectl volume export <vol> [--snapshot]` → snapshot then stream `dd | zstd` to stdout or an image file. Matching `volume import <name> <file>`. Covers the 90% "back this up somewhere" case — no new daemons, just shell-out pipelines.
- Phase D — Cross-capsule live migration (dm-clone + iSCSI). Source exports the snapshot as an iSCSI LUN; destination stacks `dm-clone` over an empty LV with the iSCSI export as the remote source. VM boots immediately on the destination; blocks hydrate in the background. DISCARD pass-through on the guest short-circuits empty space. This is fly.io's machine-migration mechanism. ~2-3 weeks of real work, wait until there's actually a second capsule to migrate between.
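Phase B's retention rotation is the one piece that is pure logic, so here is a small sketch of it. It assumes each snapshot carries a creation time in Unix seconds and that "5-day retention" means anything strictly older than five days is removed; the `Snapshot` type and `expired` function are hypothetical names.

```go
package main

import (
	"fmt"
	"sort"
)

// Snapshot is a hypothetical record of one thin snapshot LV.
type Snapshot struct {
	Name        string
	CreatedUnix int64 // creation time, Unix seconds
}

// expired returns the names of snapshots older than keepDays at nowUnix,
// oldest first, so a scheduled job can remove them in order.
func expired(snaps []Snapshot, nowUnix int64, keepDays int) []string {
	cutoff := nowUnix - int64(keepDays)*86400

	// Sort a copy so the caller's slice is left untouched.
	sorted := append([]Snapshot(nil), snaps...)
	sort.Slice(sorted, func(i, j int) bool {
		return sorted[i].CreatedUnix < sorted[j].CreatedUnix
	})

	var out []string
	for _, s := range sorted {
		if s.CreatedUnix < cutoff {
			out = append(out, s.Name)
		}
	}
	return out
}

func main() {
	const now = int64(1_700_000_000)
	var snaps []Snapshot
	for age := 0; age <= 6; age++ { // one snapshot per day, 0..6 days old
		snaps = append(snaps, Snapshot{
			Name:        fmt.Sprintf("vol-data-d%d", age),
			CreatedUnix: now - int64(age)*86400,
		})
	}
	fmt.Println(expired(snaps, now, 5)) // only the 6-day-old snapshot qualifies
}
```

A snapshot exactly five days old is kept in this sketch; whether the boundary should be inclusive is a policy choice the real job would need to settle.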
Adjacent, scoped-out as separate proposals: at-rest encryption (docs/encrypted-volumes.md) layers LUKS2 per-volume with a TPM-sealed node master, and multi-disk pools (docs/external-disks.md) makes the second/external disk a first-class operator concept. Both are independent from this lifecycle work — read the proposals before designing snapshot/export wire formats so the same keys carry through.
Accumulated lint/dead code spotted in passing. Batch into a single cleanup PR when someone's in the area:
- `runtime/container/driver.go` — unused `errNotFound` sentinel near EOF; unused `_ = strings.ToLower` import-suppression hack above it.
- `boot/boot_linux.go:38` — `initPlatform(ctx)` takes a `context.Context` that's no longer used; either use it or drop it.
- `boot/boot_linux.go:279` — switch `strings.Split` → `strings.SplitSeq` per analyzer (Go 1.25+ perf nit).
- `capsulectl workload list` should show declared port mappings (today they're only in `workload get`).
- `capsulectl workload get` could print a human-friendly summary on top of the JSON.
- `capsulectl volume mount <name> <path>` — mount a volume LV on the capsule shell for inspection without a workload.
- `--wait` flag on `apply`/`start`/`restart` — block until phase=Running.
capsulectl --capsule a.example.com,b.example.com workload list — sequential fan-out, tabled output. The named-capsules half of this already shipped with the auth work (~/.config/capsule/config.yaml with named contexts; --capsule resolves "name | host:port | $CAPSULE_HOST | current context"). What's missing is fan-out across multiple targets in one command, and the cross-capsule workload reachability story.
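One possible reading of that resolution order, extended to a comma-separated list, is sketched below. Everything here is an assumption for illustration: the function name, the idea that a bare host gets the default port 50000, and the fallback chain (flag, then `$CAPSULE_HOST`, then current context). The real CLI may resolve differently.

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"strings"
)

const defaultPort = "50000" // capsuled's listen port

// resolveTargets turns a --capsule value into an ordered, de-duplicated list
// of host:port addresses. contexts maps a named context to its address.
func resolveTargets(flag, envHost, current string, contexts map[string]string) ([]string, error) {
	raw := flag
	if raw == "" {
		raw = envHost // $CAPSULE_HOST
	}
	if raw == "" {
		raw = current // current context name
	}

	var out []string
	seen := make(map[string]bool)
	for _, tok := range strings.Split(raw, ",") {
		tok = strings.TrimSpace(tok)
		if tok == "" {
			continue
		}
		addr, isContext := contexts[tok]
		if !isContext {
			addr = tok
			// A bare host has no port; append the default one.
			if _, _, err := net.SplitHostPort(tok); err != nil {
				addr = net.JoinHostPort(tok, defaultPort)
			}
		}
		if !seen[addr] {
			seen[addr] = true
			out = append(out, addr)
		}
	}
	if len(out) == 0 {
		return nil, errors.New("no capsule target given and no current context")
	}
	return out, nil
}

func main() {
	contexts := map[string]string{"lab": "10.0.0.5:50000"}
	targets, err := resolveTargets("lab,b.example.com", "", "", contexts)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	// Sequential fan-out: one target at a time, in the order given.
	for _, t := range targets {
		fmt.Println("would run `workload list` against", t)
	}
}
```

Resolving every target up front means a typo in the third name fails before any RPC is sent to the first two.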
The reachability piece is its own proposal: docs/fabric.md — a WireGuard mesh between capsules with declarative per-workload allow-list policy. And the natural follow-up, docs/edge.md, exposes fabric workloads to the public internet via a Caddy-on-Capsule edge. Fleet fan-out at the CLI layer and the fabric at the network layer are independent work — either can ship first — but they meet at capsulectl fabric enroll <a> <b>, which is the first CLI verb that takes two capsules.
Reserved slots in MicroVMBackend: SMOLVM, QEMU. Adding a QEMU driver unlocks virtiofs for volume sharing (if we ever want it) and PCI passthrough (for GPU/NIC workloads on real hardware). The motivating use case and the operator-facing shape are designed in docs/pci-devices.md — v1 of that proposal is containers-only specifically because Firecracker has no PCI bus; bringing up smolvm/libkrun is what unblocks microVM passthrough.
capsulectl capsule debug currently uses alpine:3.20 and tells the operator to apk add lvm2 e2fsprogs iptables iproute2 once they're inside. That works (lvm2 etc. happily talk to the host's LVM via the bind-mounted /dev + /sys + /perm) but adds 5–10 seconds to the first session and needs network access to dl-cdn.alpinelinux.org. Replace with a small purpose-built image we publish to a registry — ghcr.io/<org>/capsule-debug:<version> — with the toolchain baked in: lvm2, e2fsprogs, iptables, iproute2, util-linux, blkid, strace, tcpdump, lsof, plus host bin compatibility (the bind-mounted host /sbin/lvs etc. work directly when the image's libdir matches Alpine's). Make the default image override-able with --image so operators can use their own. Keep the alpine fallback documented for air-gapped environments.
Things that currently work but are brittle, hacky, or "good enough for now."
- VM IP allocation is hash-of-workload-name → `172.20.254.X`, X = `(hash % 252) + 2`. With only 252 slots, a collision is more likely than not by about 20 VMs (birthday bound). Swap for a real IPAM that tracks allocations in sqlite.
- MASQUERADE is a blanket rule on all traffic leaving the capsule. Fine for homelab; needs narrowing if Capsule ever lives on a shared L2.
- Port mapping doesn't consider conflicts. Two VMs both declaring `hostPort: 8080` will both install DNAT rules; whichever got there first wins. Validate at Apply time.
- No cross-capsule reachability for workloads. Each capsule's `br0` is a private island. The design for fixing this lives in docs/fabric.md (WireGuard mesh + per-workload policy) and the public-exposure half in docs/edge.md. Both are still proposals, not implementation.
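To put a number on the VM IP collision risk in the first item, the sketch below computes the birthday-style collision probability over 252 slots and shows the shape of an allocator that tracks assignments instead of hashing. The FNV-1a hash is a stand-in (the text does not say which hash the driver uses), and the in-memory maps stand in for the SQLite table the text proposes.

```go
package main

import (
	"errors"
	"fmt"
	"hash/fnv"
)

// hashOctet mirrors X = (hash % 252) + 2, giving a last octet in 2..253.
// FNV-1a is an assumption; any 32-bit hash behaves the same for this purpose.
func hashOctet(name string) int {
	h := fnv.New32a()
	h.Write([]byte(name))
	return int(h.Sum32()%252) + 2
}

// collisionProbability is the chance that at least two of n uniformly hashed
// names land in the same one of `slots` buckets.
func collisionProbability(n, slots int) float64 {
	if n > slots {
		return 1
	}
	unique := 1.0
	for i := 0; i < n; i++ {
		unique *= float64(slots-i) / float64(slots)
	}
	return 1 - unique
}

// ipam hands out the lowest free octet and remembers it per workload.
type ipam struct {
	byName map[string]int
	used   map[int]bool
}

func newIPAM() *ipam {
	return &ipam{byName: make(map[string]int), used: make(map[int]bool)}
}

// Allocate is idempotent per workload name and never reuses a live octet.
func (a *ipam) Allocate(name string) (int, error) {
	if o, ok := a.byName[name]; ok {
		return o, nil
	}
	for o := 2; o <= 253; o++ {
		if !a.used[o] {
			a.used[o] = true
			a.byName[name] = o
			return o, nil
		}
	}
	return 0, errors.New("address pool exhausted")
}

func main() {
	fmt.Printf("hash octet for alpine-vm: %d\n", hashOctet("alpine-vm"))
	for _, n := range []int{5, 10, 20, 40} {
		fmt.Printf("%2d VMs: P(collision) = %.2f\n", n, collisionProbability(n, 252))
	}
	a := newIPAM()
	x, _ := a.Allocate("alpine-vm")
	y, _ := a.Allocate("nginx-vm")
	fmt.Printf("tracked: 172.20.254.%d and 172.20.254.%d\n", x, y)
}
```

A tracked allocator also needs a release path on workload removal, which this sketch leaves out.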
- TAP teardown race was hit and fixed: Firecracker's `m.StopVMM` sends SIGTERM and returns; the TAP device was still open when `ip link del` ran. Now we `m.Wait()` for the process to exit before teardownTAP, and `setupTAP` has a nuke-and-retry path for stale busy TAPs. If this bites again in a new way, look at the same area.
- vsock.uds file lingers on Firecracker crash. Driver pre-unlinks `vsock.uds` and `api.sock` on every Start. Same for the payload disk dir.
- Exec -t over vsock has no window-resize — we get the ExecResize message from the client but runc's CLI has no way to forward it mid-session. Needs a console-socket protocol. Low priority — most interactive use is `-t` for a short command; real sessions via `capsulectl exec` work with 80x24.
- Guest ready timeout is 60s, generous to cover contended cloud VMs. Could be smarter (e.g., adapt based on kernel boot time).
- Volume-flush on Stop depends on in-memory `d.vms[name]`. `agent.Stop` now unmounts every user volume before returning so ext4 commits its journal before Firecracker dies — but it only fires when `Driver.Remove` dials the guest agent over the existing in-memory `guestConn`. If capsuled was restarted (zombie VM still running, in-memory `d.vms` map empty), `Remove` falls through to `Shutdown`/`StopVMM` directly and the umount never happens → silent data loss recurs. Fix shape: have `Remove` re-dial the guest agent fresh from `<vmDir>/vsock.uds` when the in-memory entry is missing. Tied to the broader "reconnect to live VMs after capsuled restart" gap (today the reconciler will try to create a second VM with the same name and fail at TAP/socket conflicts).
- No concurrent-mount protection beyond what the kernel gives you. `volume list` shows `MOUNTED_BY` but nothing prevents a user from declaring the same volume on two containers; the second mount fails at runtime (ext4 refuses). Enforce at Apply time.
- `volume delete` after a crashed workload may leave `/run/capsule/mounts/<workload>/` dirs. `unmountContainerVolumes` best-effort cleans these.
- Thin pool exhaustion is fatal. Overcommitted volumes + a guest that fills one → pool ENOSPC → writes to every thin LV in the pool start failing. Need to configure `thin_pool_autoextend_threshold` / `thin_pool_autoextend_percent` in `/etc/lvm/lvm.conf` and expose pool fill % as a capsule metric. Capsule doesn't yet.
- Volume resize is grow-only. `resize2fs` can shrink ext4 but requires `e2fsck -f` first and is dangerous; not exposed. Must be detached — the `refsTo` check enforces this.
- Volumes are unencrypted at rest. A stolen disk reveals everything. The fix is designed in docs/encrypted-volumes.md (LUKS2 + TPM-sealed node master + per-volume + master recovery codes); the secondary-disk story builds on top in docs/external-disks.md. Both still proposals.
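The "enforce at Apply time" check for double-declared volumes could be as small as the sketch below. The `Workload` shape and function name are hypothetical, and it assumes a re-apply of the same workload name replaces the earlier spec rather than conflicting with it.

```go
package main

import "fmt"

// Workload is a hypothetical, trimmed-down view of an applied spec.
type Workload struct {
	Name    string
	Volumes []string // names of volumes the workload mounts
}

// checkVolumeConflicts rejects an incoming spec that mounts a volume already
// declared by a different workload, or that lists the same volume twice.
func checkVolumeConflicts(existing []Workload, incoming Workload) error {
	owner := make(map[string]string) // volume -> workload that declares it
	for _, w := range existing {
		if w.Name == incoming.Name {
			continue // re-apply: the old spec is being replaced
		}
		for _, v := range w.Volumes {
			owner[v] = w.Name
		}
	}

	seen := make(map[string]bool)
	for _, v := range incoming.Volumes {
		if seen[v] {
			return fmt.Errorf("workload %q declares volume %q twice", incoming.Name, v)
		}
		seen[v] = true
		if o, taken := owner[v]; taken {
			return fmt.Errorf("volume %q is already declared by workload %q", v, o)
		}
	}
	return nil
}

func main() {
	existing := []Workload{{Name: "db", Volumes: []string{"pgdata"}}}
	fmt.Println(checkVolumeConflicts(existing, Workload{Name: "backup", Volumes: []string{"pgdata"}}))
	fmt.Println(checkVolumeConflicts(existing, Workload{Name: "db", Volumes: []string{"pgdata"}}))
}
```

The same owner-map pattern would cover the `hostPort` conflict check mentioned under networking, keyed by port instead of volume name.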
- Buildkit + `mknod` — overlayfs doesn't allow character-device mknod in Docker RUN, so `vm-shared.ext4` dev nodes are injected via `debugfs` after `mkfs.ext4 -d`. Requires `e2fsprogs-extra` in the rootfs. Works but unusual enough to be surprising.
- SCP with `-C` (compression) corrupts 2.7 GB disk.raw uploads on at least one macOS→Linux path we hit. Use `scp` (no -C) for disk.raw. Needs investigation — may be an ssh client bug.
- pack.sh preserves /perm across rebuilds by extracting the existing disk.raw's partition 3 before repacking. If someone runs `make clean` they wipe all capsule state. Make sure that's intended before ripping it out.
- Firecracker is apt-less. We pull a static upstream release tarball (`v1.10.1`) in the Dockerfile. Bump pins carefully — `firecracker-go-sdk v1.0.0` was verified against this.
- Firecracker CI kernel 6.1.128 is required for `CONFIG_CGROUP_BPF=y` (runc 1.2+). The old 4.14 quickstart kernel doesn't have BPF cgroup support and `runc run` fails with `bpf_prog_query(BPF_CGROUP_DEVICE): invalid argument`. Don't swap back.
- Reconciler is serial — one Tick runs `reconcileOne` for every workload sequentially. A slow EnsureRunning (15-20s for a VM that cold-starts) blocks the rest. Acceptable at homelab scale; doesn't scale to 50+ workloads. Parallelize with a worker pool + per-workload mutex eventually.
- `boot.ExecMu` is a coarse mutex for all exec.Command calls vs the reap loop. Correct but pessimistic. Long-running `runc exec` sessions (interactive shells) hold it the whole time, so orphans pile up until the exec returns. PR-level fix: waitid(WNOWAIT) peek + skip tracked PIDs.
- Go's `exec.Cmd.Wait` vs Wait4(-1) race was the original source of pain. The mutex is the right fix at our scale. Revisit if/when we do a PID-tracking reaper.
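The "worker pool + per-workload mutex" shape for the reconciler might look like the sketch below. It is an illustration under assumptions, not the reconciler's code: `reconcileAll`, `keyedMutex`, and `errSlow` are hypothetical names, and the callback stands in for `reconcileOne`.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

var errSlow = errors.New("workload did not become ready")

// keyedMutex hands out one mutex per workload name, so two passes over the
// same workload never overlap while different workloads run in parallel.
type keyedMutex struct {
	mu    sync.Mutex
	locks map[string]*sync.Mutex
}

// lock acquires the mutex for name and returns its unlock function.
func (k *keyedMutex) lock(name string) func() {
	k.mu.Lock()
	if k.locks == nil {
		k.locks = make(map[string]*sync.Mutex)
	}
	m, ok := k.locks[name]
	if !ok {
		m = &sync.Mutex{}
		k.locks[name] = m
	}
	k.mu.Unlock()
	m.Lock()
	return m.Unlock
}

// reconcileAll runs fn once per queued name using at most `workers`
// goroutines and returns the errors keyed by workload name.
func reconcileAll(names []string, workers int, km *keyedMutex, fn func(name string) error) map[string]error {
	if workers < 1 {
		workers = 1
	}
	jobs := make(chan string)
	errs := make(map[string]error)
	var (
		wg     sync.WaitGroup
		errsMu sync.Mutex
	)
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for name := range jobs {
				unlock := km.lock(name)
				err := fn(name)
				unlock()
				if err != nil {
					errsMu.Lock()
					errs[name] = err
					errsMu.Unlock()
				}
			}
		}()
	}
	for _, n := range names {
		jobs <- n
	}
	close(jobs)
	wg.Wait()
	return errs
}

func main() {
	errs := reconcileAll([]string{"web", "db", "alpine-vm"}, 2, &keyedMutex{}, func(name string) error {
		if name == "alpine-vm" {
			return errSlow // stand-in for a cold-starting VM that timed out
		}
		return nil
	})
	fmt.Println(errs)
}
```

With this shape a 15-20 s VM cold start occupies one worker while the others keep draining the queue; the worker count caps how many slow starts can run at once.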
- No metrics. No Prometheus endpoint, no `capsulectl capsule stats`. Worth adding once the fleet CLI exists.
- `capsulectl workload events` — the reconciler could write a rolling event log (applied → pulling → starting → running) visible to operators.
- mTLS + EdDSA JWT shipped — see the "What's working today" entry. Trust boundary is now the operator's pubkey + the pinned server fingerprint. Keys live at `/perm/tls/server.{crt,key}`; authorized operator pubkeys live in SQLite.
- PCI/GPU passthrough for containers is designed in docs/pci-devices.md but not yet implemented. Today there is no way to give a workload host hardware beyond standard `/dev` nodes.
- Workloads run as root inside the container by default. User namespaces are doable via runc config but not wired. Acceptable for homelab; revisit if multi-tenant.
- `capsulectl capsule debug` deploys a privileged container with host PID + host mount + bind-mounted `/dev` + `/sys` + `/perm` — anyone with operator credentials can compromise the host. mTLS + the authorized-keys table is the only gate; treat operator JWT loss as full-host compromise.
Things people ask for that are intentionally out of scope for Capsule:
- Kubernetes compatibility. Capsule is a smaller shape on purpose. If you want k8s, run k3s inside a capsule.
- Autoscaling. That's the future orchestrator's job, not the capsule's.
- Web UI. Capsule is CLI / API first; a UI on top is welcome as a separate project.
- Cluster gossip / auto-discovery. Apart from the mDNS announcement that lets an operator find a capsule on the LAN, capsules never advertise themselves to a discovery bus, never auto-join a cluster, never elect leaders. The fabric proposal (docs/fabric.md) adds operator-driven peering between capsules (encrypted, point-to-point WireGuard) but is explicit about not doing gossip — peers are added by `capsulectl fabric enroll`, not by broadcasting. The CLI fans out to the fleet; nothing on the network does.