
Create installation VM and run bootc install inside a VM using rootless podman #95


Draft · wants to merge 7 commits into base: main

Conversation

@alicefr commented Jun 16, 2025

This PR adds a new install command to run bootc using rootless podman inside a VM. First, the make image target builds a container image that includes the virtualization stack.

The install command creates a container with libvirt and boots the VM from the bootc image. Together with the VM, it spawns a proxy that creates a unix socket and connects to the VSOCK port of the installation VM. Once the connection is established, it can remotely control podman inside the VM and execute the bootc installation.

The output disk image is passed to the VM and identified by the device /dev/virtio-output. This path remains constant.

Note
The image used by the install command hasn't been released yet (since it is part of this PR), so you need to build it locally before you can run the command. You can build it by running make image.

Example:

$ make image
$ podman-bootc install --log-level debug --bootc-image quay.io/fedora/fedora-bootc:42 \
--output-dir $(pwd)/output --output-image output.qcow2  --config-dir $(pwd)/config \
-- bootc install to-disk /dev/disk/by-id/virtio-output --wipe


Installing image: docker://localhost/demo:latest
Digest: sha256:1de15d5f907b2b790850b61225f6beda334b9fccd730caf4e6d61f6ee5620fc4
Wiping /dev/vdb1
Wiping device /dev/vdb1
Wiping /dev/vdb2
Wiping device /dev/vdb2
/dev/vdb2: 8 bytes were erased at offset 0x00000052 (vfat): 46 41 54 33 32 20 20 20
/dev/vdb2: 1 byte was erased at offset 0x00000000 (vfat): eb
/dev/vdb2: 2 bytes were erased at offset 0x000001fe (vfat): 55 aa
Wiping /dev/vdb3
Wiping device /dev/vdb3
/dev/vdb3: 4 bytes were erased at offset 0x00000000 (xfs): 58 46 53 42
Wiping /dev/disk/by-id/virtio-output
Wiping device /dev/disk/by-id/virtio-output
/dev/disk/by-id/virtio-output: 8 bytes were erased at offset 0x00000200 (gpt): 45 46 49 20 50 41 52 54
/dev/disk/by-id/virtio-output: 8 bytes were erased at offset 0x27ffffe00 (gpt): 45 46 49 20 50 41 52 54
/dev/disk/by-id/virtio-output: 2 bytes were erased at offset 0x000001fe (PMBR): 55 aa
/dev/disk/by-id/virtio-output: calling ioctl to re-read partition table: Success
Block setup: direct
       Size: 10737418240
     Serial: <unknown>
      Model: <unknown>
Checking that no-one is using this disk right now ... OK

Disk /dev/vdb: 10 GiB, 10737418240 bytes, 20971520 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes

>>> Script header accepted.
>>> Script header accepted.
>>> Created a new GPT disklabel (GUID: 6F1895B6-024E-45C6-9BC9-494E4F65DF24).
/dev/vdb1: Created a new partition 1 of type 'BIOS boot' and of size 1 MiB.
/dev/vdb2: Created a new partition 2 of type 'EFI System' and of size 512 MiB.
/dev/vdb3: Created a new partition 3 of type 'Linux filesystem' and of size 9.5 GiB.
/dev/vdb4: Done.

New situation:
Disklabel type: gpt
Disk identifier: 6F1895B6-024E-45C6-9BC9-494E4F65DF24

Device       Start      End  Sectors  Size Type
/dev/vdb1     2048     4095     2048    1M BIOS boot
/dev/vdb2     4096  1052671  1048576  512M EFI System
/dev/vdb3  1052672 20971486 19918815  9.5G Linux filesystem

The partition table has been altered.
Calling ioctl() to re-read partition table.
Syncing disks.
Creating root filesystem (xfs) on device /dev/vdb3 (size=10.2 GB)
> mkfs.xfs -f -m uuid=94206f76-a826-4238-b526-aa5a029fb260 -L root /dev/vdb3
Creating ESP filesystem
> mkfs.fat /dev/vdb2 -n EFI-SYSTEM
Initializing ostree layout
layers already present: 0; layers needed: 67 (1.7 GB)


Deploying container image...done (11 seconds)
Running bootupctl to install bootloader
> bootupctl backend install --write-uuid --update-firmware --auto --device /dev/vdb /run/bootc/mounts/rootfs
Installed: grub.cfg
Trimming root
.: 9.5 GiB (10198429696 bytes) trimmed
Finalizing filesystem root
Unmounting filesystems
Installation complete!

Main differences with podman machine:

  • It is a VM configured specifically for bootc. For example, it mounts the expected shared mount points with the host at boot
  • It shares the container storage from the host. This is useful for caching the container images and eventually sharing them among multiple installation VMs
  • The disk image is smaller than the one provided by podman machine (684 MB against ~1 GB). Eventually, we could trim the size down further by building an appliance with supermin
  • It uses the libvirt API for creating the VM
  • It uses vsock for the socket communication and has a serial console. There is no need to SSH into the VM or to set up an SSH server

[diagram: podman-bootc]

sourcery-ai bot commented Jun 16, 2025

Reviewer's Guide

This PR introduces an install command that builds and extracts a VM image as a standard container, defines and launches a libvirt-managed VM with virtiofs and vsock mounts, and orchestrates running the bootc install workflow inside that VM via rootless Podman.

Sequence diagram for the new install command workflow

sequenceDiagram
    actor User
    participant CLI as podman-bootc install
    participant Podman
    participant VMBuilder as VM Image Builder
    participant Libvirt
    participant VM
    participant Proxy as VSOCK Proxy

    User->>CLI: Run 'podman-bootc install ... -- bootc install ...'
    CLI->>Podman: Build VM image container (make image)
    CLI->>Podman: Extract disk image from container
    Podman-->>CLI: Disk image extracted
    CLI->>Libvirt: Define and start VM with disk image
    Libvirt->>VM: Launch VM with virtiofs and vsock
    CLI->>Proxy: Start vsock proxy (unix <-> vsock)
    Proxy->>VM: Bridge unix socket to VM's vsock port
    CLI->>Podman: Connect to Podman API inside VM via proxy
    Podman->>VM: Run 'bootc install' inside VM
    VM-->>Podman: bootc install output
    Podman-->>CLI: Installation complete
    CLI-->>User: Output results

File-Level Changes

Add Makefile support for building VM image (Makefile)
  • Define registry and VM image variables
  • Add image target to build VM image via Podman

Include libvirtxml dependency (go.mod)
  • Add libvirt.org/go/libvirtxml to module dependencies

Implement Podman client bindings and disk extraction (pkg/podman/podman.go)
  • Create and export containers to extract embedded disk
  • Unpack disk archive via tar reader
  • Connect and retry against Podman socket
  • Run bootc inside containers and stream logs
  • Provide default Podman socket and storage paths

Implement VM domain builder and disk info parsing (pkg/vm/domain/domain.go)
  • Define domain options for memory, CPU, disks, filesystems, vsock
  • Use qemu-img JSON output to detect disk format

Add install command for VM-based bootc installation (cmd/install.go)
  • Parse CLI flags and filter command line args
  • Extract disk image and detect formats
  • Instantiate and run InstallVM, then stop on exit
  • Start vsock proxy and execute bootc install inside VM

Introduce vsock proxy for host-VM communication (pkg/vsock/proxy.go)
  • Listen on host Unix socket
  • Bridge data between Unix socket and VM vsock
  • Handle shutdown and cleanup of proxy

Implement InstallVM lifecycle via libvirt (pkg/vm/installvm.go)
  • Generate random domain names and UUIDs
  • Marshal and define domain XML
  • Start, destroy, and undefine VM domains

Add Fedora-based Containerfile for VM image creation (containerfiles/vm/Containerfile)
  • Download and verify Fedora cloud QCOW2 image
  • Customize disk with guestfs (install packages, fstab entries, services)
  • Embed final disk.img into a scratch container

Add generic pointer utility (pkg/utils/pointer.go)
  • Implement Ptr[T] helper to return a pointer from a value
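
The pointer helper is small enough to sketch in full; this is the conventional generic form implied by the description above (an illustration, not necessarily the exact file contents):

```go
package utils

// Ptr returns a pointer to the given value, which is handy for
// populating libvirtxml fields that take *T.
func Ptr[T any](v T) *T {
	return &v
}
```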


@alicefr alicefr marked this pull request as draft June 16, 2025 11:51
@alicefr (Author) commented Jun 16, 2025

@germag @cgwalters I added a first PR to create the installation VM (not for testing it yet). I'd love to have your opinion on the approach. I still find the install command pretty lengthy, and the fact that the user needs to specify the full bootc command line after the -- isn't ideal IMO either.

sourcery-ai bot left a comment

Hey @alicefr - I've reviewed your changes and found some issues that need to be addressed.

Blocking issues:

  • Inverted error check on Podman connection
  • Error message formatting has mismatched verbs and arguments

General comments:

  • In ExtractDiskImage the Podman connection error check is inverted (if err == nil { return err }), so it never returns real errors—change it to if err != nil.
  • The temporary container created for exporting the VM disk is never removed, leading to dangling containers; add a cleanup step (e.g. containers.Remove) after export.
  • fetchLogsAfterExit wraps a non-zero exit code in an error but never returns it, so failures are ignored—update it to return that error to propagate bootc errors.

## Individual Comments

### Comment 1
<location> `pkg/podman/podman.go:86` </location>
<code_context>
+		return err
+	}
+	ctx, err := bindings.NewConnection(context.Background(), fmt.Sprintf("unix:%s", socketPath))
+	if err == nil {
+		return err
+	}
</code_context>

<issue_to_address>
Inverted error check on Podman connection

The check should be `if err != nil { return err }` to handle errors correctly.
</issue_to_address>
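
The corrected check, mirroring the code context above:

```go
ctx, err := bindings.NewConnection(context.Background(), fmt.Sprintf("unix:%s", socketPath))
if err != nil {
	return err
}
```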

### Comment 2
<location> `pkg/podman/podman.go:108` </location>
<code_context>
+	if err := extractTar(pr, dir); err != nil {
+		return err
+	}
+	log.Debugf("Extracted disk at: %v", dir)
+	return nil
+}
</code_context>

<issue_to_address>
Dangling Podman container not cleaned up

Please remove the temporary container after exporting its filesystem to prevent resource leaks.
</issue_to_address>
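
A hedged sketch of the cleanup; the container ID variable (ctrID here) is hypothetical, and the containers.Remove signature assumes the podman v5 bindings:

```go
// Ensure the temporary export container is removed even on error paths.
// containers.Remove is from github.com/containers/podman/v5/pkg/bindings/containers.
defer func() {
	if _, err := containers.Remove(ctx, ctrID, nil); err != nil { // ctrID is an assumption
		log.Warnf("failed to remove temporary container %s: %v", ctrID, err)
	}
}()
```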

### Comment 3
<location> `pkg/podman/podman.go:226` </location>
<code_context>
+	if err != nil {
+		return fmt.Errorf("failed to wait for container: %w", err)
+	}
+	if exitCode != 0 {
+		fmt.Errorf("bootc command failed: %d", exitCode)
+	}
+
</code_context>

<issue_to_address>
Non-zero exit code error isn’t returned

The error should be returned to ensure failures are not ignored. Replace with `return fmt.Errorf(...)`.
</issue_to_address>
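
The fix is just adding the missing return:

```go
if exitCode != 0 {
	return fmt.Errorf("bootc command failed: %d", exitCode)
}
```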

### Comment 4
<location> `pkg/podman/podman.go:248` </location>
<code_context>
+		return fmt.Errorf("failed to start the bootc container: %v", err)
+	}
+
+	if err := fetchLogsAfterExit(ctx, name); err != nil {
+		return fmt.Errorf("failed executing bootc: %s %s: %v", err)
+	}
+
</code_context>

<issue_to_address>
Error message formatting has mismatched verbs and arguments

There are three placeholders in the format string, but only one argument is provided. Update the format string or supply the missing arguments to prevent a runtime panic.
</issue_to_address>
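
One possible fix, assuming only the wrapped error needs to be reported:

```go
if err := fetchLogsAfterExit(ctx, name); err != nil {
	return fmt.Errorf("failed executing bootc: %v", err)
}
```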

### Comment 5
<location> `cmd/install.go:145` </location>
<code_context>
+		OutputPath:           c.outputPath,
+		Root:                 false,
+	})
+	if c.installVM.Run(); err != nil {
+		return err
+	}
+
</code_context>

<issue_to_address>
Ignored error from InstallVM.Run due to incorrect if syntax

Use `if err := c.installVM.Run(); err != nil` to correctly capture and check the error from `Run()`. The current syntax ignores the returned error.
</issue_to_address>
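
The corrected form, as suggested:

```go
if err := c.installVM.Run(); err != nil {
	return err
}
```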

### Comment 6
<location> `cmd/install.go:105` </location>
<code_context>
+	if c.outputPath == "" {
+		return fmt.Errorf("the output-path needs to be set")
+	}
+	if c.configPath == "" {
+		return fmt.Errorf("the output-path needs to be set")
+	}
+	if c.containerStorage == "" {
</code_context>

<issue_to_address>
Error message references wrong flag for config-dir

The error message for `configPath` should mention `config-dir`, not `output-path`. Please correct the error text.
</issue_to_address>
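
A sketch of the corrected check (the exact error wording is an assumption):

```go
if c.configPath == "" {
	return fmt.Errorf("the config-dir needs to be set")
}
```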

### Comment 7
<location> `pkg/vm/installvm.go:129` </location>
<code_context>
+	}
+	defer dom.Free()
+	if err := dom.Destroy(); err != nil {
+		logrus.Warning("Failed to destroy the domain %s, maybe already stopped: %v", vm.domain, err)
+	}
+	if err := dom.Undefine(); err != nil {
</code_context>

<issue_to_address>
`logrus.Warning` used with format string but isn't formatted

Replace `logrus.Warning` with `logrus.Warningf` to correctly handle the format string.
</issue_to_address>
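
The corrected call:

```go
logrus.Warningf("Failed to destroy the domain %s, maybe already stopped: %v", vm.domain, err)
```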

### Comment 8
<location> `pkg/vm/domain/domain.go:220` </location>
<code_context>
+	args := []string{"info", imagePath, "--output", "json"}
+	cmd := exec.Command(path, args...)
+	logrus.Debugf("Execute: %s", cmd.String())
+	stderr, err := cmd.StderrPipe()
+	if err != nil {
+		return "", fmt.Errorf("failed to get stderr for qemu-img command: %v", err)
</code_context>

<issue_to_address>
Using `StderrPipe` with `cmd.Output` may deadlock or lose stderr

`cmd.Output()` should not be used with `cmd.StderrPipe()` as it can cause deadlocks or missing stderr output. Use `CombinedOutput()` or handle stdout and stderr manually with `cmd.Start()`.
</issue_to_address>
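
A minimal sketch of one safe alternative: capture stdout and stderr into separate buffers via cmd.Run, so the JSON on stdout stays parseable while stderr remains available for error reporting (the function name and error wording here are assumptions, not the PR's code):

```go
package domain

import (
	"bytes"
	"fmt"
	"os/exec"
)

// qemuImgInfoJSON runs `qemu-img info --output json` and returns the raw
// JSON document, keeping stderr separate so a failure can be reported
// without corrupting the stdout stream that gets parsed.
func qemuImgInfoJSON(path, imagePath string) (string, error) {
	cmd := exec.Command(path, "info", imagePath, "--output", "json")
	var stdout, stderr bytes.Buffer
	cmd.Stdout = &stdout
	cmd.Stderr = &stderr
	if err := cmd.Run(); err != nil {
		return "", fmt.Errorf("qemu-img failed: %v: %s", err, stderr.String())
	}
	return stdout.String(), nil
}
```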

### Comment 9
<location> `Makefile:26` </location>
<code_context>
 e2e_test: all
 	ginkgo -tags $(build_tags) ./test/...

+image:
+	podman build -t $(vm_image) --device /dev/kvm \
+	-f containerfiles/vm/Containerfile \
</code_context>

<issue_to_address>
Mark `image` as a phony target

Add `image` to `.PHONY` to prevent issues if a file named `image` exists.
</issue_to_address>

### Comment 10
<location> `pkg/vsock/proxy.go:121` </location>
<code_context>
+	}
+}
+
+func (proxy *Proxy) proxyFileToConn(ctx context.Context, file *os.File, conn net.Conn, errCh chan error) {
+	go func() {
+		_, err := io.Copy(conn, file)
</code_context>

<issue_to_address>
Consider merging the two proxy helper methods and related context logic into a single, symmetric io.Copy pattern to simplify the code.

You can collapse the two `proxyFileToConn`/`proxyConnToFile` helpers and the `context`/`select` machinery into a single, symmetric `io.Copy` pattern. This greatly reduces nesting and boilerplate while preserving cancellation via `p.done` in the accept loop.

1) Simplify Start/accept loop:

```go
func (p *Proxy) Start() error {
    _ = os.Remove(p.socket)
    ln, err := net.Listen("unix", p.socket)
    if err != nil {
        return fmt.Errorf("listen %q: %w", p.socket, err)
    }
    go func() {
        defer ln.Close()
        for {
            conn, err := ln.Accept()
            if err != nil {
                select {
                case <-p.done:
                    return
                default:
                    logrus.Warnf("accept error: %v", err)
                    continue
                }
            }
            go p.handleConnection(conn)
        }
    }()
    logrus.Debugf("Started proxy at: %s", p.socket)
    return nil
}
```

2) Flatten handleConnection with two `io.Copy` goroutines:

```go
func (p *Proxy) handleConnection(unixConn net.Conn) {
    defer unixConn.Close()

    fd, err := unix.Socket(unix.AF_VSOCK, unix.SOCK_STREAM, 0)
    if err != nil {
        logrus.Errorf("vsock socket error: %v", err)
        return
    }
    sa := &unix.SockaddrVM{CID: uint32(p.cid), Port: uint32(p.port)}
    if err := unix.Connect(fd, sa); err != nil {
        logrus.Errorf("vsock connect error: %v", err)
        _ = unix.Close(fd)
        return
    }
    vconn := os.NewFile(uintptr(fd), "vsock")
    if vconn == nil {
        logrus.Error("failed to wrap vsock fd")
        _ = unix.Close(fd)
        return
    }
    defer vconn.Close()

    // symmetrical copy
    errCh := make(chan error, 2)
    go func() { _, err := io.Copy(vconn, unixConn); errCh <- err }()
    go func() { _, err := io.Copy(unixConn, vconn); errCh <- err }()

    if err := <-errCh; err != nil && err != io.EOF {
        logrus.Errorf("proxy copy error: %v", err)
    }
}
```


3) Remove the unused `proxyFileToConn`, `proxyConnToFile` methods and the `context` imports. This preserves all behavior but slashes lines of code and cognitive overhead.
</issue_to_address>

### Comment 11
<location> `pkg/vm/domain/domain.go:89` </location>
<code_context>
+	}
+}
+
+func allocateDevices(d *libvirtxml.Domain) {
+	if d.Devices == nil {
+		d.Devices = &libvirtxml.DomainDeviceList{}
</code_context>

<issue_to_address>
Consider introducing a WithDevices helper to eliminate repeated device allocation and append logic in each With* function.

```go
// Add this helper to collapse the allocateDevices + append boilerplate:
func WithDevices(fn func(devs *libvirtxml.DomainDeviceList)) DomainOption {
    return func(d *libvirtxml.Domain) {
        if d.Devices == nil {
            d.Devices = &libvirtxml.DomainDeviceList{}
        }
        fn(d.Devices)
    }
}
```

Then refactor each With* to use it. For example, replace:

```go
func WithFilesystem(source, target string) DomainOption {
    return func(d *libvirtxml.Domain) {
        allocateDevices(d)
        d.Devices.Filesystems = append(d.Devices.Filesystems, libvirtxml.DomainFilesystem{
            Driver: &libvirtxml.DomainFilesystemDriver{Type: "virtiofs"},
            Source: &libvirtxml.DomainFilesystemSource{Mount: &libvirtxml.DomainFilesystemSourceMount{Dir: source}},
            Target: &libvirtxml.DomainFilesystemTarget{Dir: target},
        })
    }
}
```

with:

```go
func WithFilesystem(source, target string) DomainOption {
    return WithDevices(func(devs *libvirtxml.DomainDeviceList) {
        devs.Filesystems = append(devs.Filesystems, libvirtxml.DomainFilesystem{
            Driver: &libvirtxml.DomainFilesystemDriver{Type: "virtiofs"},
            Source: &libvirtxml.DomainFilesystemSource{Mount: &libvirtxml.DomainFilesystemSourceMount{Dir: source}},
            Target: &libvirtxml.DomainFilesystemTarget{Dir: target},
        })
    })
}
```

And similarly for WithDisk, WithSerialConsole, WithInterface, and WithVSOCK:

```go
func WithDisk(path, serial, dev string, diskType DiskDriverType, bus DiskBus) DomainOption {
    return WithDevices(func(devs *libvirtxml.DomainDeviceList) {
        devs.Disks = append(devs.Disks, libvirtxml.DomainDisk{
            Device: "disk",
            Driver: &libvirtxml.DomainDiskDriver{Name: "qemu", Type: diskType.String()},
            Source: &libvirtxml.DomainDiskSource{File: &libvirtxml.DomainDiskSourceFile{File: path}},
            Target: &libvirtxml.DomainDiskTarget{Bus: bus.String(), Dev: dev},
            Serial: serial,
        })
    })
}
```

This cuts out the repeated `allocateDevices(d)` and anonymous wrapper in each With* function, reducing boilerplate while preserving all behavior.
</issue_to_address>



@cgwalters (Contributor) commented

Provide make image target and Containerfile to build a Fedora-based VM container image with embedded disk and preconfigured services

Hmm so in this model we'd ship that container image at e.g. quay.io/fedora/fedora-bootc-<something> and this tool would remain as a native host binary?

Or do you see the fact that this is a host binary as just a transient state? (As I was arguing in my prototype of bootc-kit). The host binary vs container image is a really important decision that impacts a lot of things...

We could I guess to some degree try to do both...but it could get ugly.

BTW this also relates strongly to bootc-dev/bootc#1359 (comment) in that I think having e.g. tmt or other testing tools in a container image will definitely argue for supporting having this tool be in a container image too.

A use case we will definitely want to support is one where we are invoked as part of other tooling. For people that want to do things via containers I hope it will feel really natural to basically do FROM quay.io/fedora/fedora-bootc-kit or so. But yes maybe some flows will feel most natural with a host binary, and it may force us to do both.


func DefaultContainerStorage() string {
usr, err := user.Current()
if err == nil && usr.Uid != "0" {
Contributor

We should use podman system info --format=json; there may be an API for this? But I dunno I have no problem with fork/exec where it's easy personally.

Member

If you care to connect to the service using a binding, it is there, yeah. Else you could also consider podman info --format '{{.Store.GraphRoot}}'

"golang.org/x/sys/unix"
)

const podmanVMProxyDefault = "/tmp/podman-bootc-proxy.sock"
Contributor

Hopefully we can avoid a predictable filename in /tmp...probably via using /run?

Author

Yes, definitely

}

func WithOS() DomainOption {
// TODO: fix this for multiarch
Contributor

Can't we omit this and have libvirt pick a default?

@@ -0,0 +1,38 @@
FROM quay.io/fedora/fedora:42 as builder

ENV URL https://download.fedoraproject.org/pub/fedora/linux/releases/42/Cloud/x86_64/images
Contributor

We had a live chat about this before but: the more I think about this the stronger I feel that in the general case we really do need to use the target kernel to install.

It's the only way we can handle skew where only the target kernel knows how to do a particular filesystem type. Yeah today the Fedora kernel enables most filesystem types, but say (before bcachefs was merged) someone wanted to use that for their systems. With this model they'd have to override both the build environment container and the target container image.

Further, I think we can still run into issues where (unless one is careful with filesystem feature flags) this kernel may use filesystem feature flags not enabled by default in the target.

So I think the path we should go down here is where we do a direct kernel boot w/qemu, fetching the kernel from the target bootc image (which should always be in /usr/lib/modules/; we could add a helper command to extract that)

And if we're using the target kernel, there's going to be skew unless we use the target userspace too...

And so yes this sounds like a "build a hammer to make a hammer" problem but I think we can get out of that by building on the direct kernel boot above and then running virtiofsd against the target container's merged root.

In other words we create a workflow "run a bootc container as an ephemeral VM w/virtiofs root" basically.


I think it'd be a big advantage to not need to maintain (and have users keep updated) this "builder VM" image.

@cgwalters (Contributor) commented Jun 16, 2025

In other words we create a workflow "run a bootc container as an ephemeral VM w/virtiofs root" basically.

Which would be super useful as a general thing (especially for tests). This overlaps a lot with osbuild/image-builder-cli#189

I wonder actually if it's core enough that we could put the functionality for this directly...in podman perhaps? Really today even on Linux podman already links in a ton of code to do qemu/virtiofsd.

Doing some searching around this...I had forgotten about https://github.com/containers/crun-vm ...so much going on in this space of VMs and containers and bootc! I now feel bad for basically never trying to add that one into my "inner loop".

I tried out that project right now and none of the examples work for me (not the containerdisk one or the bootc one)...but I think we can make it work. EDIT: apparently I was running as root and that doesn't work. It's pretty good Rust code...

I think especially it's already in a place where it'd be well positioned to at least do this ephemeral VM w/virtiofs root approach.

Contributor

To say this a different way, both this PR and krun (eventually via libkrunfw) would have the problem that if we wanted to productize it for RHEL, we'd need to change this so we aren't trying to ship a different kernel (and userspace!) build.

@alicefr (Author) commented Jun 17, 2025

Thanks Colin for the review. Those are really valid points. We could use podman create + mount to assemble the bootc container filesystem, and then boot from the artifacts in the merged directory. The current issue I'm facing is that, for rootless podman, mount requires unshare, and then libvirt cannot find the directory with the kernel, initrd and root since it runs in a different mount namespace.
We could extract the entire bootc image filesystem, but it is quite large to do that each time, and I'd like to find a smarter solution if possible.
Extracting the kernel and the initrd is pretty straightforward, but we are always missing the rootfs.

Author

@cgwalters An alternative can be to use image volumes. They are already supported by podman and kubernetes, so eventually we could use them there as well.
In this way, the bootc image can be made available to any other container, including the one with the sdk:
Example:

$ podman volume create --driver image --opt image=quay.io/centos-bootc/centos-bootc:stream9 bootc-vol
$ podman run -ti --rm -v bootc-vol:/bootc-data  quay.io/fedora/fedora:latest ls /bootc-data/lib/modules/5.14.0-590.el9.x86_64/{vmlinuz,initramfs.img}
/bootc-data/lib/modules/5.14.0-590.el9.x86_64/initramfs.img  /bootc-data/lib/modules/5.14.0-590.el9.x86_64/vmlinuz

I think this is a pretty elegant approach to keep things flexible based on the desired bootc image.

However, this implies that libvirt needs to have access to the bootc volume, hence probably running in the same container as the sdk, or in a second container using the same volume.

Author

Doing some searching around this...I had forgotten about https://github.com/containers/crun-vm ...so much going on in this space of VMs and containers and bootc! I now feel bad for basically never trying to add that one into my "inner loop".

The problem with crun-vm is that libvirt is still running on the host, not inside the container. First, it won't have access to the image volume. Second, it also needs to be installed on the target host, while if we ship everything in a container image (including libvirt), then it doesn't need to be available on the target host.
The second point might not be strictly necessary if we only plan to deploy this on a local laptop, but it matters if we use the sdk image in a distributed CI system.

Contributor

An alternative can be to use the image volumes

Yes. Though note it's also possible from the CLI and API to skip the explicit volume create step:

$ podman run --rm -ti --mount=type=image,source=quay.io/fedora/fedora-bootc:42,target=/target busybox ls -al /target/usr/lib/modules
total 0
drwxr-xr-x    3 root     root            36 Jun 11 03:47 .
drwxr-xr-x   41 root     root           153 Jun 11 03:47 ..
drwxr-xr-x    7 root     root            27 Jun 11 03:47 6.14.9-300.fc42.x86_64
$

However, this implies that libvirt needs to have access to the bootc volume, hence, probably running in the same container of the sdk, or a second container using the same volume.

Right, in the general case (and relating to your previous comment) I think we probably need to handle three different cases:

Running on Linux, unprivileged podman

Most of the time this is a Linux desktop. I think it's fine to assume some software gets installed on the host, e.g. qemu, and we can assume the host has e.g. libvirt (at least optionally, for persistent pet machines).

Running on Linux, root in e.g. Github Action runner (w/nested KVM available)

This is basically the "headless CI job" case of the above. Of course it could also be non-root.

Running in Kubernetes with /dev/kvm mounted in (current coreos-assembler flow)

Here we must have all userspace (e.g. qemu) in the sdk image. We can probably assume https://kubernetes.io/docs/tasks/configure-pod-container/image-volumes/ is available.


I think it's a bit of an open question if the middle case looks more like the first or the last. For GHA, I think it's totally sane to have a setup phase which does apt install libvirt etc.

Author

For the third case, in kubernetes, we will need a device plugin which allocates /dev/kvm for the container. Otherwise, it will require running the pod as privileged and using a hostPath volume.

Contributor

For the third case, in kubernetes, we will need a device plugin which allocates /dev/kvm for the container.

Yeah I made https://github.com/cgwalters/kvm-device-plugin

That said, I did that because at the time the people running the cluster didn't want to scope in installing KubeVirt. But that's long in the past, and maybe it makes sense to just require KubeVirt, even if the use case is "parallel" to it.

Author

Yes, KubeVirt would solve the problem. They should also have support for image volumes now, but I need to double check; it was an ongoing effort. They support direct boot from a containerdisk; the part I'm not sure about is how to pass the rootfs with virtiofs from an image volume. This part might still be missing.

@alicefr (Author) commented Jun 17, 2025

@germag @cgwalters after the discussion in #95 (comment), I'm not sure if podman-bootc is the right place for this.

The main reason I chose it is that it is written in Go and already has the concept of running a bootc VM. I think Go has the advantage over Rust that we have libvirtxml, and the libvirt and podman bindings, which simplify the integration quite a bit.

The command is pretty self-contained, so it can be moved to a standalone tool.

@cgwalters (Contributor) commented

@germag @cgwalters after the discussion in #95 (comment) , not sure if podman-bootc is the right place for this.

Well...I think an outcome of the work here is definitely that we have a "podman-bootc successor", and there's some useful code here for sure. For example #58 is still relevant too.

I'd just say again I think an issue is that the scope of what we need from "sdk" is a lot larger than what this project does today (and the above wrapper for b-i-b is taking it beyond what either "podman" is (e.g. uploading AMIs) or what bootc is itself); also adding things like the "install with anaconda" wrapper flow or cloud provisioning too. That said what may be really nice is to have a "core sdk" which only deals with operating system independent things (e.g. only knows about bootc, podman, systemd, libvirt say) that lives at...ghcr.io/bootc-dev/kit e.g. Then things like bootc-image-builder and Anaconda (and knowledge of cloud images) are currently Fedora derivative specific, but those can be wrapped up in a higher level quay.io/fedora/fedora-bootc-sdk.


Anyways though procedurally I don't see a problem in landing code here. Or we could fork it if that's easier.

This is all definitely a very complex topic because this project has the gravity of being a native binary today, but today bootc-image-builder comes as a container (and PRs like the above need to deal with skew between them), and then other things (exactly like the cloud image container added here) also come as a container, and I think things will be simplest for us in the longer term if we basically always have the SDK logic run in a container (like coreos-assembler), with just an optional shim binary that can be installed on the host as a convenient shortcut.

}
usr, err := user.Current()
if err == nil && usr.Uid != "0" {
return "/run/user/" + usr.Uid + "/podman/podman.sock"
Member

this can be had from podman info or via the bindings

@alicefr (Author) commented Jun 19, 2025

@cgwalters Based on the recent discussions I tried something different. This code doesn't reflect my experiments yet, but they are available at https://github.com/alicefr/bootc-appliance-demo/tree/dynamic-boot.

Since we want to use the target bootc image as the installation environment, I have switched to direct boot using the bootc image as the rootfs.
Now, the steps are the following:

  1. podman-bootc should take the desired bootc image and build on top of it a layer with the dependencies required for the bootc installation setup. Basically, this Containerfile. The base image is configurable, and then we add additional tooling on top.
  2. We create a container with libvirt inside and an image volume. In this case, the image volume isn't directly the bootc image, but the bootc image plus the dependencies we added in step 1
  3. We boot the installation VM with podman remote configured, and we can perform the bootc installation from inside the VM

Please note that the VM has access to the host container storage, so we can use the original bootc image to perform the installation.

This picture summarizes the steps visually:
[diagram: podman-bootc]

I think this setup facilitates the 3 scenarios since we have already containerized the virt stack. We can always move podman-bootc and podman inside their own container. We control podman remotely through a socket, so this can easily be shared with a volume in case we decide to deploy it separately.
Similarly, we can always put the container storage on a shared volume between the podman-bootc container and the virt container, or put everything in the same container.

@alicefr (Author) commented Jun 19, 2025

@cgwalters please let me know what you think, and I can translate the scripts in my repo into Go code and update this PR accordingly

@cgwalters (Contributor) commented

That picture is cool 😄
I will take a look at what you posted (and am definitely curious for feedback from others).

One point I'll make is I'd like to not "stop energy" ourselves, I'd love to get something useful here even if it's not ideal and we can always iterate from there - perfection being the enemy of the good etc. So to that point, from my PoV I'd say if you have something even partially working here I think we could really consider pretty quickly merging it into podman-bootc git main (as long as we're not breaking the functionality that exists now).

@cgwalters (Contributor) commented

One thing I'll say here is there are two competing things in this install phase:

  • bootc install to-disk (as we're using it today) is extremely simple and requires zero external infrastructure besides the target container image itself, but is also currently only for "toy" scenarios
  • We'll want to support the other (three!) installation methods of Anaconda, bootc-image-builder, and reinstalling a cloud image at some point too

One thing I have definitely been thinking about more is whether we could push more for using systemd repart.d as a way to make the first case less "toy"; it already very cleanly supports embedding partitioning inside the OS image itself.

Anyways, let's keep up with using to-disk for now, but leave conceptual space for the latter.

@cgwalters (Contributor) commented

First we should default to taking the target image as an argument, I did

diff --git i/demo.sh w/demo.sh
index 3cc140e..83a8cc3 100755
--- i/demo.sh
+++ w/demo.sh
@@ -1,5 +1,14 @@
 #!/bin/bash
 
+# check argument count
+if [ $# -lt 1 ]; then
+       echo "Usage: $0 <image>"
+       exit 1
+fi
+
+IMAGE=$1
+shift
+
 NAME=bootc-build
 DISK=$(pwd)/disk.img
 # Directory where the bootc artifacts will be stored
@@ -11,7 +20,6 @@ STORAGE_DIR=$HOME/.local/share/containers/storage
 
 # Directory where to store the configuration for bootc
 CONFIG_DIR=$(pwd)/config
-BOOTC_IMAGE=demo:latest
 
 CID=3
 VMPORT=1234
@@ -27,8 +35,6 @@ set -xe
 mkdir -p $STORAGE_DIR
 mkdir -p $OUTPUT_DIR
 
-podman build -t $BOOTC_IMAGE -f Containerfile.example .
-
 # The output image where we install the OS with bootc
 rm -f $OUTPUT
 qemu-img create -f qcow2 $OUTPUT 10G

podman-bootc should take the desired bootc image and built on top a layer with the dependencies required for the bootc installation setup. Basically, this Containerfile. The base image is configurable and then we add additional tooling on top.

I don't think we need to do a build of a new image (for that we'd need to consider GC, concurrency issue) - we can just run the target container image and inject that configuration dynamically on top as a command right? Think of it like a Kubernetes pod with an injected entrypoint shell script.

This code doesn't reflect my experiments yet, but they are avaialble at https://github.com/alicefr/bootc-appliance-demo/tree/dynamic-boot.

To be clear, is this vm-disk a transient state then?

@cgwalters (Contributor) commented

I think a good way to describe a target goal to start is: "The functionality of podman-bootc but without the requirement for podman machine on Linux, which we handle by creating a transient lightweight bootstrap VM using the target container image itself" or so.

Maybe I was too aggressive in saying though that this tool should come as a container image itself. A native host binary may actually be a lot more ergonomic in the end, or again we may in the limit need to do both to some degree.

@alicefr (Author) commented Jun 20, 2025

First we should default to taking the target image as an argument, I did


podman-bootc should take the desired bootc image and built on top a layer with the dependencies required for the bootc installation setup. Basically, this Containerfile. The base image is configurable and then we add additional tooling on top.

I don't think we need to do a build of a new image (for that we'd need to consider GC, concurrency issue) - we can just run the target container image and inject that configuration dynamically on top as a command right? Think of it like a Kubernetes pod with an injected entrypoint shell script.

Sure, what can be done in the additional layer can be done dynamically. What I like about splitting the VM configuration and the booting is that we can cache the intermediate image/layer with the dependencies. For example, if we want to do multiple builds from the same image, or something fails in between, then we can start from this layer. Additionally, it works well with image volumes.

This code doesn't reflect my experiments yet, but they are avaialble at https://github.com/alicefr/bootc-appliance-demo/tree/dynamic-boot.

To be clear then is this vm-disk a transient state then?

Yes, it is a layer for caching the installation VM configuration steps generated dynamically on top of the bootc image.

@alicefr (Author) commented Jun 20, 2025

I think a good way to describe a target goal to start is: "The functionality of podman-bootc but without the requirement for podman machine on Linux, which we handle by creating a transient lightweight bootstrap VM using the target container image itself" or so.

Maybe I was too aggressive in saying though that this tool should come as a container image itself. A native host binary may actually be a lot more ergonomic in the end, or again we may in the limit need to do both to some degree.

I think if we keep the virtualization stack in the container in all the scenarios, and keep podman-bootc and podman on the host, then it becomes very easy to containerize them if necessary, since it basically becomes a podman run inside a container.

@jlebon commented Jun 20, 2025

Had to do

diff --git a/demo.sh b/demo.sh
index 126ec81..8967021 100755
--- a/demo.sh
+++ b/demo.sh
@@ -38,7 +38,6 @@ podman run -td --name libvirt \
 	-v /dev/kvm:/dev/kvm \
 	-v /dev/vhost-net:/dev/vhost-net \
 	-v /dev/vhost-vsock:/dev/vhost-vsock \
-	-v /dev/vsock:/dev/vsock \
 	-v $OUTPUT_DIR:/usr/lib/bootc/output \
 	-v $CONFIG_DIR:/usr/lib/bootc/config \
 	-v $CONT_STORAGE:/usr/lib/bootc/container_storage \

But otherwise it worked well.

Nice job!

Some feedback:

I love the idea of booting a VM right off of the container image. Almost feels like that flow could itself be the main experience in a developer's iteration cycle? It doesn't seem very different from what you'd get if you were to build live PXE artifacts from your container image and boot from that. What's really interesting though is that it defers the more expensive "create disk image" path until you need it (and sure, for some developers, maybe you will almost always need that depending on what you're hacking, but for end users wanting to iterate on their configuration bits, it's good enough while providing more fidelity than testing as a container image).

Sure, what can be done in the additional layer, can be done dynamically. The thing I like in splitting the VM configuration and the booting is that we can cache the intermediate image/layer with the dependencies. For example, if want to do multiple builds from the same image or something fails in between, then we can start from this layer.

Hmm, looking at the Containerfile, it's mostly writing files so it doesn't seem worth caching. The exception is socat, though maybe we could fold that functionality into some hidden bootc command? Would definitely be nice to get rid of that build step.

@jlebon commented Jun 20, 2025

On the containerize vs native discussion, when you think about the building and testing-as-a-container cycle locally, all you need is podman. I think it's a compelling story if the additional functionality we build here (testing-as-a-VM, testing via Anaconda, etc...) doesn't expand that requirement. And yeah, as mentioned that plays itself out in instrumenting this in CI/k8s too.

Relatedly, to repeat something from #28, I find the "run a bootable OCI image as a VM with a single podman command" story really powerful. I still use my code from there even now even though it's not as efficient. Not saying it should be a hard requirement or anything, but it deserves some weight I think in the design.

@alicefr (Author) commented Jun 23, 2025

Had to do


Ah, yes nice catch!

Sure, what can be done in the additional layer, can be done dynamically. The thing I like in splitting the VM configuration and the booting is that we can cache the intermediate image/layer with the dependencies. For example, if want to do multiple builds from the same image or something fails in between, then we can start from this layer.

Hmm, looking at the Containerfile, it's mostly writing files so it doesn't seem worth caching. The exception is socat, though maybe we could fold that functionality into some hidden bootc command? Would definitely be nice to get rid of that build step.

Socat can be replaced by a tiny binary which does the proxying. It can be statically compiled and eventually injected dynamically, so it isn't a hard requirement. Ideally, we could also extend podman to support the vsock protocol, and then the proxy wouldn't be required anymore.
The build also creates some symlinks for the systemd services, but that is something that can be done before starting the VM. The password and user creation was mostly for debugging purposes, but it can be skipped.

So, AFAIU, you would prefer to skip the intermediate image and inject the pre-steps every time we start the container?

@alicefr (Author) commented Jun 23, 2025

On the containerize vs native discussion, when you think about the building and testing-as-a-container cycle locally, all you need is podman. I think it's a compelling story if the additional functionality we build here (testing-as-a-VM, testing via Anaconda, etc...) doesn't expand that requirement. And yeah, as mentioned that plays itself out in instrumenting this in CI/k8s too.

Relatedly, to repeat something from #28, I find the "run a bootable OCI image as a VM with a single podman command" story really powerful. I still use my code from there even now even though it's not as efficient. Not saying it should be a hard requirement or anything, but it deserves some weight I think in the design.

I think, since we have moved the virt stack inside the container, the only parts that remain on the host are the client libraries to talk to libvirt and podman remote. So this command can now easily be integrated into podman or be containerized. The requirements for this setup are:

  • Having access/share the container storage between podman-bootc and the container
  • The VM Podman socket (the proxy between vsock -> unix socket)
  • The libvirt socket
  • The devices like kvm, vsock and vhost-vsock

@cgwalters (Contributor) commented

Relatedly, to repeat something from #28, I find the "run a bootable OCI image as a VM with a single podman command" story really powerful. I still use my code from there even now even though it's not as efficient. Not saying it should be a hard requirement or anything, but it deserves some weight I think in the design.

Right, sorry I forgot about that PR in the mix of all the things here. Thanks so much for starting that and agree it should carry a lot of weight here!

@cgwalters (Contributor) commented

The devices like kvm, vsock and vhost-vsock

BTW I think --device /dev/kvm is preferred over -v /dev/kvm because the former will ensure that if e.g. device cgroups are in use that they're updated to allow it.

@alicefr (Author) commented Jun 24, 2025

@jlebon @cgwalters so you are both against the intermediate image layer? It's just to know how to proceed, so that I can translate into code what I have in my demo repo.

@cgwalters
Contributor

@jlebon @cgwalters so you are both against the intermediate image layer? I'm just asking to know how to proceed, so that I can translate what I have in my demo repo into code.

I guess as a general rule what I'd say is we should as much as possible avoid side effects on other global state. The intermediate image layer is just one example of that. Another is the podman system connection add in the demo script - that currently leaks after a disk image is complete too.

I think it will be a lot cleaner without that intermediate layer, but we should not let perfection be the enemy of the good - so if it helps, please do just do that in the initial translation! There are so many shell scripts and such in this area with various suboptimal tradeoffs that I don't think it's a super high bar to clear to do better.

Wait, sorry though - backing up to a higher level, where are we with having src == target (ref the previous thread)? That one I'm more concerned about; for some reason I had thought we'd resolved that, but the current demo.sh is bootstrapping via Fedora Cloud. WDYT about the idea of bootstrapping via a virtiofs root? Another alternative I can see here is supporting a to-disk flow that writes e.g. an erofs directly (without any privileges required).

@alicefr
Copy link
Author

alicefr commented Jun 25, 2025

Thanks for the review. I will try to avoid the intermediate layer and reduce the shell scripts; a lot of them can be converted into SDK code.

@cgwalters
Contributor

@alicefr how's it going? This issue is pretty near the top of my list, anything I can help with?

@cgwalters
Contributor

btw @jlebon this "build using kernel from target" is among the big reasons why I think what we do in coreos-assembler today should also be replaced by this. Doing anything else really just causes "host contamination" (especially in a world where coreos-assembler is always using Fedora content)

@alicefr
Author

alicefr commented Jul 1, 2025

@cgwalters I'm finishing the second version; sorry, I'm splitting my time and not very quick on this. I hope I'm able to push the new version today.

alicefr added 7 commits July 2, 2025 09:03
The VM image contains the virtualization stack to launch the virtual
machine and the files to compose the VM configuration in the bootc-data
volume.

The entrypoint prepares the bootc-data volume, so that the VM can boot
from the bootc image mounted at that location, as well as the
configuration required by the installation.

The systemd services mount the virtiofs targets, while the
podman-vsock-proxy starts the proxy from VSOCK to the local unix socket
for podman.

The virtiofs-wrapper script is required in order to add extra options
when virtiofs is launched by libvirt. This is necessary to launch
virtiofs correctly inside an unprivileged container.

Signed-off-by: Alice Frosi <[email protected]>
The podman package abstracts the methods for interacting with podman.
It mainly contains the methods to launch the VM container with libvirt
and QEMU, and to create the remote container for bootc inside the VM.

Signed-off-by: Alice Frosi <[email protected]>
The proxy translates between the unix socket and the vsock, in both directions:
   $ vsock-proxy --log-level debug -s /run/podman/podman-vm.sock -p 1234 \
      --cid 3 --listen-mode unixToVsock

Based on the listen-mode flag, it decides which side to listen on. In
the above example, for each connection to the unix socket specified by
-s, it connects to the VM with CID 3 on port 1234.

Signed-off-by: Alice Frosi <[email protected]>
Go doesn't offer a built-in generic helper to convert a value into a
pointer to it. Such a method is practical for the podman bindings, where
several fields are pointers; without it, you would need to define a
variable and then pass its address. E.g. for booleans:
   foo := utils.Ptr(true)
instead of
   t := true
   foo := &t

Signed-off-by: Alice Frosi <[email protected]>
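
With Go 1.18+ generics, the helper described above is presumably just a one-liner along these lines:

package utils

// Ptr returns a pointer to (a copy of) v, saving the caller from
// declaring a temporary variable just to take its address.
func Ptr[T any](v T) *T {
	return &v
}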
The domain package helps to abstract and build a libvirt domain.

Signed-off-by: Alice Frosi <[email protected]>
The installation VM is a temporary VM used for running privileged
commands while using rootless podman on the host.
This VM boots from the kernel and initrd available in the bootc image.
As its rootfs, it uses the root filesystem of the bootc container image,
shared via virtiofs.

Signed-off-by: Alice Frosi <[email protected]>
The install command runs bootc install inside the installation VM. In
this way, it is possible to run privileged commands while using rootless
podman.

The install command requires specifying:
  - the bootc image to use for the installation
  - the configuration directory where to find the config.toml for
    configuring the output image
  - the container storage directory; this is passed up to the remote
    container in order to perform the bootc installation using the
    container image
  - the output directory where to locate the build artifacts
  - the name of the output disk image

Example:
$ podman-bootc install --bootc-image quay.io/centos-bootc/centos-bootc:stream9 \
   --output-dir $(pwd)/output --output-image output.qcow2  --config-dir $(pwd)/config \
   -- bootc install to-disk /dev/disk/by-id/virtio-output --wipe

Signed-off-by: Alice Frosi <[email protected]>
@alicefr
Author

alicefr commented Jul 2, 2025

@cgwalters @germag PTAL. The new version doesn't include the intermediate image layer anymore. It boots the VM from the bootc image, and the container entrypoint prepares the configuration for the installation. If you like the current approach, I can add the unit tests and clean up the code a bit further. Thanks!

@jlebon

jlebon commented Jul 3, 2025

btw @jlebon this "build using kernel from target" is among the big reasons why I think what we do in coreos-assembler today should also be replaced by this. Doing anything else really just causes "host contamination" (especially in a world where coreos-assembler is always using Fedora content)

We do nowadays always use the target userspace to create disk images (well, technically for RHCOS only, but it matters less for FCOS anyway since it's already Fedora). So e.g. the mkfs.ext4 comes from RHEL, which I think is the most important part, right? AFAIK the compatibility issues we've hit in the past with e.g. filesystem feature flags have stemmed from that difference rather than from the actual filesystem kernel code involved post-mkfs.

@cgwalters
Contributor

AFAIK the compatibility issues we've hit in the past with e.g. filesystem feature flags have stemmed from that difference rather than from the actual filesystem kernel code involved post-mkfs.

This is true, but there's still a huge possibility of skew, I think.

@cgwalters
Contributor

Thanks, I took a quick try at running the code and got Error: no such image: quay.io/containers/bootc-vm:latest: image not known - but we are trying to drop that right?

I tried this

git diff --cached
diff --git c/containerfiles/vm/entrypoint.sh i/pkg/podman/entrypoint.sh
similarity index 100%
rename from containerfiles/vm/entrypoint.sh
rename to pkg/podman/entrypoint.sh
diff --git c/pkg/podman/podman.go i/pkg/podman/podman.go
index 7425216..0f63360 100644
--- c/pkg/podman/podman.go
+++ i/pkg/podman/podman.go
@@ -144,6 +144,7 @@ func (c *VMContainer) Run() error {
        if err != nil {
                return err
        }
+       log.Debugf("Created container %s", c.contID)
 
        if err := containers.Start(ctx, c.contID, &containers.StartOptions{}); err != nil {
                return fmt.Errorf("failed to start the bootc container: %v", err)
@@ -177,17 +178,21 @@ func pullImage(ctx context.Context, image string) error {
        return nil
 }
 
+//go:embed entrypoint.sh
+var entrypoint string
+
 func createVMContainer(ctx context.Context, image string, opts *RunVMContainerOptions) (string, error) {
        if err := pullImage(ctx, image); err != nil {
                return "", err
        }
+
        specGen := &specgen.SpecGenerator{
                ContainerBasicConfig: specgen.ContainerBasicConfig{
-                       Command: []string{"/entrypoint.sh"},
+                       Command: []string{"/bin/sh", "-c", entrypoint},
                        Stdin:   utils.Ptr(true),
                },
                ContainerStorageConfig: specgen.ContainerStorageConfig{
-                       Image: vm.VMImage,
+                       Image: image,
                        ImageVolumes: []*specgen.ImageVolume{
                                {
                                        Destination: vm.BootcDir,

Which got me a little farther to an error about a stopped container; haven't debugged that yet.

@cgwalters
Contributor

OK, I think I see the overall issue here. There's a large tension here over whether we want the virt stack on the host or in a container image.

When I chatted with @germag yesterday, he brought up the pain point that shipping CLI tools as container images makes integration with the shell/system annoying.

So yes, maybe it's less painful to just require virt tools on the host. The binary already links to libvirt.so, so it's not really portable, and we may as well just require host tools.

I think it'd still be good to support running in a container image with virt tools embedded, but I eventually see this tooling being embedded in larger container images (e.g. coreos-assembler), so us having a prebuilt container image is of less value. (Though it's nonzero.)

@cgwalters
Contributor

Here are two patches I wrote while debugging so far

From 7690ea5df560c23ad6876ec010a5b4fab74c49f8 Mon Sep 17 00:00:00 2001
From: Colin Walters <[email protected]>
Date: Tue, 8 Jul 2025 21:39:22 -0400
Subject: [PATCH 1/2] check for entrypoint errors

---
 pkg/podman/podman.go | 43 +++++++++----------------------------------
 1 file changed, 9 insertions(+), 34 deletions(-)

diff --git a/pkg/podman/podman.go b/pkg/podman/podman.go
index 7425216..0aae8b6 100644
--- a/pkg/podman/podman.go
+++ b/pkg/podman/podman.go
@@ -4,8 +4,8 @@ import (
 	"bytes"
 	"context"
 	"fmt"
-	"io"
 	"os"
+	"os/exec"
 	"os/user"
 	"path/filepath"
 	"strings"
@@ -19,12 +19,10 @@ import (
 	log "github.com/sirupsen/logrus"
 
 	"github.com/containers/podman/v5/libpod/define"
-	"github.com/containers/podman/v5/pkg/api/handlers"
 	"github.com/containers/podman/v5/pkg/bindings"
 	"github.com/containers/podman/v5/pkg/bindings/containers"
 	"github.com/containers/podman/v5/pkg/bindings/images"
 	"github.com/containers/podman/v5/pkg/specgen"
-	"github.com/docker/docker/api/types"
 )
 
 type RunVMContainerOptions struct {
@@ -47,37 +45,14 @@ type VMContainer struct {
 }
 
 func ExecInContainer(ctx context.Context, containerID string, cmd []string) (string, error) {
-	execCreateOptions := &handlers.ExecCreateConfig{
-		ExecConfig: types.ExecConfig{
-			Tty:          true,
-			AttachStdin:  true,
-			AttachStderr: true,
-			AttachStdout: true,
-			Cmd:          cmd,
-		},
-	}
-	execID, err := containers.ExecCreate(ctx, containerID, execCreateOptions)
-	if err != nil {
-		return "", fmt.Errorf("exec create failed: %w", err)
-	}
-	// Prepare streams
-	var stdoutBuf, stderrBuf bytes.Buffer
-	var stdout io.Writer = &stdoutBuf
-	var stderr io.Writer = &stderrBuf
-	// Start exec and attach
-	err = containers.ExecStartAndAttach(ctx, execID, &containers.ExecStartAndAttachOptions{
-		OutputStream: &stdout,
-		ErrorStream:  &stderr,
-		AttachOutput: utils.Ptr(true),
-		AttachError:  utils.Ptr(true),
-	})
-	if err != nil {
-		return "", fmt.Errorf("exec start failed: %w", err)
-	}
-
-	// Handle output and errors
-	if stderrBuf.Len() > 0 {
-		return "", fmt.Errorf("stderr: %s", stderrBuf.String())
+	c := exec.Command("podman", "exec", containerID)
+	c.Args = append(c.Args, cmd...)
+	stdoutBuf := bytes.Buffer{}
+	stderrBuf := bytes.Buffer{}
+	c.Stdout = &stdoutBuf
+	c.Stderr = &stderrBuf
+	if err := c.Run(); err != nil {
+		return "", fmt.Errorf("exec failed: %w %s", err, stderrBuf.String())
 	}
 
 	return stdoutBuf.String(), nil
-- 
2.49.0


From ed02814aa3c7c0952e0c23a7bcc3e771e34e9ca7 Mon Sep 17 00:00:00 2001
From: Colin Walters <[email protected]>
Date: Tue, 8 Jul 2025 21:39:34 -0400
Subject: [PATCH 2/2] Only pull if not present

Matches the podman default
---
 pkg/podman/podman.go | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/pkg/podman/podman.go b/pkg/podman/podman.go
index 0aae8b6..c692aba 100644
--- a/pkg/podman/podman.go
+++ b/pkg/podman/podman.go
@@ -145,6 +145,15 @@ func isContainerRunning(ctx context.Context, name string) (bool, error) {
 }
 
 func pullImage(ctx context.Context, image string) error {
+	// Only pull if not present
+	exists, err := images.Exists(ctx, image, nil)
+	if err != nil {
+		return fmt.Errorf("failed to check if image exists: %w", err)
+	}
+	if exists {
+		log.Debugf("Image %s already exists, skipping pull.", image)
+		return nil
+	}
 	if _, err := images.Pull(ctx, image, &images.PullOptions{}); err != nil {
 		return fmt.Errorf("failed to pull image %s: %w", image, err)
 	}
-- 
2.49.0


@cgwalters
Contributor

With the first patch, it's seemingly quite involved to do an exec and check for errors from the child process...what I was getting is that the exec failed but we just stumbled on.

Related to this, previously: #61

That took me a long time to debug and I think that issue will come back if we keep trying to use the exec/attach API here. Yeah, forking processes feels unclean sometimes, but...

@alicefr
Author

alicefr commented Jul 11, 2025

OK, I think I see the overall issue here. There's a large tension here over whether we want the virt stack on the host or in a container image.

Before doing a further version of this code, I would really like to clarify this point.
Please keep in mind that if we don't want to run libvirt inside the container, we somehow need to assemble the bootc image filesystem and make it available to libvirt and QEMU in a way that they can boot from it. Podman can mount the container filesystem, but for rootless only with unshare, and then it is again problematic for libvirt on the host to access the bootc root filesystem.

If we run everything in a container, we avoid this problem completely.

When I chatted with @germag yesterday, he brought up the pain point that shipping CLI tools as container images makes integration with the shell/system annoying.

Not sure I really understand this point; what are the issues here?

So yes, maybe it's less painful to just require virt tools on the host. The binary already links to libvirt.so, so it's not really portable, and we may as well just require host tools.

True, we can also containerize it.

I think it'd still be good to support running in a container image with virt tools embedded, but I eventually see this tooling being embedded in larger container images (e.g. coreos-assembler), so us having a prebuilt container image is of less value. (Though it's nonzero.)

It is always possible to move podman-bootc into a container. The only thing it requires is the ability to control the podman instance on the host, and this is more or less how docker-in-docker works.
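
A minimal sketch of that direction, assuming the host podman API socket is bind-mounted into the container at /run/podman/podman.sock (both paths are illustrative):

package main

import (
	"context"
	"fmt"

	"github.com/containers/podman/v5/pkg/bindings"
	"github.com/containers/podman/v5/pkg/bindings/images"
)

func main() {
	// Talk to the host podman through its API socket, which the
	// container was started with (much like docker-in-docker setups).
	ctx, err := bindings.NewConnection(context.Background(),
		"unix:///run/podman/podman.sock")
	if err != nil {
		panic(err)
	}
	exists, err := images.Exists(ctx, "quay.io/fedora/fedora-bootc:42", nil)
	if err != nil {
		panic(err)
	}
	fmt.Println("image present on host:", exists)
}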

@alicefr
Author

alicefr commented Jul 11, 2025

With the first patch, it's seemingly quite involved to do an exec and check for errors from the child process...what I was getting is that the exec failed but we just stumbled on.

Related to this, previously: #61

That took me a long time to debug and I think that issue will come back if we keep trying to use the exec/attach API here. Yeah, forking processes feels unclean sometimes, but...

I will take a look; sorry for the bug.

@alicefr
Author

alicefr commented Jul 11, 2025

@cgwalters @germag the other point I would really like to address is that we should simplify the command line. As a first draft, I'm passing the full bootc cmdline, but this is wrong IMO. Certain parts of the command line are fixed, like the paths and how the device shows up inside the guest. This part requires further tuning as well.

@alicefr
Author

alicefr commented Jul 11, 2025

With the first patch, it's seemingly quite involved to do an exec and check for errors from the child process...what I was getting is that the exec failed but we just stumbled on.

Related to this, previously: #61

That took me a long time to debug and I think that issue will come back if we keep trying to use the exec/attach API here. Yeah, forking processes feels unclean sometimes, but...

@cgwalters I'm not a big fan of calling podman directly, especially if we want to containerize this. Otherwise, it will also require having podman installed, while we could just use podman remote.

@alicefr
Author

alicefr commented Jul 11, 2025

 func pullImage(ctx context.Context, image string) error {
+	// Only pull if not present
+	exists, err := images.Exists(ctx, image, nil)
+	if err != nil {
+		return fmt.Errorf("failed to check if image exists: %w", err)
+	}
+	if exists {
+		log.Debugf("Image %s already exists, skipping pull.", image)
+		return nil
+	}
 	if _, err := images.Pull(ctx, image, &images.PullOptions{}); err != nil {
 		return fmt.Errorf("failed to pull image %s: %w", image, err)
 	}
-- 
2.49.0

@cgwalters Is this really necessary or was it for debugging? My idea here was to rely on the podman default for pulling the image. We might also choose to always pull the image, or pull when there is a new version, so this part doesn't seem completely correct.

@alicefr
Author

alicefr commented Jul 11, 2025

Thanks, I took a quick try at running the code and got Error: no such image: quay.io/containers/bootc-vm:latest: image not known - but we are trying to drop that right?

@cgwalters my bad, the reason this is failing is that you need to build the image first. We haven't released the image yet, of course, so before trying the command you need to run make image. I will add this to the description as well. Sorry for the misunderstanding.

@cgwalters
Contributor

@cgwalters Is this really necessary or was it for debugging? My idea here was to rely on the podman default for pulling the image. We might also choose to always pull the image, or pull when there is a new version, so this part doesn't seem

The podman/docker default is --pull=missing, but the code is implementing --pull=always, right?
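
If we want the bindings themselves to apply --pull=missing semantics, PullOptions also exposes a pull policy; a sketch, assuming PullOptions.Policy mirrors the CLI --pull flag and using the same imports as pkg/podman/podman.go:

// Alternative sketch: let the bindings apply --pull=missing semantics
// directly instead of checking images.Exists first.
func pullIfMissing(ctx context.Context, image string) error {
	policy := "missing"
	if _, err := images.Pull(ctx, image, &images.PullOptions{Policy: &policy}); err != nil {
		return fmt.Errorf("failed to pull image %s: %w", image, err)
	}
	return nil
}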

@cgwalters
Contributor

Before doing a further version of this code, I would really like to clarify this point.
Please keep in mind that if we don't want to run libvirt inside the container, we somehow need to assemble the bootc image filesystem and make it available to libvirt and QEMU in a way that they can boot from it. Podman can mount the container filesystem, but for rootless only with unshare, and then it is again problematic for libvirt on the host to access the bootc root filesystem.
If we run everything in a container, we avoid this problem completely.

Is it correct that the status quo for this PR is that we require a host binary and a container image? That seems like it's going to combine the disadvantages of both, and we'll need to think about skew, right?

The container image is basically "libvirt + vsock proxy + config", and aside from the vsock proxy we could get all of that from anywhere, right? E.g. in a "host libvirt" use case it can easily come from the host. I'd like to drill down into the vsock proxy a bit; ideally we can ship that somewhere by default (maybe in podman?).

--

Different topic: I don't understand why we are requiring a configdir; that seems just wrong. I did

diff --git i/cmd/install.go w/cmd/install.go
index 5bfa40a..c641a19 100644
--- i/cmd/install.go
+++ w/cmd/install.go
@@ -89,9 +89,6 @@ func (c *installCmd) validateArgs() error {
        if c.outputPath == "" {
                return fmt.Errorf("the output-path needs to be set")
        }
-       if c.configPath == "" {
-               return fmt.Errorf("the config-dir needs to be set")
-       }
        if c.containerStorage == "" {
                return fmt.Errorf("the container storage cannot be empty")
        }

The idea with bootc install is that the target configuration comes from the target image. (This is distinct from bootc-image-builder, which very confusingly also has a config.toml; that is a different thing which comes from outside the container. But at the moment we're not trying to automate bootc-image-builder here, though that also makes sense, see #58.)
