Create installation VM and run bootc install inside a VM using rootless podman #95
Conversation
Reviewer's Guide

This PR introduces an `install` command that creates an installation VM and runs `bootc install` inside it using rootless podman.

Sequence diagram for the new install command workflow:

```mermaid
sequenceDiagram
    actor User
    participant CLI as podman-bootc install
    participant Podman
    participant VMBuilder as VM Image Builder
    participant Libvirt
    participant VM
    participant Proxy as VSOCK Proxy
    User->>CLI: Run 'podman-bootc install ... -- bootc install ...'
    CLI->>Podman: Build VM image container (make image)
    CLI->>Podman: Extract disk image from container
    Podman-->>CLI: Disk image extracted
    CLI->>Libvirt: Define and start VM with disk image
    Libvirt->>VM: Launch VM with virtiofs and vsock
    CLI->>Proxy: Start vsock proxy (unix <-> vsock)
    Proxy->>VM: Bridge unix socket to VM's vsock port
    CLI->>Podman: Connect to Podman API inside VM via proxy
    Podman->>VM: Run 'bootc install' inside VM
    VM-->>Podman: bootc install output
    Podman-->>CLI: Installation complete
    CLI-->>User: Output results
```
@germag @cgwalters I added a first PR to create the installation VM (not for testing it yet). I'd love to have your opinion on the approach. I still find the install command pretty lengthy, and also the fact that the user needs to specify the full bootc commandline after the `--` separator.
Hey @alicefr - I've reviewed your changes and found some issues that need to be addressed.
Blocking issues:
- Inverted error check on Podman connection (link)
- Error message formatting has mismatched verbs and arguments (link)
General comments:
- In ExtractDiskImage the Podman connection error check is inverted (`if err == nil { return err }`), so it never returns real errors—change it to `if err != nil`.
- The temporary container created for exporting the VM disk is never removed, leading to dangling containers; add a cleanup step (e.g. containers.Remove) after export.
- fetchLogsAfterExit wraps a non-zero exit code in an error but never returns it, so failures are ignored—update it to return that error to propagate bootc errors.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- In ExtractDiskImage the Podman connection error check is inverted (`if err == nil { return err }`), so it never returns real errors—change it to `if err != nil`.
- The temporary container created for exporting the VM disk is never removed, leading to dangling containers; add a cleanup step (e.g. containers.Remove) after export.
- fetchLogsAfterExit wraps a non-zero exit code in an error but never returns it, so failures are ignored—update it to return that error to propagate bootc errors.
## Individual Comments
### Comment 1
<location> `pkg/podman/podman.go:86` </location>
<code_context>
+ return err
+ }
+ ctx, err := bindings.NewConnection(context.Background(), fmt.Sprintf("unix:%s", socketPath))
+ if err == nil {
+ return err
+ }
</code_context>
<issue_to_address>
Inverted error check on Podman connection
The check should be `if err != nil { return err }` to handle errors correctly.
</issue_to_address>
### Comment 2
<location> `pkg/podman/podman.go:108` </location>
<code_context>
+ if err := extractTar(pr, dir); err != nil {
+ return err
+ }
+ log.Debugf("Extracted disk at: %v", dir)
+ return nil
+}
</code_context>
<issue_to_address>
Dangling Podman container not cleaned up
Please remove the temporary container after exporting its filesystem to prevent resource leaks.
</issue_to_address>
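A hedged sketch of such a cleanup follows; the `containerID` variable is hypothetical, and the exact `containers.Remove` signature (and whether it returns removal reports) varies between podman bindings major versions:

```go
// After exporting the filesystem, remove the temporary container so it
// doesn't dangle; Force also covers containers still in the "created" state.
defer func() {
	force := true
	opts := &containers.RemoveOptions{Force: &force}
	if _, err := containers.Remove(ctx, containerID, opts); err != nil {
		log.Warnf("failed to remove temporary container %s: %v", containerID, err)
	}
}()
```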
### Comment 3
<location> `pkg/podman/podman.go:226` </location>
<code_context>
+ if err != nil {
+ return fmt.Errorf("failed to wait for container: %w", err)
+ }
+ if exitCode != 0 {
+ fmt.Errorf("bootc command failed: %d", exitCode)
+ }
+
</code_context>
<issue_to_address>
Non-zero exit code error isn’t returned
The error should be returned to ensure failures are not ignored. Replace with `return fmt.Errorf(...)`.
</issue_to_address>
### Comment 4
<location> `pkg/podman/podman.go:248` </location>
<code_context>
+ return fmt.Errorf("failed to start the bootc container: %v", err)
+ }
+
+ if err := fetchLogsAfterExit(ctx, name); err != nil {
+ return fmt.Errorf("failed executing bootc: %s %s: %v", err)
+ }
+
</code_context>
<issue_to_address>
Error message formatting has mismatched verbs and arguments
There are three placeholders in the format string, but only one argument is provided. Update the format string or supply the missing arguments to prevent a runtime panic.
</issue_to_address>
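One possible correction (a sketch): a single `%w` verb matching the single argument, which also keeps the underlying error unwrappable with errors.Is/errors.As:

```go
if err := fetchLogsAfterExit(ctx, name); err != nil {
	return fmt.Errorf("failed executing bootc: %w", err)
}
```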
### Comment 5
<location> `cmd/install.go:145` </location>
<code_context>
+ OutputPath: c.outputPath,
+ Root: false,
+ })
+ if c.installVM.Run(); err != nil {
+ return err
+ }
+
</code_context>
<issue_to_address>
Ignored error from InstallVM.Run due to incorrect if syntax
Use `if err := c.installVM.Run(); err != nil` to correctly capture and check the error from `Run()`. The current syntax ignores the returned error.
</issue_to_address>
### Comment 6
<location> `cmd/install.go:105` </location>
<code_context>
+ if c.outputPath == "" {
+ return fmt.Errorf("the output-path needs to be set")
+ }
+ if c.configPath == "" {
+ return fmt.Errorf("the output-path needs to be set")
+ }
+ if c.containerStorage == "" {
</code_context>
<issue_to_address>
Error message references wrong flag for config-dir
The error message for `configPath` should mention `config-dir`, not `output-path`. Please correct the error text.
</issue_to_address>
### Comment 7
<location> `pkg/vm/installvm.go:129` </location>
<code_context>
+ }
+ defer dom.Free()
+ if err := dom.Destroy(); err != nil {
+ logrus.Warning("Failed to destroy the domain %s, maybe already stopped: %v", vm.domain, err)
+ }
+ if err := dom.Undefine(); err != nil {
</code_context>
<issue_to_address>
`logrus.Warning` used with format string but isn't formatted
Replace `logrus.Warning` with `logrus.Warningf` to correctly handle the format string.
</issue_to_address>
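Sketch of the fix, using the printf-style variant so the format verbs are actually interpolated:

```go
logrus.Warningf("Failed to destroy the domain %s, maybe already stopped: %v", vm.domain, err)
```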
### Comment 8
<location> `pkg/vm/domain/domain.go:220` </location>
<code_context>
+ args := []string{"info", imagePath, "--output", "json"}
+ cmd := exec.Command(path, args...)
+ logrus.Debugf("Execute: %s", cmd.String())
+ stderr, err := cmd.StderrPipe()
+ if err != nil {
+ return "", fmt.Errorf("failed to get stderr for qemu-img command: %v", err)
</code_context>
<issue_to_address>
Using `StderrPipe` with `cmd.Output` may deadlock or lose stderr
`cmd.Output()` should not be used with `cmd.StderrPipe()` as it can cause deadlocks or missing stderr output. Use `CombinedOutput()` or handle stdout and stderr manually with `cmd.Start()`.
</issue_to_address>
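One way to fix this, as a sketch: drop the `StderrPipe` and rely on `cmd.Output()`, which captures stderr into the returned `*exec.ExitError` whenever `cmd.Stderr` is nil (standard library behavior). The function name and `(string, error)` signature here are assumed from the surrounding context; imports of `errors`, `fmt`, `os/exec`, and logrus are implied.

```go
func qemuImgInfo(path, imagePath string) (string, error) {
	args := []string{"info", imagePath, "--output", "json"}
	cmd := exec.Command(path, args...)
	logrus.Debugf("Execute: %s", cmd.String())
	out, err := cmd.Output()
	if err != nil {
		var exitErr *exec.ExitError
		if errors.As(err, &exitErr) {
			// exitErr.Stderr holds qemu-img's error output.
			return "", fmt.Errorf("qemu-img info failed: %v: %s", err, exitErr.Stderr)
		}
		return "", fmt.Errorf("qemu-img info failed: %v", err)
	}
	return string(out), nil
}
```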
### Comment 9
<location> `Makefile:26` </location>
<code_context>
e2e_test: all
ginkgo -tags $(build_tags) ./test/...
+image:
+ podman build -t $(vm_image) --device /dev/kvm \
+ -f containerfiles/vm/Containerfile \
</code_context>
<issue_to_address>
Mark `image` as a phony target
Add `image` to `.PHONY` to prevent issues if a file named `image` exists.
</issue_to_address>
### Comment 10
<location> `pkg/vsock/proxy.go:121` </location>
<code_context>
+ }
+}
+
+func (proxy *Proxy) proxyFileToConn(ctx context.Context, file *os.File, conn net.Conn, errCh chan error) {
+ go func() {
+ _, err := io.Copy(conn, file)
</code_context>
<issue_to_address>
Consider merging the two proxy helper methods and related context logic into a single, symmetric io.Copy pattern to simplify the code.
You can collapse the two `proxyFileToConn`/`proxyConnToFile` helpers and the `context`/`select` machinery into a single, symmetric `io.Copy` pattern. This greatly reduces nesting and boilerplate while preserving cancellation via `p.done` in the accept loop.
1) Simplify Start/accept loop:
```go
func (p *Proxy) Start() error {
_ = os.Remove(p.socket)
ln, err := net.Listen("unix", p.socket)
if err != nil {
return fmt.Errorf("listen %q: %w", p.socket, err)
}
go func() {
defer ln.Close()
for {
conn, err := ln.Accept()
if err != nil {
select {
case <-p.done:
return
default:
logrus.Warnf("accept error: %v", err)
continue
}
}
go p.handleConnection(conn)
}
}()
logrus.Debugf("Started proxy at: %s", p.socket)
return nil
}
```
2) Flatten handleConnection with two `io.Copy` goroutines:
```go
func (p *Proxy) handleConnection(unixConn net.Conn) {
defer unixConn.Close()
fd, err := unix.Socket(unix.AF_VSOCK, unix.SOCK_STREAM, 0)
if err != nil {
logrus.Errorf("vsock socket error: %v", err)
return
}
sa := &unix.SockaddrVM{CID: uint32(p.cid), Port: uint32(p.port)}
if err := unix.Connect(fd, sa); err != nil {
logrus.Errorf("vsock connect error: %v", err)
_ = unix.Close(fd)
return
}
vconn := os.NewFile(uintptr(fd), "vsock")
if vconn == nil {
logrus.Error("failed to wrap vsock fd")
_ = unix.Close(fd)
return
}
defer vconn.Close()
// symmetrical copy
errCh := make(chan error, 2)
go func() { _, err := io.Copy(vconn, unixConn); errCh <- err }()
go func() { _, err := io.Copy(unixConn, vconn); errCh <- err }()
if err := <-errCh; err != nil && err != io.EOF {
logrus.Errorf("proxy copy error: %v", err)
}
}
```
3) Remove the unused `proxyFileToConn`, `proxyConnToFile` methods and the `context` imports. This preserves all behavior but slashes lines of code and cognitive overhead.
</issue_to_address>
### Comment 11
<location> `pkg/vm/domain/domain.go:89` </location>
<code_context>
+ }
+}
+
+func allocateDevices(d *libvirtxml.Domain) {
+ if d.Devices == nil {
+ d.Devices = &libvirtxml.DomainDeviceList{}
</code_context>
<issue_to_address>
Consider introducing a WithDevices helper to eliminate repeated device allocation and append logic in each With* function.
```go
// Add this helper to collapse the allocateDevices + append boilerplate:
func WithDevices(fn func(devs *libvirtxml.DomainDeviceList)) DomainOption {
return func(d *libvirtxml.Domain) {
if d.Devices == nil {
d.Devices = &libvirtxml.DomainDeviceList{}
}
fn(d.Devices)
}
}
```
Then refactor each With* to use it. For example, replace:
```go
func WithFilesystem(source, target string) DomainOption {
return func(d *libvirtxml.Domain) {
allocateDevices(d)
d.Devices.Filesystems = append(d.Devices.Filesystems, libvirtxml.DomainFilesystem{
Driver: &libvirtxml.DomainFilesystemDriver{Type: "virtiofs"},
Source: &libvirtxml.DomainFilesystemSource{Mount: &libvirtxml.DomainFilesystemSourceMount{Dir: source}},
Target: &libvirtxml.DomainFilesystemTarget{Dir: target},
})
}
}
```
with:
```go
func WithFilesystem(source, target string) DomainOption {
return WithDevices(func(devs *libvirtxml.DomainDeviceList) {
devs.Filesystems = append(devs.Filesystems, libvirtxml.DomainFilesystem{
Driver: &libvirtxml.DomainFilesystemDriver{Type: "virtiofs"},
Source: &libvirtxml.DomainFilesystemSource{Mount: &libvirtxml.DomainFilesystemSourceMount{Dir: source}},
Target: &libvirtxml.DomainFilesystemTarget{Dir: target},
})
})
}
```
And similarly for WithDisk, WithSerialConsole, WithInterface, and WithVSOCK:
```go
func WithDisk(path, serial, dev string, diskType DiskDriverType, bus DiskBus) DomainOption {
return WithDevices(func(devs *libvirtxml.DomainDeviceList) {
devs.Disks = append(devs.Disks, libvirtxml.DomainDisk{
Device: "disk",
Driver: &libvirtxml.DomainDiskDriver{Name: "qemu", Type: diskType.String()},
Source: &libvirtxml.DomainDiskSource{File: &libvirtxml.DomainDiskSourceFile{File: path}},
Target: &libvirtxml.DomainDiskTarget{Bus: bus.String(), Dev: dev},
Serial: serial,
})
})
}
```
This cuts out the repeated `allocateDevices(d)` and anonymous wrapper in each With* function, reducing boilerplate while preserving all behavior.
</issue_to_address>
Hmm so in this model we'd ship that container image at e.g. … Or do you see the fact that this is a host binary as just a transient state? (As I was arguing in my prototype of bootc-kit). The host binary vs container image is a really important decision that impacts a lot of things... We could I guess to some degree try to do both...but it could get ugly.

BTW this also relates strongly to bootc-dev/bootc#1359 (comment) in that I think having e.g. …

A use case we will definitely want to support is one where we are invoked as part of other tooling. For people that want to do things via containers I hope it will feel really natural to basically do …
```go
func DefaultContainerStorage() string {
	usr, err := user.Current()
	if err == nil && usr.Uid != "0" {
```
We should use `podman system info --format=json`; there may be an API for this? But I dunno, I have no problem with fork/exec where it's easy personally.
If you care to connect to the service using a binding, it is there, yeah. Else you could also consider `podman info --format '{{.Store.GraphRoot}}'`.
pkg/vsock/proxy.go:

```go
	"golang.org/x/sys/unix"
)

const podmanVMProxyDefault = "/tmp/podman-bootc-proxy.sock"
```
Hopefully we can avoid a predictable filename in `/tmp`... probably via using `/run`?
Yes, definitely
pkg/vm/domain/domain.go:

```go
}

func WithOS() DomainOption {
	// TODO: fix this for multiarch
```
Can't we omit this and have libvirt pick a default?
containerfiles/vm/Containerfile:

```dockerfile
FROM quay.io/fedora/fedora:42 as builder

ENV URL https://download.fedoraproject.org/pub/fedora/linux/releases/42/Cloud/x86_64/images
```
We had a live chat about this before but: the more I think about this the stronger I feel that in the general case we really do need to use the target kernel to install.

It's the only way we can handle skew where only the target kernel knows how to do a particular filesystem type. Yeah today the Fedora kernel enables most filesystem types, but say (before bcachefs was merged) someone wanted to use that for their systems. With this model they'd have to override both the build environment container and the target container image.

Further, I think we can still run into issues where (unless one is careful with filesystem feature flags) this kernel may use filesystem feature flags not enabled by default in the target.

So I think the path we should go down here is where we do a direct kernel boot w/qemu, fetching the kernel from the target bootc image (which should always be in `/usr/lib/modules/`; we could add a helper command to extract that).

And if we're using the target kernel, there's going to be skew unless we use the target userspace too...

And so yes this sounds like a "build a hammer to make a hammer" problem but I think we can get out of that by building on the direct kernel boot above and then running virtiofsd against the target container's merged root.

In other words we create a workflow "run a bootc container as an ephemeral VM w/virtiofs root" basically.

I think it'd be a big advantage to not need to maintain (and have users keep updated) this "builder VM" image.
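For illustration, a hedged sketch of the extraction helper hinted at above. The names are ours, not from the PR; it assumes the target image's root has already been mounted or extracted, and that bootc images ship exactly one kernel under /usr/lib/modules (as the volume example later in this thread shows):

```go
package kernel

import (
	"fmt"
	"os"
	"path/filepath"
)

// findKernel returns the vmlinuz and initramfs paths inside an image root,
// e.g. /usr/lib/modules/<version>/{vmlinuz,initramfs.img}.
func findKernel(root string) (vmlinuz, initramfs string, err error) {
	modules := filepath.Join(root, "usr/lib/modules")
	entries, err := os.ReadDir(modules)
	if err != nil {
		return "", "", err
	}
	if len(entries) != 1 {
		return "", "", fmt.Errorf("expected exactly one kernel under %s, found %d", modules, len(entries))
	}
	dir := filepath.Join(modules, entries[0].Name())
	return filepath.Join(dir, "vmlinuz"), filepath.Join(dir, "initramfs.img"), nil
}
```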
> In other words we create a workflow "run a bootc container as an ephemeral VM w/virtiofs root" basically.

Which would be super useful as a general thing (especially for tests). This overlaps a lot with osbuild/image-builder-cli#189

I wonder actually if it's core enough that we could put the functionality for this directly...in podman perhaps? Really today even on Linux podman already links in a ton of code to do qemu/virtiofsd.

Doing some searching around this...I had forgotten about https://github.com/containers/crun-vm ...so much going on in this space of VMs and containers and bootc! I now feel bad for basically never trying to add that one into my "inner loop".

I tried out that project right now and none of the examples work for me (not the containerdisk one or the bootc one)...but I think we can make it work. EDIT: apparently I was running as root and that doesn't work. It's pretty good Rust code...

I think especially it's already in a place where it'd be well positioned to at least do this ephemeral VM w/virtiofs root approach.
To say this a different way, both this PR and krun (eventually via libkrunfw) would have the problem that if we wanted to productize it for RHEL, we'd need to change this so we aren't trying to ship a different kernel (and userspace!) build.
Thanks Colin for the review. Those are really valid points. We could use podman create + mount in order to assemble the bootc container filesystem, and then boot from the artifacts in the merged directory. The current issue I'm facing is that for rootless podman, mount requires unshare, and then libvirt cannot find the directory with the kernel, initrd and root since it runs in a different mount namespace.

We could extract the entire bootc image filesystem, but it is quite large to do each time, and I'd like to find a smarter solution if possible.

Extracting the kernel and the initrd is pretty straightforward, but we are always missing the rootfs.
@cgwalters An alternative can be to use the image volumes. It is already supported by podman and kubernetes, so eventually we could use it there as well.

In this way, the bootc image can be made available to any other container, including the one with the sdk.

Example:

```
$ podman volume create --driver image --opt image=quay.io/centos-bootc/centos-bootc:stream9 bootc-vol
$ podman run -ti --rm -v bootc-vol:/bootc-data quay.io/fedora/fedora:latest ls /bootc-data/lib/modules/5.14.0-590.el9.x86_64/{vmlinuz,initramfs.img}
/bootc-data/lib/modules/5.14.0-590.el9.x86_64/initramfs.img /bootc-data/lib/modules/5.14.0-590.el9.x86_64/vmlinuz
```

I think this is a pretty elegant approach to make it flexible based on the desired bootc image.

However, this implies that libvirt needs to have access to the bootc volume, hence probably running in the same container as the sdk, or a second container using the same volume.
> Doing some searching around this...I had forgotten about https://github.com/containers/crun-vm ...so much going on in this space of VMs and containers and bootc! I now feel bad for basically never trying to add that one into my "inner loop".

The problem with crun-vm is that libvirt is still running on the host, not inside the container. First, it won't have access to the image volume. Second, it also needs to be installed on the target host, while if we ship everything in a container image (including libvirt), then it doesn't need to be available on the target host.

The second point might not be strictly necessary if we only plan to deploy this on a local laptop, but better if we use the sdk image in a distributed CI system.
> An alternative can be to use the image volumes

Yes. Though note it's also possible from the CLI and API to skip the explicit volume create step:

```
$ podman run --rm -ti --mount=type=image,source=quay.io/fedora/fedora-bootc:42,target=/target busybox ls -al /target/usr/lib/modules
total 0
drwxr-xr-x    3 root     root            36 Jun 11 03:47 .
drwxr-xr-x   41 root     root           153 Jun 11 03:47 ..
drwxr-xr-x    7 root     root            27 Jun 11 03:47 6.14.9-300.fc42.x86_64
$
```

> However, this implies that libvirt needs to have access to the bootc volume, hence, probably running in the same container of the sdk, or a second container using the same volume.

Right, in the general case (and relating to your previous comment) I think we probably need to handle three different cases:

1. Running on Linux, unprivileged podman. Most of the time this is a Linux desktop. I think it's fine to assume some software gets installed on the host, e.g. qemu, and we can assume the host has e.g. libvirt (at least optionally, for persistent pet machines).
2. Running on Linux, root in e.g. a Github Action runner (w/nested KVM available). This is basically the "headless CI job" case of the above. Of course it could also be non-root.
3. Running in Kubernetes with `/dev/kvm` mounted in (current coreos-assembler flow). Here we must have all userspace (e.g. qemu) in the sdk image. We can probably assume https://kubernetes.io/docs/tasks/configure-pod-container/image-volumes/ is available.

I think it's a bit of an open question if the middle case looks more like the first or the last. For GHA, I think it's totally sane to have a setup phase which does `apt install libvirt` etc.
For the third case, in kubernetes, we will need a device plugin which allocates /dev/kvm for the container. Otherwise, it will require running the pod as privileged to use hostpath and a volume.
> For the third case, in kubernetes, we will need a device plugin which allocates /dev/kvm for the container.

Yeah I made https://github.com/cgwalters/kvm-device-plugin

That said, I did that because at the time the people running the cluster didn't want to scope in installing KubeVirt. But that's long in the past, and maybe it makes sense to just require KubeVirt, even if the use case is "parallel" to it.
Yes, KubeVirt would solve the problem. They should also now have support for image volumes, but I need to double check, it was an ongoing effort. They support direct boot from containerdisk; the part I'm not sure about is how to pass the rootfs with virtiofs from an image volume. This part might still be missing.
@germag @cgwalters after the discussion in #95 (comment), not sure if podman-bootc is the right place for this. My main reason for choosing it was that it is written in go and has the concept of running bootc VMs. I think golang has the advantage over rust that we have libvirtxml, and libvirt and podman bindings, which simplify the integration quite a bit. The command is pretty self-contained so it can be moved to a standalone tool.
Well...I think an outcome of the work here is definitely that we have a "podman-bootc successor", and there's some useful code here for sure. For example #58 is still relevant too. I'd just say again I think an issue is that the scope of what we need from "sdk" is a lot larger than what this project does today (and the above wrapper for b-i-b is taking it beyond what either "podman" is (e.g. uploading AMIs) or what bootc is itself); also adding things like the "install with anaconda" wrapper flow or cloud provisioning too. That said what may be really nice is to have a "core sdk" which only deals with operating system independent things (e.g. only knows about bootc, podman, systemd, libvirt say) that lives at...

Anyways though, procedurally I don't see a problem in landing code here. Or we could fork it if that's easier. This is all definitely a very complex topic because this project has the gravity of being a native binary today, but today bootc-image-builder comes as a container (and PRs like the above need to deal with skew between them), and then other things (exactly like the cloud image container added here) also come as a container, and I think things will be simplest for us in the longer term if we basically always have the SDK logic run in a container (like coreos-assembler), with just an optional shim binary that can be installed on the host as a convenient shortcut.
```go
	}
	usr, err := user.Current()
	if err == nil && usr.Uid != "0" {
		return "/run/user/" + usr.Uid + "/podman/podman.sock"
```
This can be had from `podman info` or via the bindings.
@cgwalters Based on the recent discussions I tried something different. This code doesn't reflect my experiments yet, but they are available at https://github.com/alicefr/bootc-appliance-demo/tree/dynamic-boot. Since we want to use the target bootc image as installation environment, I have switched to direct boot using the bootc image as rootfs.

Please note that the VM has access to the host container storage, so we can use the original bootc image to perform the installation. This picture summarizes the steps visually. I think this setup facilitates the 3 scenarios since we have already containerized the virt stack. We can always move podman-bootc and podman inside their own container. We control podman remotely through a socket. So, this can be easily shared with a volume in case we decide to deploy it separately.
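As a minimal sketch of the host side of that control path (the socket path matches the vsock-proxy example further down this thread; `bindings` is the podman Go bindings package this PR already uses):

```go
// Connect to the podman API inside the VM, exposed on the host as a unix
// socket by the vsock proxy; the returned context carries the connection.
ctx, err := bindings.NewConnection(context.Background(), "unix:/run/podman/podman-vm.sock")
if err != nil {
	return err
}
// Subsequent bindings calls, e.g. containers.List(ctx, nil), are then
// served by the podman instance running inside the installation VM.
```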
@cgwalters please let me know what you think and I can translate the scripts in my repo into go code and update this PR accordingly.
That picture is cool 😄 One point I'll make is I'd like to not "stop energy" ourselves; I'd love to get something useful here even if it's not ideal, and we can always iterate from there - perfection being the enemy of the good etc. So to that point, from my PoV I'd say if you have something even partially working here, I think we could really consider pretty quickly merging it into podman-bootc git main (as long as we're not breaking the functionality that exists now).
One thing I'll say here is there are two competing things in this install phase: …

One thing I have definitely been thinking about more is whether we could push more for using systemd … Anyways, let's keep up with using …
First we should default to taking the target image as an argument, I did …

I don't think we need to do a build of a new image (for that we'd need to consider GC, concurrency issues) - we can just run the target container image and inject that configuration dynamically on top as a command, right? Think of it like a Kubernetes pod with an injected entrypoint shell script.

To be clear then, is this …
I think a good way to describe a target goal to start is: "The functionality of podman-bootc but without the requirement for podman machine on Linux, which we handle by creating a transient lightweight bootstrap VM using the target container image itself" or so. Maybe I was too aggressive in saying though that this tool should come as a container image itself. A native host binary may actually be a lot more ergonomic in the end, or again we may in the limit need to do both to some degree.
Sure, what can be done in the additional layer can be done dynamically. The thing I like in splitting the VM configuration and the booting is that we can cache the intermediate image/layer with the dependencies. For example, if we want to do multiple builds from the same image or something fails in between, then we can start from this layer. Additionally, it works well with image volumes.

Yes, it is a layer for caching the installation VM configuration steps generated dynamically on top of the bootc image.
I think if we keep the virtualization stack in the container in all the scenarios, and on the host only podman-bootc and podman, then it becomes very easy to containerize them if necessary, since it basically becomes podman run inside a container.
Had to do:

```diff
diff --git a/demo.sh b/demo.sh
index 126ec81..8967021 100755
--- a/demo.sh
+++ b/demo.sh
@@ -38,7 +38,6 @@ podman run -td --name libvirt \
         -v /dev/kvm:/dev/kvm \
         -v /dev/vhost-net:/dev/vhost-net \
         -v /dev/vhost-vsock:/dev/vhost-vsock \
-        -v /dev/vsock:/dev/vsock \
         -v $OUTPUT_DIR:/usr/lib/bootc/output \
         -v $CONFIG_DIR:/usr/lib/bootc/config \
         -v $CONT_STORAGE:/usr/lib/bootc/container_storage \
```

But otherwise it worked well. Nice job!

Some feedback: I love the idea of booting a VM right off of the container image. Almost feels like that flow could itself be the main experience in a developer's iteration cycle? It doesn't seem very different from what you'd get if you were to build live PXE artifacts from your container image and boot from that. What's really interesting though is that it defers the more expensive "create disk image" path until you need it (and sure, for some developers, maybe you will almost always need that depending on what you're hacking, but for end users wanting to iterate on their configuration bits, it's good enough while providing more fidelity than testing as a container image).
Hmm, looking at the Containerfile, it's mostly writing files so it doesn't seem worth caching. The exception is …
On the containerize vs native discussion, when you think about the building and testing-as-a-container cycle locally, all you need is podman. I think it's a compelling story if the additional functionality we build here (testing-as-a-VM, testing via Anaconda, etc...) doesn't expand that requirement. And yeah, as mentioned, that plays itself out in instrumenting this in CI/k8s too. Relatedly, to repeat something from #28, I find the "run a bootable OCI image as a VM with a single podman command" story really powerful. I still use my code from there even now even though it's not as efficient. Not saying it should be a hard requirement or anything, but it deserves some weight I think in the design.
Ah, yes nice catch!

Socat can be replaced by a tiny binary which does the proxy. It can be statically compiled and injected dynamically eventually, so it isn't a hard requirement. Ideally, we could also extend podman to support the vsock protocol, and the proxy won't be required anymore. So, AFAIU, you would prefer to skip the intermediate image, and inject the pre-steps every time we start the container?
I think since we have moved the virt stack inside the container, the only part that is now on the host are the client libraries to talk to libvirt and podman remote. So, this command can now easily be integrated in podman or be containerized. The requirements for this setup are: …
Right, sorry I forgot about that PR in the mix of all the things here. Thanks so much for starting that and agree it should carry a lot of weight here!
BTW I think …
@jlebon @cgwalters so you are both against the intermediate image layer? It's just to know how to proceed, so then I can translate into code what I have in my demo repo.
I guess as a general rule what I'd say is we should as much as possible avoid side effects on other global state. The intermediate image layer is just one example of that. Another is the …

I think it will be a lot cleaner without that intermediate layer, but we should not let perfection be the enemy of the good - so if it helps, please do just do that in the initial translation! There's so many shell scripts and such in this area with various suboptimal tradeoffs that I think it's not a super high bar to clear to do better.

Wait sorry though - backing up to a higher level, where are we with having src == target (ref previous thread)? That one I'm more concerned about; for some reason I had thought we'd resolved that, but the current demo.sh is bootstrapping via Fedora Cloud. WDYT about that idea of bootstrapping via a virtiofs root? Another alternative I can see here is where we support a …
Thanks for the review. I will try to avoid the intermediate layer and reduce the shell scripts. A lot of them can be converted into the sdk code.
@alicefr how's it going? This issue is pretty near the top of my list, anything I can help with?
btw @jlebon this "build using kernel from target" is among the big reasons why I think what we do in coreos-assembler today should also be replaced by this. It just causes "host contamination" really to do anything else (esp in a world where coreos-assembler is always using fedora content)
@cgwalters I'm finishing the second version, sorry, I'm splitting my time and not very quick on this. I hope I'm able to push the new version today.
The vm image contains the virtualization stack to launch the virtual machine and the files to compose the VM configuration in the bootc-data volume. The entrypoint prepares the bootc-data volume so it is able to boot from the bootc image mounted at that location, plus the configurations required by the installation. The systemd services mount the virtiofs targets, while the podman-vsock-proxy starts the proxy from VSOCK to the local unix socket for podman. The virtiofs-wrapper script is required in order to add extra options when virtiofs is launched by libvirt. This is necessary in order to correctly launch virtiofs inside an unprivileged container. Signed-off-by: Alice Frosi <[email protected]>
The podman package abstracts the methods interacting with podman. It mainly contains the methods to launch the vm container with libvirt and QEMU, and to create the remote container for bootc inside the VM. Signed-off-by: Alice Frosi <[email protected]>
The proxy translates the unix socket to the vsock and vice versa:

```
$ vsock-proxy --log-level debug -s /run/podman/podman-vm.sock -p 1234 \
    --cid 3 --listen-mode unixToVsock
```

Based on the listen-mode flag, it proxies the connection in the corresponding direction. In the above example, for each connection at the unix socket specified by -s, it connects to the VM with cid 3 and port 1234. Signed-off-by: Alice Frosi <[email protected]>
Go doesn't offer a generic method to convert a type into a pointer to it. This method is practical for the podman bindings, where there are several fields which are pointers; without this method, you would need to define a variable and then pass its address. E.g. for booleans: `foo := utils.Ptr(true)` instead of `t := true; foo := &t`. Signed-off-by: Alice Frosi <[email protected]>
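A minimal sketch of the helper this commit describes, assuming Go 1.18+ generics (the exact name and package in the PR may differ):

```go
package utils

// Ptr returns a pointer to any value, avoiding the temporary-variable dance
// that the podman bindings' many pointer fields would otherwise require.
func Ptr[T any](v T) *T {
	return &v
}
```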
The domain package helps to abstract and build a libvirt domain. Signed-off-by: Alice Frosi <[email protected]>
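As a hedged illustration of how such a package composes: the `With*` option signatures below follow the review comments earlier in this thread, while the disk-type and bus constants are hypothetical placeholders, not names from the PR.

```go
// Build the installation VM's libvirt domain XML by applying options.
func buildInstallDomain() (string, error) {
	d := &libvirtxml.Domain{Type: "kvm", Name: "bootc-install-vm"}
	for _, opt := range []DomainOption{
		WithFilesystem("/var/lib/containers/storage", "bootc-data"), // virtiofs share
		WithDisk("/output/output.qcow2", "output", "vda", DiskDriverTypeQCOW2, DiskBusVirtio),
	} {
		opt(d)
	}
	// libvirtxml serializes the assembled domain definition to XML.
	return d.Marshal()
}
```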
The installation VM is a temporary VM used for running privileged commands using rootless podman on the host. This VM boots from the kernel and initrd available in the bootc image. As rootfs, it uses the rootfs from the bootc container image, served via virtiofs. Signed-off-by: Alice Frosi <[email protected]>
The install command runs bootc install inside the installation VM. In this way, it is possible to run privileged commands using rootless podman. The install command requires specifying:

- the bootc image to use for the installation
- the configuration directory where to find the config.toml for configuring the output image
- the container storage directory; this is passed up to the remote container in order to perform the bootc installation using the container image
- the output directory where to locate the build artifacts
- the name of the output disk image

Example:

```
$ podman-bootc install --bootc-image quay.io/centos-bootc/centos-bootc:stream9 \
    --output-dir $(pwd)/output --output-image output.qcow2 --config-dir $(pwd)/config \
    -- bootc install to-disk /dev/disk/by-id/virtio-output --wipe
```

Signed-off-by: Alice Frosi <[email protected]>
@cgwalters @germag PTAL. The new version doesn't include the intermediate image layer anymore. It boots the VM from the bootc image, and the container entrypoint prepares the configuration for the installation. If you like the current approach I can add the unit tests and clean the code a bit further. Thanks!
We do nowadays always use the target userspace to create disk images (well, technically for RHCOS only, but it matters less for FCOS anyway since it's already Fedora). So e.g. the …
This is true, but there's still a huge possibility of skew I think.
Thanks, I took a quick try at running the code and got … I tried this: …

Which got me a little farther, to an error about a stopped container; haven't debugged that yet.
OK I think I see the overall issue here. There's a large tension here in whether we want the virt stack on the host or in a container image. Chatting with @germag yesterday, he brought up the pain point that having container images as CLI tools makes integration with the shell/system annoying. So yes, maybe it's less painful to just require virt tools on the host. The binary already links to libvirt.so so it's not really portable, and we may as well just require host tools. I think it'd still be good to support running in a container image with virt tools embedded, but I see eventually this tooling being embedded in larger container images (e.g. like coreos-assembler), so us having a prebuilt container image is of less value. (Though it's nonzero)
Here are two patches I wrote while debugging so far: …
With the first patch, it's seemingly quite involved to do an … Related to this, previously: #61

That took me a long time to debug and I think that issue will come back if we keep trying to use the exec/attach API here. Yeah, forking processes feels unclean sometimes, but...
Before doing a further version of this code, I really would like to clarify this point. If we run everything in a container, we avoid this problem completely.
Not sure I really understand this point, what are the issues here?
True, we can also containerize it.
It is always possible to move podman-bootc into a container. The only thing it requires is the ability to control the podman on the host, and this is more or less how docker-in-docker works.
I will take a look, sorry for the bug.
@cgwalters @germag the other point I really would like to address is that we should simplify the command line. As a first draft, I'm passing the full bootc cmdline, but this is wrong imo. Certain parts of the command line are fixed, like the paths and how the device shows up inside the guest. This part requires further tuning as well.
@cgwalters I'm not a big fan of calling podman directly, especially if we want to containerize this. Otherwise, it will require having podman installed as well, while we might just use podman remote.
@cgwalters Is this really necessary or was it for debugging? My idea here was to rely on the podman default for pulling the image. We might also choose to always pull the image, or only when we have a new version, so this part doesn't seem completely correct.
@cgwalters my bad, the reason why this is failing is that you need to build the image first. We haven't, of course, released the image yet, so before trying the command, you need to run `make image`.
The podman/docker default is …
Is it correct that the status quo for this PR is that we require a host binary and a container image? That seems like it's going to combine the disadvantages of both, and we'll need to think about skew, right? The container image is basically "libvirt + vsock proxy + config" and aside from the vsock proxy we could get that from anywhere, right? e.g. in a use case for "host libvirt" it can easily come from the host. I'd like to drill down into the vsock proxy a bit; ideally we can ship that somewhere by default (maybe in podman?).

Different topic: I don't understand why we are requiring a …
The idea with …
This PR adds a new `install` command in order to run bootc commands using rootless podman inside a VM. First, `make image` builds a container image including the virtualization stack. The `install` command creates a container with libvirt and boots the VM from the bootc image. Together with the VM, it spawns a proxy creating a unix socket and connecting to the VSOCK port of the installation VM. Once the connection is established, it is able to remote-control the podman inside the VM and execute the bootc installation. The output disk image is passed to the VM and identified by the device `/dev/virtio-output`. This path remains constant.

Note: the image used by the `install` command hasn't been released yet (since this is part of this PR), so you need to build it locally before being able to run the command. You can build it by running `make image`.

Example:

```
$ podman-bootc install --bootc-image quay.io/centos-bootc/centos-bootc:stream9 \
    --output-dir $(pwd)/output --output-image output.qcow2 --config-dir $(pwd)/config \
    -- bootc install to-disk /dev/disk/by-id/virtio-output --wipe
```

Main differences with podman machine: …