Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

number of loop devices is fixed and unpredictable #1248

Closed
pohly opened this issue Jan 13, 2020 · 10 comments · Fixed by kubernetes/test-infra#16212
Closed

number of loop devices is fixed and unpredictable #1248

pohly opened this issue Jan 13, 2020 · 10 comments · Fixed by kubernetes/test-infra#16212
Assignees
Labels
kind/bug Categorizes issue or PR as related to a bug. lifecycle/active Indicates that an issue or PR is actively being worked on by a contributor. priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now.

Comments

@pohly
Copy link
Contributor

pohly commented Jan 13, 2020

What happened:

Block device tests running inside the KinD cluster are flaky (kubernetes-csi/csi-driver-host-path#119). It looks like this is because the Docker container in which kubelet runs has a static copy of the host's /dev. In devfs, new loop devices are added by the kernel as needed. But inside the container only those /dev/loop* devices are available which existed already on the host when the container was created, causing losetup to fail:

losetup: /tmp/testfile: failed to set up loop device: No such file or directory

What you expected to happen:

/dev needs to be shared with the host. This isn't nice from a security perspective, but I don't see another solution.

How to reproduce it (as minimally and precisely as possible):

Create a KinD cluster, pick one container with docker ps, then run losetup repeatedly:

$ docker exec f049e80f378a sh -c 'ls /dev/loop*'
/dev/loop-control
/dev/loop0
/dev/loop1
/dev/loop2
/dev/loop3
/dev/loop4
/dev/loop5
/dev/loop6
/dev/loop7
$ docker exec f049e80f378a sh -c 'truncate -s 1G /tmp/testfile; losetup -f --show /tmp/testfile'
/dev/loop0
....
$ docker exec f049e80f378a sh -c 'truncate -s 1G /tmp/testfile; losetup -f --show /tmp/testfile'
/dev/loop7
$ docker exec f049e80f378a sh -c 'truncate -s 1G /tmp/testfile; losetup -f --show /tmp/testfile'
losetup: /tmp/testfile: failed to set up loop device: No such file or directory

Environment:

  • kind version: (use kind version): v0.6.0-alpha
@pohly pohly added the kind/bug Categorizes issue or PR as related to a bug. label Jan 13, 2020
@BenTheElder BenTheElder self-assigned this Jan 13, 2020
@BenTheElder
Copy link
Member

I'd generally be concerned mounting all of dev that we'll wind up touching other devices.

  • If we pre-allocated more on the prow nodes would that work?
  • How many do we need? is it predictable at all?
  • Can we reduce our dependency on this and / or use something else that is namespaced?

@pohly
Copy link
Contributor Author

pohly commented Jan 14, 2020

If we pre-allocated more on the prow nodes would that work?

Yes, but it's just a band-aid. But sometimes those are enough 😁

How many do we need? is it predictable at all?

It's not predictable how many will be needed. Pre-allocation would impose a limit that doesn't exist otherwise. On the other hand, calling mknod /dev/loopX b 7 X a thousand times would be cheap (it just creates a device node entry in the filesystem, without actually creating the device in the kernel) and that "should be enough for everyone" (famous last words...).

Can we reduce our dependency on this and / or use something else that is namespaced?

In kubelet, the loop device is used to keep a file descriptor for the device and thus the device itself open without having a process running which owns that file descriptor. I'm not aware of any other Linux kernel mechanism which achieves that.

The CSI hostpath driver could perhaps provide a block device via some other way (Network Block Device = NBD), but a loop device is the simplest approach.

What might work is to change the implementation of https://github.com/kubernetes/kubernetes/tree/master/pkg/volume/util/volumepathhandler:

  • instead of invoking losetup, do the necessary ioctl calls directly
  • in the situation where losetup fails to open a non-existent /dev/loopX, create that device node and continue

@BenTheElder
Copy link
Member

/priority important-soon
too many things going on :/

prow would probably accept running a daemonset that allocated many loop devices on the build nodes if we could hack one up, we have a daemonset right now for increasing inotify watches ...

@k8s-ci-robot k8s-ci-robot added the priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. label Jan 30, 2020
@pohly
Copy link
Contributor Author

pohly commented Feb 3, 2020

@BenTheElder Do you have a pointer to how that existing daemonset is written?

@BenTheElder BenTheElder added the priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. label Feb 10, 2020
@BenTheElder BenTheElder added lifecycle/active Indicates that an issue or PR is actively being worked on by a contributor. and removed priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. labels Feb 10, 2020
@BenTheElder
Copy link
Member

bumping up due to flakes, we'll need to get kubernetes/test-infra#16212 or stop running these tests

@BenTheElder
Copy link
Member

not quite closed yet, needs to be deployed.

@BenTheElder BenTheElder reopened this Feb 11, 2020
@BenTheElder
Copy link
Member

kubernetes/test-infra#16230 is deployed now

@BenTheElder
Copy link
Member

status update: @pohly filed kubernetes/test-infra#16278, which is now deployed

we discussed in slack and confirmed on a prow build node that /dev/loopN are block now. 🤞

@BenTheElder
Copy link
Member

#1333 for more followup

brauner pushed a commit to brauner/linux that referenced this issue Apr 22, 2020
This implements loopfs, a loop device filesystem. It takes inspiration
from the binderfs filesystem I implemented about two years ago and with
which we had overall good experiences so far. Parts of it are also
based on [3] but it's mostly a new, imho cleaner approach.

Loopfs allows to create private loop devices instances to applications
for various use-cases. It covers the use-case that was expressed on-list
and in-person to get programmatic access to private loop devices for
image building in sandboxes. An illustration for this is provided in
[4].

Also loopfs is intended to provide loop devices to privileged and
unprivileged containers which has been a frequent request from various
major tools (Chromium, Kubernetes, LXD, Moby/Docker, systemd). I'm
providing a non-exhaustive list of issues and requests (cf. [5]) around
this feature mainly to illustrate that I'm not making the use-cases up.
Currently none of this can be done safely since handing a loop device
from the host into a container means that the container can see anything
that the host is doing with that loop device and what other containers
are doing with that device too. And (bind-)mounting devtmpfs inside of
containers is not secure at all so also not an option (though sometimes
done out of despair apparently).

The workloads people run in containers are supposed to be indiscernible
from workloads run on the host and the tools inside of the container are
supposed to not be required to be aware that they are running inside a
container apart from containerization tools themselves. This is
especially true when running older distros in containers that did exist
before containers were as ubiquitous as they are today. With loopfs user
can call mount -o loop and in a correctly setup container things work
the same way they would on the host. The filesystem representation
allows us to do this in a very simple way. At container setup, a
container manager can mount a private instance of loopfs somehwere, e.g.
at /dev/loopfs and then bind-mount or symlink /dev/loopfs/loop-control
to /dev/loop-control, pre allocate and symlink the number of standard
devices into their standard location and have a service file or rules in
place that symlink additionally allocated loop devices through losetup
into place as well.
With the new syscall interception logic this is also possible for
unprivileged containers. In these cases when a user calls mount -o loop
<image> <mountpoint> it will be possible to completely setup the loop
device in the container. The final mount syscall is handled through
syscall interception which we already implemented and released in
earlier kernels (see [1] and [2]) and is actively used in production
workloads. The mount is often rewritten to a fuse binary to provide safe
access for unprivileged containers.

Loopfs also allows the creation of hidden/detached dynamic loop devices
and associated mounts which also was a often issued request. With the
old mount api this can be achieved by creating a temporary loopfs and
stashing a file descriptor to the mount point and the loop-control
device and immediately unmounting the loopfs instance.  With the new
mount api a detached mount can be created directly (i.e. a mount not
visible anywhere in the filesystem). New loop devices can then be
allocated and configured. They can be mounted through
/proc/self/<fd>/<nr> with the old mount api or by using the fd directly
with the new mount api. Combined with a mount namespace this allows for
fully auto-cleaned up loop devices on program crash. This ties back to
various use-cases and is illustrated in [4].

The filesystem representation requires the standard boilerplate
filesystem code we know from other tiny filesystems. And all of
the loopfs code is hidden under a config option that defaults to false.
This specifically means, that none of the code even exists when users do
not have any use-case for loopfs.
In addition, the loopfs code does not alter how loop devices behave at
all, i.e. there are no changes to any existing workloads and I've taken
care to ifdef all loopfs specific things out.

Each loopfs mount is a separate instance. As such loop devices created
in one instance are independent of loop devices created in another
instance. This specifically entails that loop devices are only visible
in the loopfs instance they belong to.

The number of loop devices available in loopfs instances are
hierarchically limited through /proc/sys/user/max_loop_devices via the
ucount infrastructure (Thanks to David Rheinsberg for pointing out that
missing piece.). An administrator could e.g. set
echo 3 > /proc/sys/user/max_loop_devices at which point any loopfs
instance mounted by uid x can only create 3 loop devices no matter how
many loopfs instances they mount. This limit applies hierarchically to
all user namespaces.

In addition, loopfs has a "max" mount option which allows to set a limit
on the number of loop devices for a given loopfs instance. This is
mainly to cover use-cases where a single loopfs mount is shared as a
bind-mount between multiple parties that are prevented from creating
other loopfs mounts and is equivalent to the semantics of the binderfs
and devpts "max" mount option.

Note that in __loop_clr_fd() we now need not just check whether bdev is
valid but also whether bdev->bd_disk is valid. This wasn't necessary
before because in order to call LOOP_CLR_FD the loop device would need
to be open and thus bdev->bd_disk was guaranteed to be allocated. For
loopfs loop devices we allow callers to simply unlink them just as we do
for binderfs binder devices and we do also need to account for the case
where a loopfs superblock is shutdown while backing files might still be
associated with some loop devices. In such cases no bd_disk device will
be attached to bdev. This is not in itself noteworthy it's more about
documenting the "why" of the added bdev->bd_disk check for posterity.

[1]: 6a21cc5 ("seccomp: add a return code to trap to userspace")
[2]: fb3c538 ("seccomp: add SECCOMP_USER_NOTIF_FLAG_CONTINUE")
[3]: https://lore.kernel.org/lkml/[email protected]
[4]: https://gist.github.com/brauner/dcaf15e6977cc1bfadfb3965f126c02f
[5]: kubernetes-sigs/kind#1333
     kubernetes-sigs/kind#1248
     https://lists.freedesktop.org/archives/systemd-devel/2017-August/039453.html
     https://chromium.googlesource.com/chromiumos/docs/+/master/containers_and_vms.md#loop-mount
     https://gitlab.com/gitlab-com/support-forum/issues/3732
     moby/moby#27886
     https://twitter.com/_AkihiroSuda_/status/1249664478267854848
     https://serverfault.com/questions/701384/loop-device-in-a-linux-container
     https://discuss.linuxcontainers.org/t/providing-access-to-loop-and-other-devices-in-containers/1352
     https://discuss.concourse-ci.org/t/exposing-dev-loop-devices-in-privileged-mode/813
Cc: Jens Axboe <[email protected]>
Cc: Steve Barber <[email protected]>
Cc: Filipe Brandenburger <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Benjamin Elder <[email protected]>
Cc: Seth Forshee <[email protected]>
Cc: Stéphane Graber <[email protected]>
Cc: Tom Gundersen <[email protected]>
Cc: Serge Hallyn <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Christian Kellner <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: "David S. Miller" <[email protected]>
Cc: Dylan Reid <[email protected]>
Cc: David Rheinsberg <[email protected]>
Cc: Akihiro Suda <[email protected]>
Cc: Dmitry Vyukov <[email protected]>
Cc: "Rafael J. Wysocki" <[email protected]>
Signed-off-by: Christian Brauner <[email protected]>
---
/* v2 */
- David Rheinsberg <[email protected]> /
  Christian Brauner <[email protected]>:
  - Correctly cleanup loop devices that are in-use after the loopfs
    instance has been shut down. This is important for some use-cases
    that David pointed out where they effectively create a loopfs
    instance, allocate devices and drop unnecessary references to it.
- Christian Brauner <[email protected]>:
  - Replace lo_loopfs_i inode member in struct loop_device with a custom
    struct lo_info pointer which is only allocated for loopfs loop
    devices.
brauner pushed a commit to brauner/linux that referenced this issue Apr 23, 2020
This implements loopfs, a loop device filesystem. It takes inspiration
from the binderfs filesystem I implemented about two years ago and with
which we had overall good experiences so far. Parts of it are also
based on [3] but it's mostly a new, imho cleaner approach.

Loopfs allows to create private loop devices instances to applications
for various use-cases. It covers the use-case that was expressed on-list
and in-person to get programmatic access to private loop devices for
image building in sandboxes. An illustration for this is provided in
[4].

Also loopfs is intended to provide loop devices to privileged and
unprivileged containers which has been a frequent request from various
major tools (Chromium, Kubernetes, LXD, Moby/Docker, systemd). I'm
providing a non-exhaustive list of issues and requests (cf. [5]) around
this feature mainly to illustrate that I'm not making the use-cases up.
Currently none of this can be done safely since handing a loop device
from the host into a container means that the container can see anything
that the host is doing with that loop device and what other containers
are doing with that device too. And (bind-)mounting devtmpfs inside of
containers is not secure at all so also not an option (though sometimes
done out of despair apparently).

The workloads people run in containers are supposed to be indiscernible
from workloads run on the host and the tools inside of the container are
supposed to not be required to be aware that they are running inside a
container apart from containerization tools themselves. This is
especially true when running older distros in containers that did exist
before containers were as ubiquitous as they are today. With loopfs user
can call mount -o loop and in a correctly setup container things work
the same way they would on the host. The filesystem representation
allows us to do this in a very simple way. At container setup, a
container manager can mount a private instance of loopfs somehwere, e.g.
at /dev/loopfs and then bind-mount or symlink /dev/loopfs/loop-control
to /dev/loop-control, pre allocate and symlink the number of standard
devices into their standard location and have a service file or rules in
place that symlink additionally allocated loop devices through losetup
into place as well.
With the new syscall interception logic this is also possible for
unprivileged containers. In these cases when a user calls mount -o loop
<image> <mountpoint> it will be possible to completely setup the loop
device in the container. The final mount syscall is handled through
syscall interception which we already implemented and released in
earlier kernels (see [1] and [2]) and is actively used in production
workloads. The mount is often rewritten to a fuse binary to provide safe
access for unprivileged containers.

Loopfs also allows the creation of hidden/detached dynamic loop devices
and associated mounts which also was a often issued request. With the
old mount api this can be achieved by creating a temporary loopfs and
stashing a file descriptor to the mount point and the loop-control
device and immediately unmounting the loopfs instance.  With the new
mount api a detached mount can be created directly (i.e. a mount not
visible anywhere in the filesystem). New loop devices can then be
allocated and configured. They can be mounted through
/proc/self/<fd>/<nr> with the old mount api or by using the fd directly
with the new mount api. Combined with a mount namespace this allows for
fully auto-cleaned up loop devices on program crash. This ties back to
various use-cases and is illustrated in [4].

The filesystem representation requires the standard boilerplate
filesystem code we know from other tiny filesystems. And all of
the loopfs code is hidden under a config option that defaults to false.
This specifically means, that none of the code even exists when users do
not have any use-case for loopfs.
In addition, the loopfs code does not alter how loop devices behave at
all, i.e. there are no changes to any existing workloads and I've taken
care to ifdef all loopfs specific things out.

Each loopfs mount is a separate instance. As such loop devices created
in one instance are independent of loop devices created in another
instance. This specifically entails that loop devices are only visible
in the loopfs instance they belong to.

The number of loop devices available in loopfs instances are
hierarchically limited through /proc/sys/user/max_loop_devices via the
ucount infrastructure (Thanks to David Rheinsberg for pointing out that
missing piece.). An administrator could e.g. set
echo 3 > /proc/sys/user/max_loop_devices at which point any loopfs
instance mounted by uid x can only create 3 loop devices no matter how
many loopfs instances they mount. This limit applies hierarchically to
all user namespaces.

In addition, loopfs has a "max" mount option which allows to set a limit
on the number of loop devices for a given loopfs instance. This is
mainly to cover use-cases where a single loopfs mount is shared as a
bind-mount between multiple parties that are prevented from creating
other loopfs mounts and is equivalent to the semantics of the binderfs
and devpts "max" mount option.

Note that in __loop_clr_fd() we now need not just check whether bdev is
valid but also whether bdev->bd_disk is valid. This wasn't necessary
before because in order to call LOOP_CLR_FD the loop device would need
to be open and thus bdev->bd_disk was guaranteed to be allocated. For
loopfs loop devices we allow callers to simply unlink them just as we do
for binderfs binder devices and we do also need to account for the case
where a loopfs superblock is shutdown while backing files might still be
associated with some loop devices. In such cases no bd_disk device will
be attached to bdev. This is not in itself noteworthy it's more about
documenting the "why" of the added bdev->bd_disk check for posterity.

[1]: 6a21cc5 ("seccomp: add a return code to trap to userspace")
[2]: fb3c538 ("seccomp: add SECCOMP_USER_NOTIF_FLAG_CONTINUE")
[3]: https://lore.kernel.org/lkml/[email protected]
[4]: https://gist.github.com/brauner/dcaf15e6977cc1bfadfb3965f126c02f
[5]: kubernetes-sigs/kind#1333
     kubernetes-sigs/kind#1248
     https://lists.freedesktop.org/archives/systemd-devel/2017-August/039453.html
     https://chromium.googlesource.com/chromiumos/docs/+/master/containers_and_vms.md#loop-mount
     https://gitlab.com/gitlab-com/support-forum/issues/3732
     moby/moby#27886
     https://twitter.com/_AkihiroSuda_/status/1249664478267854848
     https://serverfault.com/questions/701384/loop-device-in-a-linux-container
     https://discuss.linuxcontainers.org/t/providing-access-to-loop-and-other-devices-in-containers/1352
     https://discuss.concourse-ci.org/t/exposing-dev-loop-devices-in-privileged-mode/813
Cc: Jens Axboe <[email protected]>
Cc: Steve Barber <[email protected]>
Cc: Filipe Brandenburger <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Benjamin Elder <[email protected]>
Cc: Seth Forshee <[email protected]>
Cc: Stéphane Graber <[email protected]>
Cc: Tom Gundersen <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Christian Kellner <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: "David S. Miller" <[email protected]>
Cc: Dylan Reid <[email protected]>
Cc: David Rheinsberg <[email protected]>
Cc: Akihiro Suda <[email protected]>
Cc: Dmitry Vyukov <[email protected]>
Cc: "Rafael J. Wysocki" <[email protected]>
Reviewed-by: Serge Hallyn <[email protected]>
Signed-off-by: Christian Brauner <[email protected]>
---
/* v2 */
- David Rheinsberg <[email protected]> /
  Christian Brauner <[email protected]>:
  - Correctly cleanup loop devices that are in-use after the loopfs
    instance has been shut down. This is important for some use-cases
    that David pointed out where they effectively create a loopfs
    instance, allocate devices and drop unnecessary references to it.
- Christian Brauner <[email protected]>:
  - Replace lo_loopfs_i inode member in struct loop_device with a custom
    struct lo_info pointer which is only allocated for loopfs loop
    devices.

/* v3 */
unchanged
brauner pushed a commit to brauner/linux that referenced this issue Apr 23, 2020
This implements loopfs, a loop device filesystem. It takes inspiration
from the binderfs filesystem I implemented about two years ago and with
which we had overall good experiences so far. Parts of it are also
based on [3] but it's mostly a new, imho cleaner approach.

Loopfs allows to create private loop devices instances to applications
for various use-cases. It covers the use-case that was expressed on-list
and in-person to get programmatic access to private loop devices for
image building in sandboxes. An illustration for this is provided in
[4].

Also loopfs is intended to provide loop devices to privileged and
unprivileged containers which has been a frequent request from various
major tools (Chromium, Kubernetes, LXD, Moby/Docker, systemd). I'm
providing a non-exhaustive list of issues and requests (cf. [5]) around
this feature mainly to illustrate that I'm not making the use-cases up.
Currently none of this can be done safely since handing a loop device
from the host into a container means that the container can see anything
that the host is doing with that loop device and what other containers
are doing with that device too. And (bind-)mounting devtmpfs inside of
containers is not secure at all so also not an option (though sometimes
done out of despair apparently).

The workloads people run in containers are supposed to be indiscernible
from workloads run on the host and the tools inside of the container are
supposed to not be required to be aware that they are running inside a
container apart from containerization tools themselves. This is
especially true when running older distros in containers that did exist
before containers were as ubiquitous as they are today. With loopfs user
can call mount -o loop and in a correctly setup container things work
the same way they would on the host. The filesystem representation
allows us to do this in a very simple way. At container setup, a
container manager can mount a private instance of loopfs somehwere, e.g.
at /dev/loopfs and then bind-mount or symlink /dev/loopfs/loop-control
to /dev/loop-control, pre allocate and symlink the number of standard
devices into their standard location and have a service file or rules in
place that symlink additionally allocated loop devices through losetup
into place as well.
With the new syscall interception logic this is also possible for
unprivileged containers. In these cases when a user calls mount -o loop
<image> <mountpoint> it will be possible to completely setup the loop
device in the container. The final mount syscall is handled through
syscall interception which we already implemented and released in
earlier kernels (see [1] and [2]) and is actively used in production
workloads. The mount is often rewritten to a fuse binary to provide safe
access for unprivileged containers.

Loopfs also allows the creation of hidden/detached dynamic loop devices
and associated mounts which also was a often issued request. With the
old mount api this can be achieved by creating a temporary loopfs and
stashing a file descriptor to the mount point and the loop-control
device and immediately unmounting the loopfs instance.  With the new
mount api a detached mount can be created directly (i.e. a mount not
visible anywhere in the filesystem). New loop devices can then be
allocated and configured. They can be mounted through
/proc/self/<fd>/<nr> with the old mount api or by using the fd directly
with the new mount api. Combined with a mount namespace this allows for
fully auto-cleaned up loop devices on program crash. This ties back to
various use-cases and is illustrated in [4].

The filesystem representation requires the standard boilerplate
filesystem code we know from other tiny filesystems. And all of
the loopfs code is hidden under a config option that defaults to false.
This specifically means, that none of the code even exists when users do
not have any use-case for loopfs.
In addition, the loopfs code does not alter how loop devices behave at
all, i.e. there are no changes to any existing workloads and I've taken
care to ifdef all loopfs specific things out.

Each loopfs mount is a separate instance. As such loop devices created
in one instance are independent of loop devices created in another
instance. This specifically entails that loop devices are only visible
in the loopfs instance they belong to.

The number of loop devices available in loopfs instances are
hierarchically limited through /proc/sys/user/max_loop_devices via the
ucount infrastructure (Thanks to David Rheinsberg for pointing out that
missing piece.). An administrator could e.g. set
echo 3 > /proc/sys/user/max_loop_devices at which point any loopfs
instance mounted by uid x can only create 3 loop devices no matter how
many loopfs instances they mount. This limit applies hierarchically to
all user namespaces.

In addition, loopfs has a "max" mount option which allows to set a limit
on the number of loop devices for a given loopfs instance. This is
mainly to cover use-cases where a single loopfs mount is shared as a
bind-mount between multiple parties that are prevented from creating
other loopfs mounts and is equivalent to the semantics of the binderfs
and devpts "max" mount option.

Note that in __loop_clr_fd() we now need not just check whether bdev is
valid but also whether bdev->bd_disk is valid. This wasn't necessary
before because in order to call LOOP_CLR_FD the loop device would need
to be open and thus bdev->bd_disk was guaranteed to be allocated. For
loopfs loop devices we allow callers to simply unlink them just as we do
for binderfs binder devices and we do also need to account for the case
where a loopfs superblock is shutdown while backing files might still be
associated with some loop devices. In such cases no bd_disk device will
be attached to bdev. This is not in itself noteworthy it's more about
documenting the "why" of the added bdev->bd_disk check for posterity.

[1]: 6a21cc5 ("seccomp: add a return code to trap to userspace")
[2]: fb3c538 ("seccomp: add SECCOMP_USER_NOTIF_FLAG_CONTINUE")
[3]: https://lore.kernel.org/lkml/[email protected]
[4]: https://gist.github.com/brauner/dcaf15e6977cc1bfadfb3965f126c02f
[5]: kubernetes-sigs/kind#1333
     kubernetes-sigs/kind#1248
     https://lists.freedesktop.org/archives/systemd-devel/2017-August/039453.html
     https://chromium.googlesource.com/chromiumos/docs/+/master/containers_and_vms.md#loop-mount
     https://gitlab.com/gitlab-com/support-forum/issues/3732
     moby/moby#27886
     https://twitter.com/_AkihiroSuda_/status/1249664478267854848
     https://serverfault.com/questions/701384/loop-device-in-a-linux-container
     https://discuss.linuxcontainers.org/t/providing-access-to-loop-and-other-devices-in-containers/1352
     https://discuss.concourse-ci.org/t/exposing-dev-loop-devices-in-privileged-mode/813
Cc: Jens Axboe <[email protected]>
Cc: Steve Barber <[email protected]>
Cc: Filipe Brandenburger <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Benjamin Elder <[email protected]>
Cc: Seth Forshee <[email protected]>
Cc: Stéphane Graber <[email protected]>
Cc: Tom Gundersen <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Christian Kellner <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: "David S. Miller" <[email protected]>
Cc: Dylan Reid <[email protected]>
Cc: David Rheinsberg <[email protected]>
Cc: Akihiro Suda <[email protected]>
Cc: Dmitry Vyukov <[email protected]>
Cc: "Rafael J. Wysocki" <[email protected]>
Reviewed-by: Serge Hallyn <[email protected]>
Signed-off-by: Christian Brauner <[email protected]>
---
/* v2 */
- David Rheinsberg <[email protected]> /
  Christian Brauner <[email protected]>:
  - Correctly cleanup loop devices that are in-use after the loopfs
    instance has been shut down. This is important for some use-cases
    that David pointed out where they effectively create a loopfs
    instance, allocate devices and drop unnecessary references to it.
- Christian Brauner <[email protected]>:
  - Replace lo_loopfs_i inode member in struct loop_device with a custom
    struct lo_info pointer which is only allocated for loopfs loop
    devices.

/* v3 */
- Christian Brauner <[email protected]>:
  - Fix loopfs_access() to not care about non-loopfs devices.
brauner pushed a commit to brauner/linux that referenced this issue Apr 23, 2020
This implements loopfs, a loop device filesystem. It takes inspiration
from the binderfs filesystem I implemented about two years ago and with
which we had overall good experiences so far. Parts of it are also
based on [3] but it's mostly a new, imho cleaner approach.

Loopfs allows to create private loop devices instances to applications
for various use-cases. It covers the use-case that was expressed on-list
and in-person to get programmatic access to private loop devices for
image building in sandboxes. An illustration for this is provided in
[4].

Also loopfs is intended to provide loop devices to privileged and
unprivileged containers which has been a frequent request from various
major tools (Chromium, Kubernetes, LXD, Moby/Docker, systemd). I'm
providing a non-exhaustive list of issues and requests (cf. [5]) around
this feature mainly to illustrate that I'm not making the use-cases up.
Currently none of this can be done safely since handing a loop device
from the host into a container means that the container can see anything
that the host is doing with that loop device and what other containers
are doing with that device too. And (bind-)mounting devtmpfs inside of
containers is not secure at all so also not an option (though sometimes
done out of despair apparently).

The workloads people run in containers are supposed to be indiscernible
from workloads run on the host and the tools inside of the container are
supposed to not be required to be aware that they are running inside a
container apart from containerization tools themselves. This is
especially true when running older distros in containers that did exist
before containers were as ubiquitous as they are today. With loopfs user
can call mount -o loop and in a correctly setup container things work
the same way they would on the host. The filesystem representation
allows us to do this in a very simple way. At container setup, a
container manager can mount a private instance of loopfs somehwere, e.g.
at /dev/loopfs and then bind-mount or symlink /dev/loopfs/loop-control
to /dev/loop-control, pre allocate and symlink the number of standard
devices into their standard location and have a service file or rules in
place that symlink additionally allocated loop devices through losetup
into place as well.
With the new syscall interception logic this is also possible for
unprivileged containers. In these cases when a user calls mount -o loop
<image> <mountpoint> it will be possible to completely setup the loop
device in the container. The final mount syscall is handled through
syscall interception which we already implemented and released in
earlier kernels (see [1] and [2]) and is actively used in production
workloads. The mount is often rewritten to a fuse binary to provide safe
access for unprivileged containers.

Loopfs also allows the creation of hidden/detached dynamic loop devices
and associated mounts which also was a often issued request. With the
old mount api this can be achieved by creating a temporary loopfs and
stashing a file descriptor to the mount point and the loop-control
device and immediately unmounting the loopfs instance.  With the new
mount api a detached mount can be created directly (i.e. a mount not
visible anywhere in the filesystem). New loop devices can then be
allocated and configured. They can be mounted through
/proc/self/<fd>/<nr> with the old mount api or by using the fd directly
with the new mount api. Combined with a mount namespace this allows for
fully auto-cleaned up loop devices on program crash. This ties back to
various use-cases and is illustrated in [4].

The filesystem representation requires the standard boilerplate
filesystem code we know from other tiny filesystems. And all of
the loopfs code is hidden under a config option that defaults to false.
This specifically means, that none of the code even exists when users do
not have any use-case for loopfs.
In addition, the loopfs code does not alter how loop devices behave at
all, i.e. there are no changes to any existing workloads and I've taken
care to ifdef all loopfs specific things out.

Each loopfs mount is a separate instance. As such loop devices created
in one instance are independent of loop devices created in another
instance. This specifically entails that loop devices are only visible
in the loopfs instance they belong to.

The number of loop devices available in loopfs instances are
hierarchically limited through /proc/sys/user/max_loop_devices via the
ucount infrastructure (Thanks to David Rheinsberg for pointing out that
missing piece.). An administrator could e.g. set
echo 3 > /proc/sys/user/max_loop_devices at which point any loopfs
instance mounted by uid x can only create 3 loop devices no matter how
many loopfs instances they mount. This limit applies hierarchically to
all user namespaces.

In addition, loopfs has a "max" mount option which allows to set a limit
on the number of loop devices for a given loopfs instance. This is
mainly to cover use-cases where a single loopfs mount is shared as a
bind-mount between multiple parties that are prevented from creating
other loopfs mounts and is equivalent to the semantics of the binderfs
and devpts "max" mount option.

Note that in __loop_clr_fd() we now need not just check whether bdev is
valid but also whether bdev->bd_disk is valid. This wasn't necessary
before because in order to call LOOP_CLR_FD the loop device would need
to be open and thus bdev->bd_disk was guaranteed to be allocated. For
loopfs loop devices we allow callers to simply unlink them just as we do
for binderfs binder devices and we do also need to account for the case
where a loopfs superblock is shutdown while backing files might still be
associated with some loop devices. In such cases no bd_disk device will
be attached to bdev. This is not in itself noteworthy it's more about
documenting the "why" of the added bdev->bd_disk check for posterity.

[1]: 6a21cc5 ("seccomp: add a return code to trap to userspace")
[2]: fb3c538 ("seccomp: add SECCOMP_USER_NOTIF_FLAG_CONTINUE")
[3]: https://lore.kernel.org/lkml/[email protected]
[4]: https://gist.github.com/brauner/dcaf15e6977cc1bfadfb3965f126c02f
[5]: kubernetes-sigs/kind#1333
     kubernetes-sigs/kind#1248
     https://lists.freedesktop.org/archives/systemd-devel/2017-August/039453.html
     https://chromium.googlesource.com/chromiumos/docs/+/master/containers_and_vms.md#loop-mount
     https://gitlab.com/gitlab-com/support-forum/issues/3732
     moby/moby#27886
     https://twitter.com/_AkihiroSuda_/status/1249664478267854848
     https://serverfault.com/questions/701384/loop-device-in-a-linux-container
     https://discuss.linuxcontainers.org/t/providing-access-to-loop-and-other-devices-in-containers/1352
     https://discuss.concourse-ci.org/t/exposing-dev-loop-devices-in-privileged-mode/813
Cc: Jens Axboe <[email protected]>
Cc: Steve Barber <[email protected]>
Cc: Filipe Brandenburger <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Benjamin Elder <[email protected]>
Cc: Seth Forshee <[email protected]>
Cc: Stéphane Graber <[email protected]>
Cc: Tom Gundersen <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Christian Kellner <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: "David S. Miller" <[email protected]>
Cc: Dylan Reid <[email protected]>
Cc: David Rheinsberg <[email protected]>
Cc: Akihiro Suda <[email protected]>
Cc: Dmitry Vyukov <[email protected]>
Cc: "Rafael J. Wysocki" <[email protected]>
Reviewed-by: Serge Hallyn <[email protected]>
Signed-off-by: Christian Brauner <[email protected]>
---
/* v2 */
- David Rheinsberg <[email protected]> /
  Christian Brauner <[email protected]>:
  - Correctly cleanup loop devices that are in-use after the loopfs
    instance has been shut down. This is important for some use-cases
    that David pointed out where they effectively create a loopfs
    instance, allocate devices and drop unnecessary references to it.
- Christian Brauner <[email protected]>:
  - Replace lo_loopfs_i inode member in struct loop_device with a custom
    struct lo_info pointer which is only allocated for loopfs loop
    devices.

/* v3 */
- Christian Brauner <[email protected]>:
  - Fix loopfs_access() to not care about non-loopfs devices.
brauner pushed a commit to brauner/linux that referenced this issue Apr 23, 2020
This implements loopfs, a loop device filesystem. It takes inspiration
from the binderfs filesystem I implemented about two years ago and with
which we had overall good experiences so far. Parts of it are also
based on [3] but it's mostly a new, imho cleaner approach.

Loopfs allows to create private loop devices instances to applications
for various use-cases. It covers the use-case that was expressed on-list
and in-person to get programmatic access to private loop devices for
image building in sandboxes. An illustration for this is provided in
[4].

Also loopfs is intended to provide loop devices to privileged and
unprivileged containers which has been a frequent request from various
major tools (Chromium, Kubernetes, LXD, Moby/Docker, systemd). I'm
providing a non-exhaustive list of issues and requests (cf. [5]) around
this feature mainly to illustrate that I'm not making the use-cases up.
Currently none of this can be done safely since handing a loop device
from the host into a container means that the container can see anything
that the host is doing with that loop device and what other containers
are doing with that device too. And (bind-)mounting devtmpfs inside of
containers is not secure at all so also not an option (though sometimes
done out of despair apparently).

The workloads people run in containers are supposed to be indiscernible
from workloads run on the host and the tools inside of the container are
supposed to not be required to be aware that they are running inside a
container apart from containerization tools themselves. This is
especially true when running older distros in containers that did exist
before containers were as ubiquitous as they are today. With loopfs user
can call mount -o loop and in a correctly setup container things work
the same way they would on the host. The filesystem representation
allows us to do this in a very simple way. At container setup, a
container manager can mount a private instance of loopfs somehwere, e.g.
at /dev/loopfs and then bind-mount or symlink /dev/loopfs/loop-control
to /dev/loop-control, pre allocate and symlink the number of standard
devices into their standard location and have a service file or rules in
place that symlink additionally allocated loop devices through losetup
into place as well.
With the new syscall interception logic this is also possible for
unprivileged containers. In these cases when a user calls mount -o loop
<image> <mountpoint> it will be possible to completely setup the loop
device in the container. The final mount syscall is handled through
syscall interception which we already implemented and released in
earlier kernels (see [1] and [2]) and is actively used in production
workloads. The mount is often rewritten to a fuse binary to provide safe
access for unprivileged containers.

Loopfs also allows the creation of hidden/detached dynamic loop devices
and associated mounts which also was a often issued request. With the
old mount api this can be achieved by creating a temporary loopfs and
stashing a file descriptor to the mount point and the loop-control
device and immediately unmounting the loopfs instance.  With the new
mount api a detached mount can be created directly (i.e. a mount not
visible anywhere in the filesystem). New loop devices can then be
allocated and configured. They can be mounted through
/proc/self/<fd>/<nr> with the old mount api or by using the fd directly
with the new mount api. Combined with a mount namespace this allows for
fully auto-cleaned up loop devices on program crash. This ties back to
various use-cases and is illustrated in [4].

The filesystem representation requires the standard boilerplate
filesystem code we know from other tiny filesystems. And all of
the loopfs code is hidden under a config option that defaults to false.
This specifically means, that none of the code even exists when users do
not have any use-case for loopfs.
In addition, the loopfs code does not alter how loop devices behave at
all, i.e. there are no changes to any existing workloads and I've taken
care to ifdef all loopfs specific things out.

Each loopfs mount is a separate instance. As such loop devices created
in one instance are independent of loop devices created in another
instance. This specifically entails that loop devices are only visible
in the loopfs instance they belong to.

The number of loop devices available in loopfs instances are
hierarchically limited through /proc/sys/user/max_loop_devices via the
ucount infrastructure (Thanks to David Rheinsberg for pointing out that
missing piece.). An administrator could e.g. set
echo 3 > /proc/sys/user/max_loop_devices at which point any loopfs
instance mounted by uid x can only create 3 loop devices no matter how
many loopfs instances they mount. This limit applies hierarchically to
all user namespaces.

In addition, loopfs has a "max" mount option which allows to set a limit
on the number of loop devices for a given loopfs instance. This is
mainly to cover use-cases where a single loopfs mount is shared as a
bind-mount between multiple parties that are prevented from creating
other loopfs mounts and is equivalent to the semantics of the binderfs
and devpts "max" mount option.

Note that in __loop_clr_fd() we now need not just check whether bdev is
valid but also whether bdev->bd_disk is valid. This wasn't necessary
before because in order to call LOOP_CLR_FD the loop device would need
to be open and thus bdev->bd_disk was guaranteed to be allocated. For
loopfs loop devices we allow callers to simply unlink them just as we do
for binderfs binder devices and we do also need to account for the case
where a loopfs superblock is shutdown while backing files might still be
associated with some loop devices. In such cases no bd_disk device will
be attached to bdev. This is not in itself noteworthy it's more about
documenting the "why" of the added bdev->bd_disk check for posterity.

[1]: 6a21cc5 ("seccomp: add a return code to trap to userspace")
[2]: fb3c538 ("seccomp: add SECCOMP_USER_NOTIF_FLAG_CONTINUE")
[3]: https://lore.kernel.org/lkml/[email protected]
[4]: https://gist.github.com/brauner/dcaf15e6977cc1bfadfb3965f126c02f
[5]: kubernetes-sigs/kind#1333
     kubernetes-sigs/kind#1248
     https://lists.freedesktop.org/archives/systemd-devel/2017-August/039453.html
     https://chromium.googlesource.com/chromiumos/docs/+/master/containers_and_vms.md#loop-mount
     https://gitlab.com/gitlab-com/support-forum/issues/3732
     moby/moby#27886
     https://twitter.com/_AkihiroSuda_/status/1249664478267854848
     https://serverfault.com/questions/701384/loop-device-in-a-linux-container
     https://discuss.linuxcontainers.org/t/providing-access-to-loop-and-other-devices-in-containers/1352
     https://discuss.concourse-ci.org/t/exposing-dev-loop-devices-in-privileged-mode/813
Cc: Jens Axboe <[email protected]>
Cc: Steve Barber <[email protected]>
Cc: Filipe Brandenburger <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Benjamin Elder <[email protected]>
Cc: Seth Forshee <[email protected]>
Cc: Stéphane Graber <[email protected]>
Cc: Tom Gundersen <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Christian Kellner <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: "David S. Miller" <[email protected]>
Cc: Dylan Reid <[email protected]>
Cc: David Rheinsberg <[email protected]>
Cc: Akihiro Suda <[email protected]>
Cc: Dmitry Vyukov <[email protected]>
Cc: "Rafael J. Wysocki" <[email protected]>
Reviewed-by: Serge Hallyn <[email protected]>
Signed-off-by: Christian Brauner <[email protected]>
---
/* v2 */
- David Rheinsberg <[email protected]> /
  Christian Brauner <[email protected]>:
  - Correctly cleanup loop devices that are in-use after the loopfs
    instance has been shut down. This is important for some use-cases
    that David pointed out where they effectively create a loopfs
    instance, allocate devices and drop unnecessary references to it.
- Christian Brauner <[email protected]>:
  - Replace lo_loopfs_i inode member in struct loop_device with a custom
    struct lo_info pointer which is only allocated for loopfs loop
    devices.

/* v3 */
- Christian Brauner <[email protected]>:
  - Fix loopfs_access() to not care about non-loopfs devices.
- David Rheinsberg <[email protected]> /
  Serge Hallyn <[email protected]>:
  - Remove "max" mount option.
brauner pushed a commit to brauner/linux that referenced this issue Apr 23, 2020
This implements loopfs, a loop device filesystem. It takes inspiration
from the binderfs filesystem I implemented about two years ago and with
which we had overall good experiences so far. Parts of it are also
based on [3] but it's mostly a new, imho cleaner approach.

Loopfs allows to create private loop devices instances to applications
for various use-cases. It covers the use-case that was expressed on-list
and in-person to get programmatic access to private loop devices for
image building in sandboxes. An illustration for this is provided in
[4].

Also loopfs is intended to provide loop devices to privileged and
unprivileged containers which has been a frequent request from various
major tools (Chromium, Kubernetes, LXD, Moby/Docker, systemd). I'm
providing a non-exhaustive list of issues and requests (cf. [5]) around
this feature mainly to illustrate that I'm not making the use-cases up.
Currently none of this can be done safely since handing a loop device
from the host into a container means that the container can see anything
that the host is doing with that loop device and what other containers
are doing with that device too. And (bind-)mounting devtmpfs inside of
containers is not secure at all so also not an option (though sometimes
done out of despair apparently).

The workloads people run in containers are supposed to be indiscernible
from workloads run on the host and the tools inside of the container are
supposed to not be required to be aware that they are running inside a
container apart from containerization tools themselves. This is
especially true when running older distros in containers that did exist
before containers were as ubiquitous as they are today. With loopfs user
can call mount -o loop and in a correctly setup container things work
the same way they would on the host. The filesystem representation
allows us to do this in a very simple way. At container setup, a
container manager can mount a private instance of loopfs somehwere, e.g.
at /dev/loopfs and then bind-mount or symlink /dev/loopfs/loop-control
to /dev/loop-control, pre allocate and symlink the number of standard
devices into their standard location and have a service file or rules in
place that symlink additionally allocated loop devices through losetup
into place as well.
With the new syscall interception logic this is also possible for
unprivileged containers. In these cases when a user calls mount -o loop
<image> <mountpoint> it will be possible to completely setup the loop
device in the container. The final mount syscall is handled through
syscall interception which we already implemented and released in
earlier kernels (see [1] and [2]) and is actively used in production
workloads. The mount is often rewritten to a fuse binary to provide safe
access for unprivileged containers.

Loopfs also allows the creation of hidden/detached dynamic loop devices
and associated mounts which also was a often issued request. With the
old mount api this can be achieved by creating a temporary loopfs and
stashing a file descriptor to the mount point and the loop-control
device and immediately unmounting the loopfs instance.  With the new
mount api a detached mount can be created directly (i.e. a mount not
visible anywhere in the filesystem). New loop devices can then be
allocated and configured. They can be mounted through
/proc/self/<fd>/<nr> with the old mount api or by using the fd directly
with the new mount api. Combined with a mount namespace this allows for
fully auto-cleaned up loop devices on program crash. This ties back to
various use-cases and is illustrated in [4].

The filesystem representation requires the standard boilerplate
filesystem code we know from other tiny filesystems. And all of
the loopfs code is hidden under a config option that defaults to false.
This specifically means, that none of the code even exists when users do
not have any use-case for loopfs.
In addition, the loopfs code does not alter how loop devices behave at
all, i.e. there are no changes to any existing workloads and I've taken
care to ifdef all loopfs specific things out.

Each loopfs mount is a separate instance. As such loop devices created
in one instance are independent of loop devices created in another
instance. This specifically entails that loop devices are only visible
in the loopfs instance they belong to.

The number of loop devices available in loopfs instances are
hierarchically limited through /proc/sys/user/max_loop_devices via the
ucount infrastructure (Thanks to David Rheinsberg for pointing out that
missing piece.). An administrator could e.g. set
echo 3 > /proc/sys/user/max_loop_devices at which point any loopfs
instance mounted by uid x can only create 3 loop devices no matter how
many loopfs instances they mount. This limit applies hierarchically to
all user namespaces.

In addition, loopfs has a "max" mount option which allows to set a limit
on the number of loop devices for a given loopfs instance. This is
mainly to cover use-cases where a single loopfs mount is shared as a
bind-mount between multiple parties that are prevented from creating
other loopfs mounts and is equivalent to the semantics of the binderfs
and devpts "max" mount option.

Note that in __loop_clr_fd() we now need not just check whether bdev is
valid but also whether bdev->bd_disk is valid. This wasn't necessary
before because in order to call LOOP_CLR_FD the loop device would need
to be open and thus bdev->bd_disk was guaranteed to be allocated. For
loopfs loop devices we allow callers to simply unlink them just as we do
for binderfs binder devices and we do also need to account for the case
where a loopfs superblock is shutdown while backing files might still be
associated with some loop devices. In such cases no bd_disk device will
be attached to bdev. This is not in itself noteworthy it's more about
documenting the "why" of the added bdev->bd_disk check for posterity.

[1]: 6a21cc5 ("seccomp: add a return code to trap to userspace")
[2]: fb3c538 ("seccomp: add SECCOMP_USER_NOTIF_FLAG_CONTINUE")
[3]: https://lore.kernel.org/lkml/[email protected]
[4]: https://gist.github.com/brauner/dcaf15e6977cc1bfadfb3965f126c02f
[5]: kubernetes-sigs/kind#1333
     kubernetes-sigs/kind#1248
     https://lists.freedesktop.org/archives/systemd-devel/2017-August/039453.html
     https://chromium.googlesource.com/chromiumos/docs/+/master/containers_and_vms.md#loop-mount
     https://gitlab.com/gitlab-com/support-forum/issues/3732
     moby/moby#27886
     https://twitter.com/_AkihiroSuda_/status/1249664478267854848
     https://serverfault.com/questions/701384/loop-device-in-a-linux-container
     https://discuss.linuxcontainers.org/t/providing-access-to-loop-and-other-devices-in-containers/1352
     https://discuss.concourse-ci.org/t/exposing-dev-loop-devices-in-privileged-mode/813
Cc: Jens Axboe <[email protected]>
Cc: Steve Barber <[email protected]>
Cc: Filipe Brandenburger <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Benjamin Elder <[email protected]>
Cc: Seth Forshee <[email protected]>
Cc: Stéphane Graber <[email protected]>
Cc: Tom Gundersen <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Christian Kellner <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: "David S. Miller" <[email protected]>
Cc: Dylan Reid <[email protected]>
Cc: David Rheinsberg <[email protected]>
Cc: Akihiro Suda <[email protected]>
Cc: Dmitry Vyukov <[email protected]>
Cc: "Rafael J. Wysocki" <[email protected]>
Reviewed-by: Serge Hallyn <[email protected]>
Signed-off-by: Christian Brauner <[email protected]>
---
/* v2 */
- David Rheinsberg <[email protected]> /
  Christian Brauner <[email protected]>:
  - Correctly cleanup loop devices that are in-use after the loopfs
    instance has been shut down. This is important for some use-cases
    that David pointed out where they effectively create a loopfs
    instance, allocate devices and drop unnecessary references to it.
- Christian Brauner <[email protected]>:
  - Replace lo_loopfs_i inode member in struct loop_device with a custom
    struct lo_info pointer which is only allocated for loopfs loop
    devices.

/* v3 */
- Christian Brauner <[email protected]>:
  - Fix loopfs_access() to not care about non-loopfs devices.
- David Rheinsberg <[email protected]> /
  Serge Hallyn <[email protected]>:
  - Remove "max" mount option.
brauner pushed a commit to brauner/linux that referenced this issue Apr 23, 2020
This implements loopfs, a loop device filesystem. It takes inspiration
from the binderfs filesystem I implemented about two years ago and with
which we had overall good experiences so far. Parts of it are also
based on [3] but it's mostly a new, imho cleaner approach.

Loopfs allows to create private loop devices instances to applications
for various use-cases. It covers the use-case that was expressed on-list
and in-person to get programmatic access to private loop devices for
image building in sandboxes. An illustration for this is provided in
[4].

Also loopfs is intended to provide loop devices to privileged and
unprivileged containers which has been a frequent request from various
major tools (Chromium, Kubernetes, LXD, Moby/Docker, systemd). I'm
providing a non-exhaustive list of issues and requests (cf. [5]) around
this feature mainly to illustrate that I'm not making the use-cases up.
Currently none of this can be done safely since handing a loop device
from the host into a container means that the container can see anything
that the host is doing with that loop device and what other containers
are doing with that device too. And (bind-)mounting devtmpfs inside of
containers is not secure at all so also not an option (though sometimes
done out of despair apparently).

The workloads people run in containers are supposed to be indiscernible
from workloads run on the host and the tools inside of the container are
supposed to not be required to be aware that they are running inside a
container apart from containerization tools themselves. This is
especially true when running older distros in containers that did exist
before containers were as ubiquitous as they are today. With loopfs user
can call mount -o loop and in a correctly setup container things work
the same way they would on the host. The filesystem representation
allows us to do this in a very simple way. At container setup, a
container manager can mount a private instance of loopfs somehwere, e.g.
at /dev/loopfs and then bind-mount or symlink /dev/loopfs/loop-control
to /dev/loop-control, pre allocate and symlink the number of standard
devices into their standard location and have a service file or rules in
place that symlink additionally allocated loop devices through losetup
into place as well.
With the new syscall interception logic this is also possible for
unprivileged containers. In these cases when a user calls mount -o loop
<image> <mountpoint> it will be possible to completely setup the loop
device in the container. The final mount syscall is handled through
syscall interception which we already implemented and released in
earlier kernels (see [1] and [2]) and is actively used in production
workloads. The mount is often rewritten to a fuse binary to provide safe
access for unprivileged containers.

Loopfs also allows the creation of hidden/detached dynamic loop devices
and associated mounts which also was a often issued request. With the
old mount api this can be achieved by creating a temporary loopfs and
stashing a file descriptor to the mount point and the loop-control
device and immediately unmounting the loopfs instance.  With the new
mount api a detached mount can be created directly (i.e. a mount not
visible anywhere in the filesystem). New loop devices can then be
allocated and configured. They can be mounted through
/proc/self/<fd>/<nr> with the old mount api or by using the fd directly
with the new mount api. Combined with a mount namespace this allows for
fully auto-cleaned up loop devices on program crash. This ties back to
various use-cases and is illustrated in [4].

The filesystem representation requires the standard boilerplate
filesystem code we know from other tiny filesystems. And all of
the loopfs code is hidden under a config option that defaults to false.
This specifically means, that none of the code even exists when users do
not have any use-case for loopfs.
In addition, the loopfs code does not alter how loop devices behave at
all, i.e. there are no changes to any existing workloads and I've taken
care to ifdef all loopfs specific things out.

Each loopfs mount is a separate instance. As such loop devices created
in one instance are independent of loop devices created in another
instance. This specifically entails that loop devices are only visible
in the loopfs instance they belong to.

The number of loop devices available in loopfs instances are
hierarchically limited through /proc/sys/user/max_loop_devices via the
ucount infrastructure (Thanks to David Rheinsberg for pointing out that
missing piece.). An administrator could e.g. set
echo 3 > /proc/sys/user/max_loop_devices at which point any loopfs
instance mounted by uid x can only create 3 loop devices no matter how
many loopfs instances they mount. This limit applies hierarchically to
all user namespaces.

In addition, loopfs has a "max" mount option which allows to set a limit
on the number of loop devices for a given loopfs instance. This is
mainly to cover use-cases where a single loopfs mount is shared as a
bind-mount between multiple parties that are prevented from creating
other loopfs mounts and is equivalent to the semantics of the binderfs
and devpts "max" mount option.

Note that in __loop_clr_fd() we now need not just check whether bdev is
valid but also whether bdev->bd_disk is valid. This wasn't necessary
before because in order to call LOOP_CLR_FD the loop device would need
to be open and thus bdev->bd_disk was guaranteed to be allocated. For
loopfs loop devices we allow callers to simply unlink them just as we do
for binderfs binder devices and we do also need to account for the case
where a loopfs superblock is shutdown while backing files might still be
associated with some loop devices. In such cases no bd_disk device will
be attached to bdev. This is not in itself noteworthy it's more about
documenting the "why" of the added bdev->bd_disk check for posterity.

[1]: 6a21cc5 ("seccomp: add a return code to trap to userspace")
[2]: fb3c538 ("seccomp: add SECCOMP_USER_NOTIF_FLAG_CONTINUE")
[3]: https://lore.kernel.org/lkml/[email protected]
[4]: https://gist.github.com/brauner/dcaf15e6977cc1bfadfb3965f126c02f
[5]: kubernetes-sigs/kind#1333
     kubernetes-sigs/kind#1248
     https://lists.freedesktop.org/archives/systemd-devel/2017-August/039453.html
     https://chromium.googlesource.com/chromiumos/docs/+/master/containers_and_vms.md#loop-mount
     https://gitlab.com/gitlab-com/support-forum/issues/3732
     moby/moby#27886
     https://twitter.com/_AkihiroSuda_/status/1249664478267854848
     https://serverfault.com/questions/701384/loop-device-in-a-linux-container
     https://discuss.linuxcontainers.org/t/providing-access-to-loop-and-other-devices-in-containers/1352
     https://discuss.concourse-ci.org/t/exposing-dev-loop-devices-in-privileged-mode/813
Cc: Jens Axboe <[email protected]>
Cc: Steve Barber <[email protected]>
Cc: Filipe Brandenburger <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Benjamin Elder <[email protected]>
Cc: Seth Forshee <[email protected]>
Cc: Stéphane Graber <[email protected]>
Cc: Tom Gundersen <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Christian Kellner <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: "David S. Miller" <[email protected]>
Cc: Dylan Reid <[email protected]>
Cc: David Rheinsberg <[email protected]>
Cc: Akihiro Suda <[email protected]>
Cc: Dmitry Vyukov <[email protected]>
Cc: "Rafael J. Wysocki" <[email protected]>
Reviewed-by: Serge Hallyn <[email protected]>
Signed-off-by: Christian Brauner <[email protected]>
---
/* v2 */
- David Rheinsberg <[email protected]> /
  Christian Brauner <[email protected]>:
  - Correctly cleanup loop devices that are in-use after the loopfs
    instance has been shut down. This is important for some use-cases
    that David pointed out where they effectively create a loopfs
    instance, allocate devices and drop unnecessary references to it.
- Christian Brauner <[email protected]>:
  - Replace lo_loopfs_i inode member in struct loop_device with a custom
    struct lo_info pointer which is only allocated for loopfs loop
    devices.

/* v3 */
- Christian Brauner <[email protected]>:
  - Fix loopfs_access() to not care about non-loopfs devices.
- David Rheinsberg <[email protected]> /
  Serge Hallyn <[email protected]>:
  - Remove "max" mount option.
brauner pushed a commit to brauner/linux that referenced this issue Apr 23, 2020
This implements loopfs, a loop device filesystem. It takes inspiration
from the binderfs filesystem I implemented about two years ago and with
which we had overall good experiences so far. Parts of it are also
based on [3] but it's mostly a new, imho cleaner approach.

Loopfs allows to create private loop devices instances to applications
for various use-cases. It covers the use-case that was expressed on-list
and in-person to get programmatic access to private loop devices for
image building in sandboxes. An illustration for this is provided in
[4].

Also loopfs is intended to provide loop devices to privileged and
unprivileged containers which has been a frequent request from various
major tools (Chromium, Kubernetes, LXD, Moby/Docker, systemd). I'm
providing a non-exhaustive list of issues and requests (cf. [5]) around
this feature mainly to illustrate that I'm not making the use-cases up.
Currently none of this can be done safely since handing a loop device
from the host into a container means that the container can see anything
that the host is doing with that loop device and what other containers
are doing with that device too. And (bind-)mounting devtmpfs inside of
containers is not secure at all so also not an option (though sometimes
done out of despair apparently).

The workloads people run in containers are supposed to be indiscernible
from workloads run on the host and the tools inside of the container are
supposed to not be required to be aware that they are running inside a
container apart from containerization tools themselves. This is
especially true when running older distros in containers that did exist
before containers were as ubiquitous as they are today. With loopfs user
can call mount -o loop and in a correctly setup container things work
the same way they would on the host. The filesystem representation
allows us to do this in a very simple way. At container setup, a
container manager can mount a private instance of loopfs somehwere, e.g.
at /dev/loopfs and then bind-mount or symlink /dev/loopfs/loop-control
to /dev/loop-control, pre allocate and symlink the number of standard
devices into their standard location and have a service file or rules in
place that symlink additionally allocated loop devices through losetup
into place as well.
With the new syscall interception logic this is also possible for
unprivileged containers. In these cases when a user calls mount -o loop
<image> <mountpoint> it will be possible to completely setup the loop
device in the container. The final mount syscall is handled through
syscall interception which we already implemented and released in
earlier kernels (see [1] and [2]) and is actively used in production
workloads. The mount is often rewritten to a fuse binary to provide safe
access for unprivileged containers.

Loopfs also allows the creation of hidden/detached dynamic loop devices
and associated mounts which also was a often issued request. With the
old mount api this can be achieved by creating a temporary loopfs and
stashing a file descriptor to the mount point and the loop-control
device and immediately unmounting the loopfs instance.  With the new
mount api a detached mount can be created directly (i.e. a mount not
visible anywhere in the filesystem). New loop devices can then be
allocated and configured. They can be mounted through
/proc/self/<fd>/<nr> with the old mount api or by using the fd directly
with the new mount api. Combined with a mount namespace this allows for
fully auto-cleaned up loop devices on program crash. This ties back to
various use-cases and is illustrated in [4].

The filesystem representation requires the standard boilerplate
filesystem code we know from other tiny filesystems. And all of
the loopfs code is hidden under a config option that defaults to false.
This specifically means, that none of the code even exists when users do
not have any use-case for loopfs.
In addition, the loopfs code does not alter how loop devices behave at
all, i.e. there are no changes to any existing workloads and I've taken
care to ifdef all loopfs specific things out.

Each loopfs mount is a separate instance. As such loop devices created
in one instance are independent of loop devices created in another
instance. This specifically entails that loop devices are only visible
in the loopfs instance they belong to.

The number of loop devices available in loopfs instances are
hierarchically limited through /proc/sys/user/max_loop_devices via the
ucount infrastructure (Thanks to David Rheinsberg for pointing out that
missing piece.). An administrator could e.g. set
echo 3 > /proc/sys/user/max_loop_devices at which point any loopfs
instance mounted by uid x can only create 3 loop devices no matter how
many loopfs instances they mount. This limit applies hierarchically to
all user namespaces.

In addition, loopfs has a "max" mount option which allows to set a limit
on the number of loop devices for a given loopfs instance. This is
mainly to cover use-cases where a single loopfs mount is shared as a
bind-mount between multiple parties that are prevented from creating
other loopfs mounts and is equivalent to the semantics of the binderfs
and devpts "max" mount option.

Note that in __loop_clr_fd() we now need not just check whether bdev is
valid but also whether bdev->bd_disk is valid. This wasn't necessary
before because in order to call LOOP_CLR_FD the loop device would need
to be open and thus bdev->bd_disk was guaranteed to be allocated. For
loopfs loop devices we allow callers to simply unlink them just as we do
for binderfs binder devices and we do also need to account for the case
where a loopfs superblock is shutdown while backing files might still be
associated with some loop devices. In such cases no bd_disk device will
be attached to bdev. This is not in itself noteworthy it's more about
documenting the "why" of the added bdev->bd_disk check for posterity.

[1]: 6a21cc5 ("seccomp: add a return code to trap to userspace")
[2]: fb3c538 ("seccomp: add SECCOMP_USER_NOTIF_FLAG_CONTINUE")
[3]: https://lore.kernel.org/lkml/[email protected]
[4]: https://gist.github.com/brauner/dcaf15e6977cc1bfadfb3965f126c02f
[5]: kubernetes-sigs/kind#1333
     kubernetes-sigs/kind#1248
     https://lists.freedesktop.org/archives/systemd-devel/2017-August/039453.html
     https://chromium.googlesource.com/chromiumos/docs/+/master/containers_and_vms.md#loop-mount
     https://gitlab.com/gitlab-com/support-forum/issues/3732
     moby/moby#27886
     https://twitter.com/_AkihiroSuda_/status/1249664478267854848
     https://serverfault.com/questions/701384/loop-device-in-a-linux-container
     https://discuss.linuxcontainers.org/t/providing-access-to-loop-and-other-devices-in-containers/1352
     https://discuss.concourse-ci.org/t/exposing-dev-loop-devices-in-privileged-mode/813
Cc: Jens Axboe <[email protected]>
Cc: Steve Barber <[email protected]>
Cc: Filipe Brandenburger <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Benjamin Elder <[email protected]>
Cc: Seth Forshee <[email protected]>
Cc: Stéphane Graber <[email protected]>
Cc: Tom Gundersen <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Christian Kellner <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: "David S. Miller" <[email protected]>
Cc: Dylan Reid <[email protected]>
Cc: David Rheinsberg <[email protected]>
Cc: Akihiro Suda <[email protected]>
Cc: Dmitry Vyukov <[email protected]>
Cc: "Rafael J. Wysocki" <[email protected]>
Reviewed-by: Serge Hallyn <[email protected]>
Signed-off-by: Christian Brauner <[email protected]>
---
/* v2 */
- David Rheinsberg <[email protected]> /
  Christian Brauner <[email protected]>:
  - Correctly cleanup loop devices that are in-use after the loopfs
    instance has been shut down. This is important for some use-cases
    that David pointed out where they effectively create a loopfs
    instance, allocate devices and drop unnecessary references to it.
- Christian Brauner <[email protected]>:
  - Replace lo_loopfs_i inode member in struct loop_device with a custom
    struct lo_info pointer which is only allocated for loopfs loop
    devices.

/* v3 */
- Christian Brauner <[email protected]>:
  - Fix loopfs_access() to not care about non-loopfs devices.
- David Rheinsberg <[email protected]> /
  Serge Hallyn <[email protected]>:
  - Remove "max" mount option.
brauner pushed a commit to brauner/linux that referenced this issue Apr 23, 2020
This implements loopfs, a loop device filesystem. It takes inspiration
from the binderfs filesystem I implemented about two years ago and with
which we had overall good experiences so far. Parts of it are also
based on [3] but it's mostly a new, imho cleaner approach.

Loopfs allows to create private loop devices instances to applications
for various use-cases. It covers the use-case that was expressed on-list
and in-person to get programmatic access to private loop devices for
image building in sandboxes. An illustration for this is provided in
[4].

Also loopfs is intended to provide loop devices to privileged and
unprivileged containers which has been a frequent request from various
major tools (Chromium, Kubernetes, LXD, Moby/Docker, systemd). I'm
providing a non-exhaustive list of issues and requests (cf. [5]) around
this feature mainly to illustrate that I'm not making the use-cases up.
Currently none of this can be done safely since handing a loop device
from the host into a container means that the container can see anything
that the host is doing with that loop device and what other containers
are doing with that device too. And (bind-)mounting devtmpfs inside of
containers is not secure at all so also not an option (though sometimes
done out of despair apparently).

The workloads people run in containers are supposed to be indiscernible
from workloads run on the host and the tools inside of the container are
supposed to not be required to be aware that they are running inside a
container apart from containerization tools themselves. This is
especially true when running older distros in containers that did exist
before containers were as ubiquitous as they are today. With loopfs user
can call mount -o loop and in a correctly setup container things work
the same way they would on the host. The filesystem representation
allows us to do this in a very simple way. At container setup, a
container manager can mount a private instance of loopfs somehwere, e.g.
at /dev/loopfs and then bind-mount or symlink /dev/loopfs/loop-control
to /dev/loop-control, pre allocate and symlink the number of standard
devices into their standard location and have a service file or rules in
place that symlink additionally allocated loop devices through losetup
into place as well.
With the new syscall interception logic this is also possible for
unprivileged containers. In these cases when a user calls mount -o loop
<image> <mountpoint> it will be possible to completely setup the loop
device in the container. The final mount syscall is handled through
syscall interception which we already implemented and released in
earlier kernels (see [1] and [2]) and is actively used in production
workloads. The mount is often rewritten to a fuse binary to provide safe
access for unprivileged containers.

Loopfs also allows the creation of hidden/detached dynamic loop devices
and associated mounts which also was a often issued request. With the
old mount api this can be achieved by creating a temporary loopfs and
stashing a file descriptor to the mount point and the loop-control
device and immediately unmounting the loopfs instance.  With the new
mount api a detached mount can be created directly (i.e. a mount not
visible anywhere in the filesystem). New loop devices can then be
allocated and configured. They can be mounted through
/proc/self/<fd>/<nr> with the old mount api or by using the fd directly
with the new mount api. Combined with a mount namespace this allows for
fully auto-cleaned up loop devices on program crash. This ties back to
various use-cases and is illustrated in [4].

The filesystem representation requires the standard boilerplate
filesystem code we know from other tiny filesystems. And all of
the loopfs code is hidden under a config option that defaults to false.
This specifically means, that none of the code even exists when users do
not have any use-case for loopfs.
In addition, the loopfs code does not alter how loop devices behave at
all, i.e. there are no changes to any existing workloads and I've taken
care to ifdef all loopfs specific things out.

Each loopfs mount is a separate instance. As such loop devices created
in one instance are independent of loop devices created in another
instance. This specifically entails that loop devices are only visible
in the loopfs instance they belong to.

The number of loop devices available in loopfs instances are
hierarchically limited through /proc/sys/user/max_loop_devices via the
ucount infrastructure (Thanks to David Rheinsberg for pointing out that
missing piece.). An administrator could e.g. set
echo 3 > /proc/sys/user/max_loop_devices at which point any loopfs
instance mounted by uid x can only create 3 loop devices no matter how
many loopfs instances they mount. This limit applies hierarchically to
all user namespaces.

In addition, loopfs has a "max" mount option which allows to set a limit
on the number of loop devices for a given loopfs instance. This is
mainly to cover use-cases where a single loopfs mount is shared as a
bind-mount between multiple parties that are prevented from creating
other loopfs mounts and is equivalent to the semantics of the binderfs
and devpts "max" mount option.

Note that in __loop_clr_fd() we now need not just check whether bdev is
valid but also whether bdev->bd_disk is valid. This wasn't necessary
before because in order to call LOOP_CLR_FD the loop device would need
to be open and thus bdev->bd_disk was guaranteed to be allocated. For
loopfs loop devices we allow callers to simply unlink them just as we do
for binderfs binder devices and we do also need to account for the case
where a loopfs superblock is shutdown while backing files might still be
associated with some loop devices. In such cases no bd_disk device will
be attached to bdev. This is not in itself noteworthy it's more about
documenting the "why" of the added bdev->bd_disk check for posterity.

[1]: 6a21cc5 ("seccomp: add a return code to trap to userspace")
[2]: fb3c538 ("seccomp: add SECCOMP_USER_NOTIF_FLAG_CONTINUE")
[3]: https://lore.kernel.org/lkml/[email protected]
[4]: https://gist.github.com/brauner/dcaf15e6977cc1bfadfb3965f126c02f
[5]: kubernetes-sigs/kind#1333
     kubernetes-sigs/kind#1248
     https://lists.freedesktop.org/archives/systemd-devel/2017-August/039453.html
     https://chromium.googlesource.com/chromiumos/docs/+/master/containers_and_vms.md#loop-mount
     https://gitlab.com/gitlab-com/support-forum/issues/3732
     moby/moby#27886
     https://twitter.com/_AkihiroSuda_/status/1249664478267854848
     https://serverfault.com/questions/701384/loop-device-in-a-linux-container
     https://discuss.linuxcontainers.org/t/providing-access-to-loop-and-other-devices-in-containers/1352
     https://discuss.concourse-ci.org/t/exposing-dev-loop-devices-in-privileged-mode/813
Cc: Jens Axboe <[email protected]>
Cc: Steve Barber <[email protected]>
Cc: Filipe Brandenburger <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Benjamin Elder <[email protected]>
Cc: Seth Forshee <[email protected]>
Cc: Stéphane Graber <[email protected]>
Cc: Tom Gundersen <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Christian Kellner <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: "David S. Miller" <[email protected]>
Cc: Dylan Reid <[email protected]>
Cc: David Rheinsberg <[email protected]>
Cc: Akihiro Suda <[email protected]>
Cc: Dmitry Vyukov <[email protected]>
Cc: "Rafael J. Wysocki" <[email protected]>
Reviewed-by: Serge Hallyn <[email protected]>
Signed-off-by: Christian Brauner <[email protected]>
---
/* v2 */
- David Rheinsberg <[email protected]> /
  Christian Brauner <[email protected]>:
  - Correctly cleanup loop devices that are in-use after the loopfs
    instance has been shut down. This is important for some use-cases
    that David pointed out where they effectively create a loopfs
    instance, allocate devices and drop unnecessary references to it.
- Christian Brauner <[email protected]>:
  - Replace lo_loopfs_i inode member in struct loop_device with a custom
    struct lo_info pointer which is only allocated for loopfs loop
    devices.

/* v3 */
- Christian Brauner <[email protected]>:
  - Fix loopfs_access() to not care about non-loopfs devices.
- David Rheinsberg <[email protected]> /
  Serge Hallyn <[email protected]>:
  - Remove "max" mount option.
brauner pushed a commit to brauner/linux that referenced this issue Apr 23, 2020
This implements loopfs, a loop device filesystem. It takes inspiration
from the binderfs filesystem I implemented about two years ago and with
which we had overall good experiences so far. Parts of it are also
based on [3] but it's mostly a new, imho cleaner approach.

Loopfs allows to create private loop devices instances to applications
for various use-cases. It covers the use-case that was expressed on-list
and in-person to get programmatic access to private loop devices for
image building in sandboxes. An illustration for this is provided in
[4].

Also loopfs is intended to provide loop devices to privileged and
unprivileged containers which has been a frequent request from various
major tools (Chromium, Kubernetes, LXD, Moby/Docker, systemd). I'm
providing a non-exhaustive list of issues and requests (cf. [5]) around
this feature mainly to illustrate that I'm not making the use-cases up.
Currently none of this can be done safely since handing a loop device
from the host into a container means that the container can see anything
that the host is doing with that loop device and what other containers
are doing with that device too. And (bind-)mounting devtmpfs inside of
containers is not secure at all so also not an option (though sometimes
done out of despair apparently).

The workloads people run in containers are supposed to be indiscernible
from workloads run on the host and the tools inside of the container are
supposed to not be required to be aware that they are running inside a
container apart from containerization tools themselves. This is
especially true when running older distros in containers that did exist
before containers were as ubiquitous as they are today. With loopfs user
can call mount -o loop and in a correctly setup container things work
the same way they would on the host. The filesystem representation
allows us to do this in a very simple way. At container setup, a
container manager can mount a private instance of loopfs somehwere, e.g.
at /dev/loopfs and then bind-mount or symlink /dev/loopfs/loop-control
to /dev/loop-control, pre allocate and symlink the number of standard
devices into their standard location and have a service file or rules in
place that symlink additionally allocated loop devices through losetup
into place as well.
With the new syscall interception logic this is also possible for
unprivileged containers. In these cases when a user calls mount -o loop
<image> <mountpoint> it will be possible to completely setup the loop
device in the container. The final mount syscall is handled through
syscall interception which we already implemented and released in
earlier kernels (see [1] and [2]) and is actively used in production
workloads. The mount is often rewritten to a fuse binary to provide safe
access for unprivileged containers.

Loopfs also allows the creation of hidden/detached dynamic loop devices
and associated mounts which also was a often issued request. With the
old mount api this can be achieved by creating a temporary loopfs and
stashing a file descriptor to the mount point and the loop-control
device and immediately unmounting the loopfs instance.  With the new
mount api a detached mount can be created directly (i.e. a mount not
visible anywhere in the filesystem). New loop devices can then be
allocated and configured. They can be mounted through
/proc/self/<fd>/<nr> with the old mount api or by using the fd directly
with the new mount api. Combined with a mount namespace this allows for
fully auto-cleaned up loop devices on program crash. This ties back to
various use-cases and is illustrated in [4].

The filesystem representation requires the standard boilerplate
filesystem code we know from other tiny filesystems. And all of
the loopfs code is hidden under a config option that defaults to false.
This specifically means, that none of the code even exists when users do
not have any use-case for loopfs.
In addition, the loopfs code does not alter how loop devices behave at
all, i.e. there are no changes to any existing workloads and I've taken
care to ifdef all loopfs specific things out.

Each loopfs mount is a separate instance. As such loop devices created
in one instance are independent of loop devices created in another
instance. This specifically entails that loop devices are only visible
in the loopfs instance they belong to.

The number of loop devices available in loopfs instances are
hierarchically limited through /proc/sys/user/max_loop_devices via the
ucount infrastructure (Thanks to David Rheinsberg for pointing out that
missing piece.). An administrator could e.g. set
echo 3 > /proc/sys/user/max_loop_devices at which point any loopfs
instance mounted by uid x can only create 3 loop devices no matter how
many loopfs instances they mount. This limit applies hierarchically to
all user namespaces.

In addition, loopfs has a "max" mount option which allows to set a limit
on the number of loop devices for a given loopfs instance. This is
mainly to cover use-cases where a single loopfs mount is shared as a
bind-mount between multiple parties that are prevented from creating
other loopfs mounts and is equivalent to the semantics of the binderfs
and devpts "max" mount option.

Note that in __loop_clr_fd() we now need not just check whether bdev is
valid but also whether bdev->bd_disk is valid. This wasn't necessary
before because in order to call LOOP_CLR_FD the loop device would need
to be open and thus bdev->bd_disk was guaranteed to be allocated. For
loopfs loop devices we allow callers to simply unlink them just as we do
for binderfs binder devices and we do also need to account for the case
where a loopfs superblock is shutdown while backing files might still be
associated with some loop devices. In such cases no bd_disk device will
be attached to bdev. This is not in itself noteworthy it's more about
documenting the "why" of the added bdev->bd_disk check for posterity.

[1]: 6a21cc5 ("seccomp: add a return code to trap to userspace")
[2]: fb3c538 ("seccomp: add SECCOMP_USER_NOTIF_FLAG_CONTINUE")
[3]: https://lore.kernel.org/lkml/[email protected]
[4]: https://gist.github.com/brauner/dcaf15e6977cc1bfadfb3965f126c02f
[5]: kubernetes-sigs/kind#1333
     kubernetes-sigs/kind#1248
     https://lists.freedesktop.org/archives/systemd-devel/2017-August/039453.html
     https://chromium.googlesource.com/chromiumos/docs/+/master/containers_and_vms.md#loop-mount
     https://gitlab.com/gitlab-com/support-forum/issues/3732
     moby/moby#27886
     https://twitter.com/_AkihiroSuda_/status/1249664478267854848
     https://serverfault.com/questions/701384/loop-device-in-a-linux-container
     https://discuss.linuxcontainers.org/t/providing-access-to-loop-and-other-devices-in-containers/1352
     https://discuss.concourse-ci.org/t/exposing-dev-loop-devices-in-privileged-mode/813
Cc: Jens Axboe <[email protected]>
Cc: Steve Barber <[email protected]>
Cc: Filipe Brandenburger <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Benjamin Elder <[email protected]>
Cc: Seth Forshee <[email protected]>
Cc: Stéphane Graber <[email protected]>
Cc: Tom Gundersen <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Christian Kellner <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: "David S. Miller" <[email protected]>
Cc: Dylan Reid <[email protected]>
Cc: David Rheinsberg <[email protected]>
Cc: Akihiro Suda <[email protected]>
Cc: Dmitry Vyukov <[email protected]>
Cc: "Rafael J. Wysocki" <[email protected]>
Reviewed-by: Serge Hallyn <[email protected]>
Signed-off-by: Christian Brauner <[email protected]>
---
/* v2 */
- David Rheinsberg <[email protected]> /
  Christian Brauner <[email protected]>:
  - Correctly cleanup loop devices that are in-use after the loopfs
    instance has been shut down. This is important for some use-cases
    that David pointed out where they effectively create a loopfs
    instance, allocate devices and drop unnecessary references to it.
- Christian Brauner <[email protected]>:
  - Replace lo_loopfs_i inode member in struct loop_device with a custom
    struct lo_info pointer which is only allocated for loopfs loop
    devices.

/* v3 */
- Christian Brauner <[email protected]>:
  - Fix loopfs_access() to not care about non-loopfs devices.
- David Rheinsberg <[email protected]> /
  Serge Hallyn <[email protected]>:
  - Remove "max" mount option.
fengguang pushed a commit to 0day-ci/linux that referenced this issue Apr 23, 2020
This implements loopfs, a loop device filesystem. It takes inspiration
from the binderfs filesystem I implemented about two years ago and with
which we had overall good experiences so far. Parts of it are also
based on [3] but it's mostly a new, imho cleaner approach.

Loopfs allows to create private loop devices instances to applications
for various use-cases. It covers the use-case that was expressed on-list
and in-person to get programmatic access to private loop devices for
image building in sandboxes. An illustration for this is provided in
[4].

Also loopfs is intended to provide loop devices to privileged and
unprivileged containers which has been a frequent request from various
major tools (Chromium, Kubernetes, LXD, Moby/Docker, systemd). I'm
providing a non-exhaustive list of issues and requests (cf. [5]) around
this feature mainly to illustrate that I'm not making the use-cases up.
Currently none of this can be done safely since handing a loop device
from the host into a container means that the container can see anything
that the host is doing with that loop device and what other containers
are doing with that device too. And (bind-)mounting devtmpfs inside of
containers is not secure at all so also not an option (though sometimes
done out of despair apparently).

The workloads people run in containers are supposed to be indiscernible
from workloads run on the host and the tools inside of the container are
supposed to not be required to be aware that they are running inside a
container apart from containerization tools themselves. This is
especially true when running older distros in containers that did exist
before containers were as ubiquitous as they are today. With loopfs user
can call mount -o loop and in a correctly setup container things work
the same way they would on the host. The filesystem representation
allows us to do this in a very simple way. At container setup, a
container manager can mount a private instance of loopfs somehwere, e.g.
at /dev/loopfs and then bind-mount or symlink /dev/loopfs/loop-control
to /dev/loop-control, pre allocate and symlink the number of standard
devices into their standard location and have a service file or rules in
place that symlink additionally allocated loop devices through losetup
into place as well.
With the new syscall interception logic this is also possible for
unprivileged containers. In these cases when a user calls mount -o loop
<image> <mountpoint> it will be possible to completely setup the loop
device in the container. The final mount syscall is handled through
syscall interception which we already implemented and released in
earlier kernels (see [1] and [2]) and is actively used in production
workloads. The mount is often rewritten to a fuse binary to provide safe
access for unprivileged containers.

Loopfs also allows the creation of hidden/detached dynamic loop devices
and associated mounts which also was a often issued request. With the
old mount api this can be achieved by creating a temporary loopfs and
stashing a file descriptor to the mount point and the loop-control
device and immediately unmounting the loopfs instance.  With the new
mount api a detached mount can be created directly (i.e. a mount not
visible anywhere in the filesystem). New loop devices can then be
allocated and configured. They can be mounted through
/proc/self/<fd>/<nr> with the old mount api or by using the fd directly
with the new mount api. Combined with a mount namespace this allows for
fully auto-cleaned up loop devices on program crash. This ties back to
various use-cases and is illustrated in [4].

The filesystem representation requires the standard boilerplate
filesystem code we know from other tiny filesystems. And all of
the loopfs code is hidden under a config option that defaults to false.
This specifically means, that none of the code even exists when users do
not have any use-case for loopfs.
In addition, the loopfs code does not alter how loop devices behave at
all, i.e. there are no changes to any existing workloads and I've taken
care to ifdef all loopfs specific things out.

Each loopfs mount is a separate instance. As such loop devices created
in one instance are independent of loop devices created in another
instance. This specifically entails that loop devices are only visible
in the loopfs instance they belong to.

The number of loop devices available in loopfs instances are
hierarchically limited through /proc/sys/user/max_loop_devices via the
ucount infrastructure (Thanks to David Rheinsberg for pointing out that
missing piece.). An administrator could e.g. set
echo 3 > /proc/sys/user/max_loop_devices at which point any loopfs
instance mounted by uid x can only create 3 loop devices no matter how
many loopfs instances they mount. This limit applies hierarchically to
all user namespaces.

In addition, loopfs has a "max" mount option which allows to set a limit
on the number of loop devices for a given loopfs instance. This is
mainly to cover use-cases where a single loopfs mount is shared as a
bind-mount between multiple parties that are prevented from creating
other loopfs mounts and is equivalent to the semantics of the binderfs
and devpts "max" mount option.

Note that in __loop_clr_fd() we now need not just check whether bdev is
valid but also whether bdev->bd_disk is valid. This wasn't necessary
before because in order to call LOOP_CLR_FD the loop device would need
to be open and thus bdev->bd_disk was guaranteed to be allocated. For
loopfs loop devices we allow callers to simply unlink them just as we do
for binderfs binder devices and we do also need to account for the case
where a loopfs superblock is shutdown while backing files might still be
associated with some loop devices. In such cases no bd_disk device will
be attached to bdev. This is not in itself noteworthy it's more about
documenting the "why" of the added bdev->bd_disk check for posterity.

[1]: 6a21cc5 ("seccomp: add a return code to trap to userspace")
[2]: fb3c538 ("seccomp: add SECCOMP_USER_NOTIF_FLAG_CONTINUE")
[3]: https://lore.kernel.org/lkml/[email protected]
[4]: https://gist.github.com/brauner/dcaf15e6977cc1bfadfb3965f126c02f
[5]: kubernetes-sigs/kind#1333
     kubernetes-sigs/kind#1248
     https://lists.freedesktop.org/archives/systemd-devel/2017-August/039453.html
     https://chromium.googlesource.com/chromiumos/docs/+/master/containers_and_vms.md#loop-mount
     https://gitlab.com/gitlab-com/support-forum/issues/3732
     moby/moby#27886
     https://twitter.com/_AkihiroSuda_/status/1249664478267854848
     https://serverfault.com/questions/701384/loop-device-in-a-linux-container
     https://discuss.linuxcontainers.org/t/providing-access-to-loop-and-other-devices-in-containers/1352
     https://discuss.concourse-ci.org/t/exposing-dev-loop-devices-in-privileged-mode/813
Cc: Jens Axboe <[email protected]>
Cc: Steve Barber <[email protected]>
Cc: Filipe Brandenburger <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Benjamin Elder <[email protected]>
Cc: Seth Forshee <[email protected]>
Cc: Stéphane Graber <[email protected]>
Cc: Tom Gundersen <[email protected]>
Cc: Serge Hallyn <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Christian Kellner <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: "David S. Miller" <[email protected]>
Cc: Dylan Reid <[email protected]>
Cc: David Rheinsberg <[email protected]>
Cc: Akihiro Suda <[email protected]>
Cc: Dmitry Vyukov <[email protected]>
Cc: "Rafael J. Wysocki" <[email protected]>
Signed-off-by: Christian Brauner <[email protected]>
brauner pushed a commit to brauner/linux that referenced this issue Apr 24, 2020
This implements loopfs, a loop device filesystem. It takes inspiration
from the binderfs filesystem I implemented about two years ago and with
which we had overall good experiences so far. Parts of it are also
based on [3] but it's mostly a new, imho cleaner approach.

Loopfs allows to create private loop devices instances to applications
for various use-cases. It covers the use-case that was expressed on-list
and in-person to get programmatic access to private loop devices for
image building in sandboxes. An illustration for this is provided in
[4].

Also loopfs is intended to provide loop devices to privileged and
unprivileged containers which has been a frequent request from various
major tools (Chromium, Kubernetes, LXD, Moby/Docker, systemd). I'm
providing a non-exhaustive list of issues and requests (cf. [5]) around
this feature mainly to illustrate that I'm not making the use-cases up.
Currently none of this can be done safely since handing a loop device
from the host into a container means that the container can see anything
that the host is doing with that loop device and what other containers
are doing with that device too. And (bind-)mounting devtmpfs inside of
containers is not secure at all so also not an option (though sometimes
done out of despair apparently).

The workloads people run in containers are supposed to be indiscernible
from workloads run on the host and the tools inside of the container are
supposed to not be required to be aware that they are running inside a
container apart from containerization tools themselves. This is
especially true when running older distros in containers that did exist
before containers were as ubiquitous as they are today. With loopfs user
can call mount -o loop and in a correctly setup container things work
the same way they would on the host. The filesystem representation
allows us to do this in a very simple way. At container setup, a
container manager can mount a private instance of loopfs somehwere, e.g.
at /dev/loopfs and then bind-mount or symlink /dev/loopfs/loop-control
to /dev/loop-control, pre allocate and symlink the number of standard
devices into their standard location and have a service file or rules in
place that symlink additionally allocated loop devices through losetup
into place as well.
With the new syscall interception logic this is also possible for
unprivileged containers. In these cases when a user calls mount -o loop
<image> <mountpoint> it will be possible to completely setup the loop
device in the container. The final mount syscall is handled through
syscall interception which we already implemented and released in
earlier kernels (see [1] and [2]) and is actively used in production
workloads. The mount is often rewritten to a fuse binary to provide safe
access for unprivileged containers.

Loopfs also allows the creation of hidden/detached dynamic loop devices
and associated mounts which also was a often issued request. With the
old mount api this can be achieved by creating a temporary loopfs and
stashing a file descriptor to the mount point and the loop-control
device and immediately unmounting the loopfs instance.  With the new
mount api a detached mount can be created directly (i.e. a mount not
visible anywhere in the filesystem). New loop devices can then be
allocated and configured. They can be mounted through
/proc/self/<fd>/<nr> with the old mount api or by using the fd directly
with the new mount api. Combined with a mount namespace this allows for
fully auto-cleaned up loop devices on program crash. This ties back to
various use-cases and is illustrated in [4].

The filesystem representation requires the standard boilerplate
filesystem code we know from other tiny filesystems. And all of
the loopfs code is hidden under a config option that defaults to false.
This specifically means, that none of the code even exists when users do
not have any use-case for loopfs.
In addition, the loopfs code does not alter how loop devices behave at
all, i.e. there are no changes to any existing workloads and I've taken
care to ifdef all loopfs specific things out.

Each loopfs mount is a separate instance. As such loop devices created
in one instance are independent of loop devices created in another
instance. This specifically entails that loop devices are only visible
in the loopfs instance they belong to.

The number of loop devices available in loopfs instances are
hierarchically limited through /proc/sys/user/max_loop_devices via the
ucount infrastructure (Thanks to David Rheinsberg for pointing out that
missing piece.). An administrator could e.g. set
echo 3 > /proc/sys/user/max_loop_devices at which point any loopfs
instance mounted by uid x can only create 3 loop devices no matter how
many loopfs instances they mount. This limit applies hierarchically to
all user namespaces.

In addition, loopfs has a "max" mount option which allows to set a limit
on the number of loop devices for a given loopfs instance. This is
mainly to cover use-cases where a single loopfs mount is shared as a
bind-mount between multiple parties that are prevented from creating
other loopfs mounts and is equivalent to the semantics of the binderfs
and devpts "max" mount option.

Note that in __loop_clr_fd() we now need not just check whether bdev is
valid but also whether bdev->bd_disk is valid. This wasn't necessary
before because in order to call LOOP_CLR_FD the loop device would need
to be open and thus bdev->bd_disk was guaranteed to be allocated. For
loopfs loop devices we allow callers to simply unlink them just as we do
for binderfs binder devices and we do also need to account for the case
where a loopfs superblock is shutdown while backing files might still be
associated with some loop devices. In such cases no bd_disk device will
be attached to bdev. This is not in itself noteworthy it's more about
documenting the "why" of the added bdev->bd_disk check for posterity.

[1]: 6a21cc5 ("seccomp: add a return code to trap to userspace")
[2]: fb3c538 ("seccomp: add SECCOMP_USER_NOTIF_FLAG_CONTINUE")
[3]: https://lore.kernel.org/lkml/[email protected]
[4]: https://gist.github.com/brauner/dcaf15e6977cc1bfadfb3965f126c02f
[5]: kubernetes-sigs/kind#1333
     kubernetes-sigs/kind#1248
     https://lists.freedesktop.org/archives/systemd-devel/2017-August/039453.html
     https://chromium.googlesource.com/chromiumos/docs/+/master/containers_and_vms.md#loop-mount
     https://gitlab.com/gitlab-com/support-forum/issues/3732
     moby/moby#27886
     https://twitter.com/_AkihiroSuda_/status/1249664478267854848
     https://serverfault.com/questions/701384/loop-device-in-a-linux-container
     https://discuss.linuxcontainers.org/t/providing-access-to-loop-and-other-devices-in-containers/1352
     https://discuss.concourse-ci.org/t/exposing-dev-loop-devices-in-privileged-mode/813
Cc: Jens Axboe <[email protected]>
Cc: Steve Barber <[email protected]>
Cc: Filipe Brandenburger <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Benjamin Elder <[email protected]>
Cc: Seth Forshee <[email protected]>
Cc: Stéphane Graber <[email protected]>
Cc: Tom Gundersen <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Christian Kellner <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: "David S. Miller" <[email protected]>
Cc: Dylan Reid <[email protected]>
Cc: David Rheinsberg <[email protected]>
Cc: Akihiro Suda <[email protected]>
Cc: Dmitry Vyukov <[email protected]>
Cc: "Rafael J. Wysocki" <[email protected]>
Reviewed-by: Serge Hallyn <[email protected]>
Signed-off-by: Christian Brauner <[email protected]>
---
/* v2 */
- David Rheinsberg <[email protected]> /
  Christian Brauner <[email protected]>:
  - Correctly cleanup loop devices that are in-use after the loopfs
    instance has been shut down. This is important for some use-cases
    that David pointed out where they effectively create a loopfs
    instance, allocate devices and drop unnecessary references to it.
- Christian Brauner <[email protected]>:
  - Replace lo_loopfs_i inode member in struct loop_device with a custom
    struct lo_info pointer which is only allocated for loopfs loop
    devices.

/* v3 */
- Christian Brauner <[email protected]>:
  - Fix loopfs_access() to not care about non-loopfs devices.
- David Rheinsberg <[email protected]> /
  Serge Hallyn <[email protected]>:
  - Remove "max" mount option.
brauner pushed a commit to brauner/linux that referenced this issue Apr 24, 2020
This implements loopfs, a loop device filesystem. It takes inspiration
from the binderfs filesystem I implemented about two years ago and with
which we had overall good experiences so far. Parts of it are also
based on [3] but it's mostly a new, imho cleaner approach.

Loopfs allows to create private loop devices instances to applications
for various use-cases. It covers the use-case that was expressed on-list
and in-person to get programmatic access to private loop devices for
image building in sandboxes. An illustration for this is provided in
[4].

Also loopfs is intended to provide loop devices to privileged and
unprivileged containers which has been a frequent request from various
major tools (Chromium, Kubernetes, LXD, Moby/Docker, systemd). I'm
providing a non-exhaustive list of issues and requests (cf. [5]) around
this feature mainly to illustrate that I'm not making the use-cases up.
Currently none of this can be done safely since handing a loop device
from the host into a container means that the container can see anything
that the host is doing with that loop device and what other containers
are doing with that device too. And (bind-)mounting devtmpfs inside of
containers is not secure at all so also not an option (though sometimes
done out of despair apparently).

The workloads people run in containers are supposed to be indiscernible
from workloads run on the host and the tools inside of the container are
supposed to not be required to be aware that they are running inside a
container apart from containerization tools themselves. This is
especially true when running older distros in containers that did exist
before containers were as ubiquitous as they are today. With loopfs user
can call mount -o loop and in a correctly setup container things work
the same way they would on the host. The filesystem representation
allows us to do this in a very simple way. At container setup, a
container manager can mount a private instance of loopfs somehwere, e.g.
at /dev/loopfs and then bind-mount or symlink /dev/loopfs/loop-control
to /dev/loop-control, pre allocate and symlink the number of standard
devices into their standard location and have a service file or rules in
place that symlink additionally allocated loop devices through losetup
into place as well.
With the new syscall interception logic this is also possible for
unprivileged containers. In these cases when a user calls mount -o loop
<image> <mountpoint> it will be possible to completely setup the loop
device in the container. The final mount syscall is handled through
syscall interception which we already implemented and released in
earlier kernels (see [1] and [2]) and is actively used in production
workloads. The mount is often rewritten to a fuse binary to provide safe
access for unprivileged containers.

Loopfs also allows the creation of hidden/detached dynamic loop devices
and associated mounts which also was a often issued request. With the
old mount api this can be achieved by creating a temporary loopfs and
stashing a file descriptor to the mount point and the loop-control
device and immediately unmounting the loopfs instance.  With the new
mount api a detached mount can be created directly (i.e. a mount not
visible anywhere in the filesystem). New loop devices can then be
allocated and configured. They can be mounted through
/proc/self/<fd>/<nr> with the old mount api or by using the fd directly
with the new mount api. Combined with a mount namespace this allows for
fully auto-cleaned up loop devices on program crash. This ties back to
various use-cases and is illustrated in [4].

The filesystem representation requires the standard boilerplate
filesystem code we know from other tiny filesystems. And all of
the loopfs code is hidden under a config option that defaults to false.
This specifically means, that none of the code even exists when users do
not have any use-case for loopfs.
In addition, the loopfs code does not alter how loop devices behave at
all, i.e. there are no changes to any existing workloads and I've taken
care to ifdef all loopfs specific things out.

Each loopfs mount is a separate instance. As such loop devices created
in one instance are independent of loop devices created in another
instance. This specifically entails that loop devices are only visible
in the loopfs instance they belong to.

The number of loop devices available in loopfs instances are
hierarchically limited through /proc/sys/user/max_loop_devices via the
ucount infrastructure (Thanks to David Rheinsberg for pointing out that
missing piece.). An administrator could e.g. set
echo 3 > /proc/sys/user/max_loop_devices at which point any loopfs
instance mounted by uid x can only create 3 loop devices no matter how
many loopfs instances they mount. This limit applies hierarchically to
all user namespaces.

In addition, loopfs has a "max" mount option which allows to set a limit
on the number of loop devices for a given loopfs instance. This is
mainly to cover use-cases where a single loopfs mount is shared as a
bind-mount between multiple parties that are prevented from creating
other loopfs mounts and is equivalent to the semantics of the binderfs
and devpts "max" mount option.

Note that in __loop_clr_fd() we now need not just check whether bdev is
valid but also whether bdev->bd_disk is valid. This wasn't necessary
before because in order to call LOOP_CLR_FD the loop device would need
to be open and thus bdev->bd_disk was guaranteed to be allocated. For
loopfs loop devices we allow callers to simply unlink them just as we do
for binderfs binder devices and we do also need to account for the case
where a loopfs superblock is shutdown while backing files might still be
associated with some loop devices. In such cases no bd_disk device will
be attached to bdev. This is not in itself noteworthy it's more about
documenting the "why" of the added bdev->bd_disk check for posterity.

[1]: 6a21cc5 ("seccomp: add a return code to trap to userspace")
[2]: fb3c538 ("seccomp: add SECCOMP_USER_NOTIF_FLAG_CONTINUE")
[3]: https://lore.kernel.org/lkml/[email protected]
[4]: https://gist.github.com/brauner/dcaf15e6977cc1bfadfb3965f126c02f
[5]: kubernetes-sigs/kind#1333
     kubernetes-sigs/kind#1248
     https://lists.freedesktop.org/archives/systemd-devel/2017-August/039453.html
     https://chromium.googlesource.com/chromiumos/docs/+/master/containers_and_vms.md#loop-mount
     https://gitlab.com/gitlab-com/support-forum/issues/3732
     moby/moby#27886
     https://twitter.com/_AkihiroSuda_/status/1249664478267854848
     https://serverfault.com/questions/701384/loop-device-in-a-linux-container
     https://discuss.linuxcontainers.org/t/providing-access-to-loop-and-other-devices-in-containers/1352
     https://discuss.concourse-ci.org/t/exposing-dev-loop-devices-in-privileged-mode/813
Cc: Jens Axboe <[email protected]>
Cc: Steve Barber <[email protected]>
Cc: Filipe Brandenburger <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Benjamin Elder <[email protected]>
Cc: Seth Forshee <[email protected]>
Cc: Stéphane Graber <[email protected]>
Cc: Tom Gundersen <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Christian Kellner <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: "David S. Miller" <[email protected]>
Cc: Dylan Reid <[email protected]>
Cc: David Rheinsberg <[email protected]>
Cc: Akihiro Suda <[email protected]>
Cc: Dmitry Vyukov <[email protected]>
Cc: "Rafael J. Wysocki" <[email protected]>
Reviewed-by: Serge Hallyn <[email protected]>
Signed-off-by: Christian Brauner <[email protected]>
---
/* v2 */
- David Rheinsberg <[email protected]> /
  Christian Brauner <[email protected]>:
  - Correctly cleanup loop devices that are in-use after the loopfs
    instance has been shut down. This is important for some use-cases
    that David pointed out where they effectively create a loopfs
    instance, allocate devices and drop unnecessary references to it.
- Christian Brauner <[email protected]>:
  - Replace lo_loopfs_i inode member in struct loop_device with a custom
    struct lo_info pointer which is only allocated for loopfs loop
    devices.

/* v3 */
- Christian Brauner <[email protected]>:
  - Fix loopfs_access() to not care about non-loopfs devices.
  - Stash refcounted sbinfo in lo_info to simplify retrieval of user
    namespace. This way each loopfs instance just takes a single
    reference for each to the user namespace that is dropped when the
    last loop device is removed. This puts us on the safe side. (Thanks
    to Serge for making me aware of this issue.
- David Rheinsberg <[email protected]> /
  Serge Hallyn <[email protected]>:
  - Remove "max" mount option.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/bug Categorizes issue or PR as related to a bug. lifecycle/active Indicates that an issue or PR is actively being worked on by a contributor. priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants