Author: Jude Nelson
Date:
To: Enrico Weigelt, metux IT consult
CC: dng@lists.dyne.org
Subject: Re: [Dng] vdev update and design document
Hi Enrico,

> IMHO, chroot should be sufficient for that case - at least w/ proper
> mounting. IIRC, running services in a chroot should be pretty standard
> on *BSD. Anyways, are sure, OpenBSD really has no mount namespaces ?


I'm pretty sure the standard practice in FreeBSD is to use jails instead of
chroot. OpenBSD has a single, global mount namespace--there is no notion
of a per-process mount namespace like in modern Linux. However, the
development philosophy in OpenBSD is to invest effort in making the
OS and services hard to exploit in the first place (e.g. ASLR, ProPolice,
privilege separation), instead of investing in ways to contain an
already-compromised service (e.g. containers, VMs).

The difficulty with relying on chroot for isolating a session is that it
doesn't get you very much isolation in the first place--it only changes the
root directory of the process, and this can be changed to anything else by
a program running as root. Note that this is *not* the case for FreeBSD
jails or Linux containers, since they also provide a private UID namespace,
so each jail/container has a wholly-separate root account. Specifically,
chroot(2) does not:
* change the working directory (POSIX allows this to be anything, making
escape easy for unprivileged programs);
* hide other users and groups outside the chroot (i.e. it is possible for
there to be UID/GID collisions between chroots);
* hide other processes outside the chroot (assuming you have procfs
mounted);
* hide networking state outside the chroot (a root user in a chroot() can
still affect iptables and interfaces, for example);
* enforce any resource isolation.

chroot() is *only* meant to provide a limited view of the filesystem tree,
and that's all OpenBSD services use it for. I would never recommend
using a chroot alone to provide session isolation; you'd have to take
additional steps (maybe pivot_root?).
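
To make the first point in the list concrete, here is a minimal sketch of the usual "chroot, then chdir, then drop root" sequence, assuming a Unix-like system and Python's standard os module. The function name enter_chroot and its arguments are hypothetical and only for illustration. It covers the working-directory and "root can chroot again" problems only; it does nothing about processes, networking, or resource limits.

```python
import os


def enter_chroot(new_root: str, uid: int, gid: int) -> None:
    """Confine the calling process to new_root, then give up root.

    Hypothetical helper; must be called as root. The order of the
    calls matters.
    """
    os.chroot(new_root)  # changes the root directory and nothing else
    os.chdir("/")        # chroot(2) leaves the cwd untouched, so move it inside
    os.setgroups([])     # drop supplementary groups while still privileged
    os.setgid(gid)       # drop the group before the user; a non-root process
                         # generally cannot change its group afterwards
    os.setuid(uid)       # finally drop root, so chroot() cannot be called again
```

A service would call something like enter_chroot("/var/empty", uid, gid) once at startup, after opening whatever files and sockets it needs from outside the new root.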

> OTOH, we could let do vdev do the namespacing magic (eg. based on
> session ident), but still following the suggested approach.


My development philosophy is to build a solution where as many people get
what they want as possible. I want vdev to run on OpenBSD since I use
OpenBSD on one of my laptops and would find it useful, so I need it to be
capable of doing device namespacing on its own. However, I also want it to
run in containers/jails when they're available, and I want it to be
possible to disable per-process access control if the user doesn't want
it. All of this can and will be supported :)

> hmm, we would need some 2-layer approach here:
>
> * layer 1: global context - all available system devices
> * layer 2: session context - only the per-session (virtual) devices
>
> Between these layers, we'd have a mapping (probably defined by the
> session manager), defining which real/system devices are mapped into
> some session context.
>
> Note: my primary goal here is not just access control (for that alone,
> groups and permissions would sufficient, IMHO), but an per-session
> device name virtualization, to ease userland configuration (eg. for
> arbitrary users never ever having to care about proper audio device
> names, etc).
>
> In that context, we'd have separate types of sessions (or perhaps call
> 'em 'scopes'). For example, X servers would run on their own UIDs (one
> per display) - things like vdev mappings here would be defined by the
> display manager. Arbitrary users won't ever get direct access to the
> underlying kernel devices.


Looks pretty sensible so far--it looks a lot like what we do at PlanetLab
[1]. For a given session, what do you think of the following approach?
1. set up the global vdev to listen for device changes from the OS;
2. set up each session vdev to listen to the global vdev for device changes
(e.g. by watching the global context's /dev with inotify or kqueue);
3. define for each session a set of vdev actions to take when a device is
discovered in the global vdev (example: when the global context's
/dev/audioXXX is discovered, create it as /dev/audio in the session, since
XXX refers to that session's sound card);
4. when starting the session, have the session manager start a vdev instance
with that session's actions on that session's /dev just before running the
session init. This way, the session vdev runs outside of the session's PID
context, so the contained processes can't easily influence its behavior.
Have the session vdev "discover" all of the devices in the global /dev (so
the renaming rules get processed);
5. when stopping the session, once all of the session's processes have
died, stop the session's vdev instance.
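
The renaming logic in steps 2 to 4 could be sketched roughly as follows. This is only an illustration of the idea, not vdev's actual data model or configuration format: the SessionVdev class, its rules dictionary, and the device names audio0, audio1 and sda are all made up for the example. A real implementation would feed it changes from inotify or kqueue instead of hand-built snapshots.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class SessionVdev:
    """Hypothetical per-session view of the global /dev."""

    # global device name -> name the device gets inside the session
    rules: Dict[str, str]
    # session device name -> global device currently backing it
    devices: Dict[str, str] = field(default_factory=dict)

    def on_global_change(self, old: Set[str], new: Set[str]) -> None:
        """Apply the difference between two snapshots of the global /dev."""
        for name in sorted(old - new):  # devices that disappeared
            session_name = self.rules.get(name)
            if session_name is not None and self.devices.get(session_name) == name:
                del self.devices[session_name]
        for name in sorted(new - old):  # devices that appeared
            session_name = self.rules.get(name)
            if session_name is not None:  # unmapped devices stay invisible
                self.devices[session_name] = name


# Step 4: the initial "discovery" is just a change from an empty snapshot.
session = SessionVdev(rules={"audio1": "audio"})
session.on_global_change(set(), {"audio0", "audio1", "sda"})
print(session.devices)  # {'audio': 'audio1'}
```

Treating startup as a change from the empty set means hotplug and initial population go through the same code path, which is the point of having the session vdev "discover" the global /dev in step 4.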

> revoke [snip]


I'm beginning to think that it was a good thing that revoke() was not
accepted into the kernel :) I'm learning more and more edge cases on this
mailing list; if I get another question about whether or not vdev will
support revoke(), I might have to cite you and Hendrik as reasons why not.

-Jude

[1] www.planet-lab.org

On Sat, Jan 3, 2015 at 1:31 PM, Enrico Weigelt, metux IT consult <
enrico.weigelt@???> wrote:

> On 03.01.2015 07:27, Jude Nelson wrote:
>
> Hi,
>
> > I don't disagree with you, especially since namespacing will be
> > necessary when the same device node in each session must refer to a
> > different device. However, as I mentioned in an earlier email, solving
> > the problem of per-process access control by giving each session its own
> > namespace isn't always viable, particularly on OpenBSD (which has no
> > containerization support beyond chroot, and chroot isn't particularly
> > useful for containing processes).
>
> IMHO, chroot should be sufficient for that case - at least w/ proper
> mounting. IIRC, running services in a chroot should be pretty standard
> on *BSD. Anyways, are sure, OpenBSD really has no mount namespaces ?
>
> OTOH, we could let do vdev do the namespacing magic (eg. based on
> session ident), but still following the suggested approach.
>
> > It's also not clear to me that the
> > maintenance burden would be reduced versus using ACLs, since a strategy
> > for populating a given session's /dev and keeping it up-to-date with
> > hotplug events would probably be comparably complex to vdev's ACL system
> > (and this is on top of the container lifecycle management code you'd
> > have to write).
>
> hmm, we would need some 2-layer approach here:
>
> * layer 1: global context - all available system devices
> * layer 2: session context - only the per-session (virtual) devices
>
> Between these layers, we'd have a mapping (probably defined by the
> session manager), defining which real/system devices are mapped into
> some session context.
>
> Note: my primary goal here is not just access control (for that alone,
> groups and permissions would sufficient, IMHO), but an per-session
> device name virtualization, to ease userland configuration (eg. for
> arbitrary users never ever having to care about proper audio device
> names, etc).
>
> In that context, we'd have separate types of sessions (or perhaps call
> 'em 'scopes'). For example, X servers would run on their own UIDs (one
> per display) - things like vdev mappings here would be defined by the
> display manager. Arbitrary users won't ever get direct access to the
> underlying kernel devices.
>
> >> I'd rather raise the question whether that's useful at all.
> >
> > There was an LWN article on this a while back [2]. The examples
> > provided there are as follows:
> > * If the login program could revoke() the tty device node before
> > prompting the password, this attack vector would be removed (assuming
> > the revoke() implementation didn't affect file descriptors in the
> > calling process).
>
> Sure. But shouldn't that potentially attacking process be killed in the
> first place ?
>
> Anyways, if we're talking about local tty, a user sitting on front of
> the console can't even be sure that he's talking to the real login
> program, if he sees some login prompt. Doing trojan attacks here is
> pretty trivial (in fact, that was one of my first easy hacks, back in
> school, I was using to take over our admin's account - what eventually
> lead to /me becoming offical admin ;-)). To prevent that kind of
> attacks, we would need an separate output channel (eg. some special
> screen region, etc) which is exclusive to the real login program and
> cant be touched by arbitrary user processes.
>
> > This also applies to X11, which could revoke() the
> > video device file prior to setting it up.
>
> Same case here. Of course, we have to consider ugly side effects from
> just cutting of processes from a device (still today, crashing X servers
> can leave the display/tty in broken state :().
>
> > * Suppose a process has open files in a filesystem you're trying to
> > unmount. You could revoke all files in the filesystem prior to trying
> > to umount() it.
>
> Whoooh, that's _dangerous_. Yes, forced closing the fd's from kernel
> side would keep the filesystem metadata consisent, but the application
> might get into really weird state if suddenly some fds get lost. Unless
> the application is _known_ to handle that gracefully, it should be
> properly shut down (at least w/ SIGTERM and proper shutdown timeout).
> So, yet another argument for _not_ simply revoking.
>
>
> cu
> --
> Enrico Weigelt,
> metux IT consulting
> +49-151-27565287
>