Author: Steve Litt
Date:  
To: dng
Subject: Re: [DNG] Supervision scripts (was Re: OpenRC and Devuan)
On Wed, 04 May 2016 18:18:02 +0000
Stephanie Daugherty <sdaugherty@???> wrote:

> Process supervision is something I'm very opinionated about. In a
> number of high availability production environments, it's a necessary
> evil.
>
> However, it should *never* be an out of the box default for any
> network-exposed service. Service failures should be extraordinary
> events, and we should strive to keep treating them as such, so that
> we continue to pursue stability. Restarting a service automatically
> doesn't improve stability of that software; it works around an
> instability rather than addressing the root cause. It's a band-aid
> over a festering wound.


Good point.

> The failure of a service is analogous in my eyes to the tripping of a
> circuit breaker - it happened for a reason, and that underlying
> reason is probably serious. Circuit breakers in houses generally
> don't reset themselves, and neither should network-facing services.


Good analogy, good point.

> The biggest concern in any service failure is that a failure was
> caused by an exploit attempt - attacks which exploit bad
> memory-management tend to crash whatever they are exploiting, even on
> a failed attempt. In an environment where such an event has been
> reduced to routine, and automatic restarts are the norm, that
> attacker gets as many attempts as they need, reducing one of the
> first signs of an intrusion to barely a blip on the radar if the
> systems are even being monitored at all.


Makes sense.

> The second reason is that it will reduce the number of high-quality
> bug reports developers receive - if failure is part of the routine,
> it tends not to get investigated very thoroughly, if at all.
>
> A third reason is convention and expectation. We've lived without
> process supervision in the *nix world for almost 4 decades now, and
> admins with those decades of experience generally expect to be able
> to kill off a process and have it stay down.


Using a supervision suite that automatically respawns, the admin can
still down the process and have it stay down. For instance, in a runit
system, if you want to down ntpd and have it stay down, do the following
as root.

touch /etc/sv/ntpd/down
sv down ntpd

That puppy isn't coming back up on its own, reboots included, until
someone removes the down file (or runs "sv up ntpd" by hand).
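
To spell out both directions, here is a minimal sketch of the hold-down
steps above plus the reverse. The function names hold_down and release
and the SERVICE_ROOT variable are made up for illustration; /etc/sv is
carried over from the commands above, and the sketch assumes the stock
runit sv with its "up" and "down" commands. Treat it as a starting
point, not a tested tool.

```shell
#!/bin/sh
# Sketch only: hold a runit service down, and later release it.
# SERVICE_ROOT is a hypothetical variable naming where service
# directories live; /etc/sv matches the example in this message.
SERVICE_ROOT="${SERVICE_ROOT:-/etc/sv}"

hold_down() {
    # Create the down file so runsv does not start the service by
    # itself, then stop the running instance.
    touch "$SERVICE_ROOT/$1/down" && sv down "$1"
}

release() {
    # Remove the down file, then ask runsv to start the service and
    # go back to respawning it.
    rm -f "$SERVICE_ROOT/$1/down" && sv up "$1"
}
```

Run as root, "hold_down ntpd" does what the two commands above do, and
"release ntpd" undoes it. "sv status ntpd" shows which state the
service is in at any point.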

>
> Please consider these factors in any implementation of process
> supervision
> - while it's certainly a needed improvement for many
> organizations, it's not something that should just be on by default.


I see no reason to change our default init any time in the near future.
I think any reasonable init can, one way or another, do both respawning
(what Stephanie calls supervision) and oneshots, where if the process
dies, it's dead and gone.

Epoch has an easy way to do either. Sysvinit can do respawn via
its /etc/inittab, but normally does oneshots.

OpenRC only does oneshots, but respawning can be had by running runit,
s6, or daemontools-encore on top of OpenRC.

With runit and s6, oneshots are simply declared in the rc files that
run before any respawned processes. I think, but am not sure, that s6
has provisions to run something as a oneshot. But even if it doesn't,
post-rcfile oneshots can be kludged with run files something like the
following:

==========================================
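#!/bin/sh
# Run the daemon in the foreground; this line blocks until it exits.
mydaemon --run-in-foreground
# The daemon is gone. Keep this run script alive so the supervisor has
# nothing to respawn, until an admin creates the down file.
while ! test -f /etc/sv/mydaemon/down; do
    sleep 1000
done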
==========================================

I haven't actually tested the preceding, and I'd probably make my own
version of sleep if I actually wanted to do this, but you get the
picture.
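
On runit specifically, a similar effect may be possible without the
sleeping loop, by leaning on sv's "once" command, which starts a
service and does not restart it if it stops. The sketch below is a
guess at how that could be wired up, not a tested recipe: the function
name start_once and the SERVICE_ROOT variable are made up for
illustration, and mydaemon and /etc/sv are carried over from the run
file above.

```shell
#!/bin/sh
# Sketch only, untested on a live system.
# SERVICE_ROOT is a hypothetical variable naming where service
# directories live; /etc/sv matches the run file above.
SERVICE_ROOT="${SERVICE_ROOT:-/etc/sv}"

start_once() {
    # The down file keeps runsv from starting the service by itself,
    # for example at boot.
    touch "$SERVICE_ROOT/$1/down" || return 1
    # "once" starts the service now and leaves it down when it exits.
    sv once "$1"
}
```

"start_once mydaemon" would be run as root after runsv is already
supervising the service directory. The run file then stays an ordinary
one-line foreground launch, with no loop needed.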

SteveT

Steve Litt
April 2016 featured book: Rapid Learning for the 21st Century
http://www.troubleshooters.com/rl21