:: Re: [DNG] Supervision scripts
Author: Rainer Weikusat
Date:  
To: dng
Subject: Re: [DNG] Supervision scripts
Hendrik Boom <hendrik@???> writes:
> On Wed, May 04, 2016 at 09:45:24PM +0100, Rainer Weikusat wrote:
>> Stephanie Daugherty <sdaugherty@???> writes:
>> > Process supervision is something I'm very opinionated about. In a number of
>> > high availability production environments, it's a necessary evil.
>> >
>> > However, it should *never* be an out of the box default for any
>> > network-exposed service. Service failures should be extraordinary events,
>> > and we should strive to keep treating them as such.
>>
>> That's based on a particular assumption about how 'automatic restarts'
>> will be used, namely, instead of fixing server errors and not as a
>> complement to that: I treat 'server failures' as 'extraordinary events'
>> but users don't (and shouldn't): They should experience as little
>> downtime as technically possible.
>>
>> [...]
>>
>> > The second reason is that it will reduce the number of high-quality bug
>> > reports developers receive - if failure is part of the routine, it tends
>> > not to get investigated very thoroughly, if at all.
>>
>> It greatly reduces the number of "low-quality" (or rather, "no quality")
>> bug reports I receive as I don't (usually) get frantic phone calls at
>> 3am UK time because a server in Texas terminated itself for some
>> reason. Instead, I can collect the core file as soon as I get around to
>> that and fix the bug.
>>
>> NB: I deal with appliances (as developer) and not with servers (as
>> sysadmin).
>
> An excellent example of why respawning needs to be an option, and the
> OS should neither force it on or off.


It's technically an option for 'our' system because the service
supervisor/monitor is just a command which is (or isn't) used as part
of a complete 'server invocation' (usually from a sysv-style init.d
script) and not a Master Control Program, which is IMHO how it should
be. But I certainly use it for all 'new' servers.
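
To illustrate the "just a command" idea, here is a minimal sketch of a
restart-on-exit wrapper. It is not the tool referred to above: the name
supervise and the max_restarts and delay parameters are made up for the
example, and a real monitor would also deal with signals, logging and
restart rate limiting.

```python
import subprocess
import sys
import time


def supervise(argv, max_restarts=None, delay=1.0):
    """Run argv and restart it whenever it exits.

    Returns the list of exit statuses seen. max_restarts=None means
    "restart forever"; both parameters are hypothetical, chosen only
    to keep the sketch testable.
    """
    statuses = []
    while True:
        status = subprocess.call(argv)   # blocks until the child exits
        statuses.append(status)
        if max_restarts is not None and len(statuses) > max_restarts:
            return statuses
        print(f"{argv[0]} exited with status {status}, restarting",
              file=sys.stderr)
        time.sleep(delay)                # avoid a tight restart loop
```

An init.d script would then start "wrapper server args" instead of
"server args" where restarts are wanted, and plain "server args" where
they are not, so the choice stays with whoever writes the invocation.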

There are other desirable effects of that, e.g., the system becomes (to a
degree) self-healing: Say some server currently can't work because of a
file system permission issue (or some other transient problem, e.g., a
full disk): It's sufficient to remedy the specific problem in order to
restore everything to working order, as the affected servers will just
start to work the next time they're restarted after the situation has
improved. There's no need to go hunting for "stuff that doesn't run
although it should" and restart it manually (and consequently, no risk
of overlooking something).
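
The self-healing effect can be shown with a small simulation. Everything
in it is an assumption made for the example: the stand-in "server" is a
one-line child process that fails while a required file is missing, and
the on_failure hook plays the part of the admin who fixes the problem at
some point. A real server would keep running instead of exiting 0.

```python
import os
import subprocess
import sys
import tempfile
import time

# Stand-in "server": exits 1 unless the file named by its argument exists.
CHILD = "import os, sys; sys.exit(0 if os.path.exists(sys.argv[1]) else 1)"


def run_until_healthy(argv, delay=0.1, on_failure=None):
    """Restart argv until it exits 0; return the number of failed attempts."""
    failures = 0
    while subprocess.call(argv) != 0:
        failures += 1
        if on_failure is not None:
            on_failure(failures)     # hook standing in for the admin's fix
        time.sleep(delay)
    return failures


def demo():
    """Fail twice, get 'fixed', then come up on the next restart."""
    with tempfile.TemporaryDirectory() as tmp:
        needed = os.path.join(tmp, "needed")

        def fix(n):
            if n == 2:               # the problem is remedied after two failures
                open(needed, "w").close()

        argv = [sys.executable, "-c", CHILD, needed]
        return run_until_healthy(argv, on_failure=fix)
```

Nobody restarts the child by hand here: once the file exists, the next
automatic restart succeeds.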

But leaving these two general remarks aside, I don't quite understand
what you wanted to express.
