Author: Arnt Gulbrandsen
Date:
To: dng
Subject: Re: [DNG] Supervision scripts (was Re: OpenRC and Devuan)
Stephanie Daugherty writes:
> Service failures should be extraordinary events, and we should
> strive to keep treating them as such, so that we continue to
> pursue stability. Restarting a service automatically doesn't
> improve stability of that software, it works around an
> instability rather than addressing the root cause - it's a
> band-aid over a festering wound.
Unix has a few design choices that tend to produce problems like these,
such as malloc() and its C++ cousin "operator new".
malloc() is very simple: You ask for memory and get it. The negative side
of that simplicity is that if you're out of memory (and that happens
occasionally if a server is run close to capacity) then processes die
and/or become unresponsive. Such is the tyranny of the Poisson
distribution.
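To make the trade-off concrete, here is a minimal sketch of what checking
a single allocation looks like in C. The helper name dup_checked() is
hypothetical and only for illustration. Note the hedge: on systems that
overcommit memory, malloc() may return a non-NULL pointer and the process
may still be killed later, so a check like this does not cover every
out-of-memory case.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: duplicate a string, reporting allocation failure
 * to the caller instead of assuming malloc() always succeeds. */
char *dup_checked(const char *s)
{
    assert(s != NULL);          /* precondition: caller passes a string */

    size_t n = strlen(s) + 1;   /* include the terminating NUL */
    char *p = malloc(n);
    if (p == NULL) {
        /* Out of memory: the caller must now decide what to do. */
        return NULL;
    }
    memcpy(p, s, n);
    return p;
}
```

Every caller of such a helper needs its own recovery path for the NULL
case, which is where the cost of this approach accumulates.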
> The failure of a service is analogous in my eyes to the
> tripping of a circuit breaker - it happened for a reason, and
> that underlying reason is probably serious.
Pick your poison: Restart services or add failure handling around all
malloc() calls. I quite like the former in many cases, even though it
papers over various unintentional problems as well as providing the
intentional simplification. But then I like TCP better than NCP, etc.
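For comparison, the "restart services" option can be sketched in a few
lines of C. This is an illustration under assumptions, not how any
particular supervisor is implemented: the function supervise() and its
parameters max_starts and delay_seconds are hypothetical, and a real
supervisor does considerably more (logging, backoff, signal handling).

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: run argv[0] and restart it whenever it exits
 * abnormally, starting it at most max_starts times. Returns the number
 * of times the service was started, or -1 if fork() or waitpid() fails. */
int supervise(char *const argv[], int max_starts, unsigned delay_seconds)
{
    assert(argv != NULL && argv[0] != NULL);

    int starts = 0;
    while (starts < max_starts) {
        pid_t pid = fork();
        if (pid < 0) {
            perror("fork");
            return -1;
        }
        if (pid == 0) {
            execvp(argv[0], argv);
            perror("execvp");       /* reached only if exec failed */
            _exit(127);
        }
        starts++;

        int status;
        while (waitpid(pid, &status, 0) < 0) {
            if (errno != EINTR) {   /* retry only if interrupted */
                perror("waitpid");
                return -1;
            }
        }

        if (WIFEXITED(status) && WEXITSTATUS(status) == 0) {
            break;                  /* clean exit: do not restart */
        }
        if (starts < max_starts) {
            fprintf(stderr, "service died, restarting\n");
            sleep(delay_seconds);
        }
    }
    return starts;
}
```

Called as supervise(argv, 5, 1) with argv naming the service binary, this
starts the service up to five times with a one-second pause between
attempts. All the failure handling lives in one place, outside the
service, which is the simplification being traded against the per-call
checks.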