Re: [DNG] [devuan-dev] ci.devuan.org is down

Author: Daniel Reurich
Date:
To: dng@lists.dyne.org
Subject: Re: [DNG] [devuan-dev] ci.devuan.org is down

On 19/04/19 20:20, Jaromil wrote:
> On Fri, 19 Apr 2019, Daniel Reurich wrote:
>
>> Hi,
>>
>> ci.devuan.org - our jenkins server is currently down. This is due to a
>> reboot failure after a kernel update that I installed.
>
> this intervention was neither planned nor communicated; it was also on
> an old infrastructure to which we have no stable reach, because it is
> maintained by nextime and therefore needs extra coordination measures
> to ensure interventions.

Indeed it wasn't planned. There is more background which I omitted for
expediency: the priority was getting the message to nextime so he could
attend to the outage, not explaining all the details.

I was working on that server because I had discovered that all source
build jobs would fail consistently after between 4 and 6 seconds with a
killed process. These jobs run on the master node, i.e. on this server,
as the jenkins user. I had discussed the issue with parazyd the day
before, but he could offer no answers as to the consistent build
failures across all the source jobs I'd tried.

I had also discovered during the process that su-ing to the jenkins user
resulted in the session being closed almost immediately. In both cases
nothing appeared in the logs to indicate that OOM or other limits were
being hit.

In order to rule out bugs, and based on info I was gleaning from the
jenkins forum, I began upgrading the OS first, and then jenkins and all
the jenkins plug-ins. All these upgrades went smoothly (and solved a
number of security vulnerabilities along the way).
>
> I am uncomfortable knowing that any one of the caretakers can act
> unilaterally on such issues, raising risks of emergency interventions
> which then affect everyone's schedule.
>
You may be uncomfortable, jaromil, but the fact of the matter is I needed
to rebuild the debian-installer package. Incidentally, the last build
attempted on the CI was a couple of weeks ago, on the 23rd of March I
think. With KatolaZ gone, I'm the only other regular package builder
these days.

Also, as far as I'm aware, I'm pretty much the only person who has been
hands-on with that server in any meaningful way, particularly with
respect to maintenance and support for it. Given that my particular
domain within Devuan has been heavily oriented toward the build system,
I think it's reasonable that when it's broken I don't need to wait for a
full committee to get an approval to fix it - particularly given it was
an urgent issue and essentially all builds were broken.

> we do need to coordinate on these tasks and find periods in which
> everyone affected / responsible for the infrastructure bit is
> available.
>
In normal circumstances, yes, I agree that is reasonable. This
wasn't routine maintenance. This was problem solving where I'd spent
many hours over 2 days working on the issue before deciding a reboot was
a reasonable next move.

> I went a long way yesterday urging nextime to help, he is just packing
> today for a trip offline for the coming two weeks and the situation is
> very uncomfortable as works were scheduled and still pending also for
> the DNS administration access. He will do his best today to fix that
> so we can rotate the DNS on a new machine.
>
Thank you for this. I do appreciate your efforts and also nextime's.

> after that, we should take the occasion to rebuild the CI with better
> criteria, since the old setup was suboptimal. at dyne we (well, mostly
> parazyd) already set up two more build farm CIs (one for DECODE and
> one for maemo-leste) and have fixed a number of issues. Therefore I
> kindly ask parazyd and ralph and evilham for their availability
> setting up a new CI machine on the ganeti network, where parazyd can
> install and plan a new jenkins instance, which I understand won't cost
> him too much time since he has a well documented and replicable
> procedure for that now.
>
I agree, and I'm happy to work with whoever is interested in getting it
back up and running as soon as we can.

> meanwhile we can simply consider the CI unavailable for the period of
> Easter, which I hope you all manage to enjoy. we needed to fix this
> bit anyway so let's be constructive and do it without letting rush take
> over quality.

That's a reasonable suggestion. But I also have more time flexibility
over Easter than in my normal week. So if there is an opportunity to
restore service on the original server, I'd be happy to do so. But I
definitely don't want to continue relying on infra where we can't have
full control.

Regards,
    Daniel

--
Daniel Reurich
Centurion Computer Technology (2005) Ltd.
021 797 722
