Author: Rick Moen
Date:  
To: dng
Subject: Re: [DNG] Request file system reviews and recomendations.
Quoting Harald Arnesen (harald@???):

> On 27.12.2017 19:34, Taiidan@??? wrote:
>
> > Please remember that all RAID should have ECC RAM and when it comes to
> > XFS it is MANDATORY to avoid massive data corruption.
>
> And a UPS.


To summarise the summary of the summary, concerning the above: I think
many folks are not very good at understanding risk models.


ECC RAM is not sufficient to catch all bad RAM problems, only some.
Back in 2006, I had an interesting case of this:
http://linuxmafia.com/pipermail/conspire/2006-December/002662.html
http://linuxmafia.com/pipermail/conspire/2006-December/002668.html
http://linuxmafia.com/pipermail/conspire/2007-January/002743.html

I know most people won't bother to read that, so I'll summarise: My VA
Linux Systems 2230 2U that was my prototype next-deployment server
showed a perplexing pattern of spontaneous reboots, even though all
512MB of its RAM was ECC SDRAM, on a server-grade ECC-supporting
Intel L440GX+ 'Lancewood' motherboard. The RAM had also passed long
testing using memtest86. Yet something about the situation still
seemed to suggest one or more bad RAM sticks.

As related in the mailing list links, I found the bad RAM using only logic
and stubborn use of iterative kernel compiles with 'make -j NN' cranked
high enough to exercise all the RAM. (And no, it wasn't a bad memory
socket. I was able to eliminate that.)
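
For anyone wanting to try the same kind of test, here is a minimal sketch
of the loop in Python. It is not taken from the linked posts: the function
name stress_loop is hypothetical, and the 'make clean' / 'make -j NN'
commands in the usage comment are assumptions about a typical kernel tree.
It just reruns a command and records which passes exited non-zero. On bad
RAM the symptom may instead be a spontaneous reboot, which no script can
report, so treat this as an aid and not a diagnosis.

```python
import subprocess


def stress_loop(cmd, iterations):
    """Run cmd `iterations` times; return the 1-based numbers of failed passes.

    A pass counts as failed when the command exits with a non-zero status.
    Output is discarded, since only the success/failure pattern matters here.
    """
    failures = []
    for i in range(1, iterations + 1):
        result = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        if result.returncode != 0:
            failures.append(i)
    return failures


# Hypothetical usage from the top of a kernel source tree (assumed commands;
# pick NN high enough that the parallel compiles touch all of the RAM):
#
#   for n in range(10):
#       subprocess.run(["make", "clean"])
#       print(n, stress_loop(["make", "-j", "16"], 1))
```

Swapping RAM sticks between runs and watching whether the failures follow
a particular stick is then the 'logic' part of the exercise.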


There are also far more worrisome causes of filesystem corruption than
bad RAM, not even counting software problems. My one-time colleague Ted
Ts'o once wrote an excellent piece, which I can't find at the moment,
about how the ext2/ext3 code had necessarily been written with a defensive
attitude, to compensate to the maximum possible extent for the ways
commodity PeeCee hardware tends to misbehave, e.g., the way cheap HBAs
often inadvertently write random garbage for a brief while in the
process of losing power when the system gets shut off. Ts'o observed
that this risk model from commodity hardware didn't exist on, e.g., SGI
hardware built to run IRIX, so the XFS filesystem code on IRIX didn't
need to protect against that form of loss, while ext2/ext3 did.

(It may be that XFS got improved in exactly that area in the years since
the Linux port. I haven't used it since I ran Debian on it during
2001-2.)

Further food for thought:
https://nctritech.wordpress.com/2017/03/07/zfs-wont-save-you-fancy-filesystem-fanatics-need-to-get-a-clue-about-bit-rot-and-raid-5/

-- 
Cheers,            There are only 10 types of people in this world -- 
Rick Moen          those who understand binary arithmetic and those who don't.
rick@???
McQ!  (4x80)