[TUHS] Is it time to resurrect the original dsw (delete with switches)?

Arthur Krewat krewat at kilonet.net
Tue Aug 31 02:46:59 AEST 2021


On 8/30/2021 9:06 AM, Norman Wilson wrote:
> A key point is that the character of the errors they
> found suggests it's not just the disks one ought to worry
> about, but all the hardware and software (much of the latter
> inside disks and storage controllers and the like) in the
> storage stack.
I had a pair of Dell MD1000's, full of SATA drives (28 total), with the 
SATA/SAS interposers on the back of the drive. Was getting checksum 
errors in ZFS on a handful of the drives. Took the time to build a new 
array, on a Supermicro backplane, and no more errors with the exact same 
drives.

I'm theorizing it was either the interposers or the SAS
backplane/controllers in the MD1000. Without ZFS, who knows how
swiss-cheesy my data would be.

Not to mention the time I set up a Solaris x86 cluster zoned to a
Compellent and periodically would get one or two checksum errors in ZFS. 
This was the only cluster out of a handful that had issues, and only on 
that one filesystem. Of course, it was a production PeopleSoft Oracle 
database. I guess moving to a VMware Linux guest and XFS just swept the 
problem under the rug, but the hardware is not being reused so there's that.

> I had heard anecdotes long before (e.g. from Andrew Hume)
> suggesting silent data corruption had become prominent
> enough to matter, but this paper was the first real study
> I came across.
>
> I have used ZFS for my home file server for more than a
> decade; presently on an antique version of Solaris, but
> I hope to migrate to OpenZFS on a newer OS and hardware.
> So far as I can tell ZFS in old Solaris is quite stable
> and reliable.  As Ted has said, there are philosophical
> reasons why some prefer to avoid it, but if you don't
> subscribe to those it's a fine answer.
>
Been running Solaris 11.3 and ZFS for quite a few years now, at home. 
Before that, Solaris 10. I set up a home Red Hat 8 server w/ZoL
(0.8) earlier this year - so far, no issues, with 40+TB online. I have
various test servers with ZoL 2.0 on them, too.

I have so much online data that I use as the "live copy" - going back to
copies of my TOPS-10 stuff from the early '80s. Even though I have copious
amounts of LTO tape copies of this data, I won't go back to the "out of 
sight out of mind" mentality.

Trying to get customers to buy into that idea is another story.

art k.

PS: I refuse to use a workstation that doesn't use ECC RAM, either. I 
like swiss cheese on a sandwich. I don't like my (or my customers') data
emulating it.

