Hiccups in the Linux RAID
In December, I made a RAID 1 network storage server by adding a cheap SATA controller and a couple of 500GB hard disks to chinacat. For the most part, it has worked great, but there is one hiccup.
First, the good news: the network storage server is working every bit as well as I had hoped, and everything that I'd hoped to do is falling into place. All of our media files are now consolidated in one place. All of our desktop systems are (finally!) being backed up. Plus, there are all the other groovy benefits of having a shared storage server in the house.
I started the benchmark series because I was concerned with the performance of Linux software RAID. My test results suggested that performance would be acceptable. My real world results underscore that. For a while, I even moved my home directory off of the primary hard drive onto the slower RAID 1 device. I ended up moving it back not because of performance, but to build a more reliable backup architecture.
But there is one problem: every couple of weeks the RAID loses a drive. What happens is that the drive fails and Linux is no longer able to talk to the device. When the drive fails, the log says something like:
Feb 24 08:42:11 chinacat kernel: [823505.900000] sd 3:0:0:0: [sdc] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK,SUGGEST_OK
Feb 24 08:42:11 chinacat kernel: [823505.900000] end_request: I/O error, dev sdc, sector 0
Feb 24 08:42:11 chinacat kernel: [823505.900000] sd 3:0:0:0: [sdc] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK,SUGGEST_OK
Feb 24 08:42:11 chinacat kernel: [823505.900000] end_request: I/O error, dev sdc, sector 0
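Since the same failure keeps recurring, it helps to tally which device the kernel is complaining about each time. Here is a minimal Python sketch that counts the "end_request: I/O error" lines per device. The function name count_io_errors and the regular expression are my own illustration, and the pattern assumes the log format shown above; other kernel versions word these messages differently.

```python
import re
from collections import Counter

# Assumed format, taken from the log excerpt above:
#   "... kernel: [823505.900000] end_request: I/O error, dev sdc, sector 0"
IO_ERROR = re.compile(r"end_request: I/O error, dev (\w+), sector (\d+)")


def count_io_errors(log_lines):
    """Return a Counter mapping device name -> number of I/O error lines."""
    counts = Counter()
    for line in log_lines:
        match = IO_ERROR.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# The four log lines from the failure on Feb 24.
SAMPLE_LOG = [
    "Feb 24 08:42:11 chinacat kernel: [823505.900000] sd 3:0:0:0: [sdc] "
    "Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK,SUGGEST_OK",
    "Feb 24 08:42:11 chinacat kernel: [823505.900000] end_request: "
    "I/O error, dev sdc, sector 0",
    "Feb 24 08:42:11 chinacat kernel: [823505.900000] sd 3:0:0:0: [sdc] "
    "Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK,SUGGEST_OK",
    "Feb 24 08:42:11 chinacat kernel: [823505.900000] end_request: "
    "I/O error, dev sdc, sector 0",
]

if __name__ == "__main__":
    for device, errors in count_io_errors(SAMPLE_LOG).items():
        print(device, errors)
```

Run over a few months of logs, a tally like this would show whether the errors stick to one drive or move between both.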
Of course, I'm blissfully unaware of the failure, thanks to the magic of RAID. The system switches the RAID over to single-drive degraded mode and just keeps running. The only reason I know there has been a disk failure is that the RAID system emails me:
To: root [at] unicom [dot] com
From: mdadm monitoring
Subject: Fail event on /dev/md0:chinacat

This is an automatically generated mail message from mdadm
running on chinacat

A Fail event had been detected on md device /dev/md0.

Faithfully yours, etc.

P.S. The /proc/mdstat file currently contains the following:

Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md0 : active raid1 sdb sdc(F)
      488386496 blocks [2/1] [U_]

unused devices:
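The interesting bits of that mdstat output are the "(F)" flag on the failed member and the "[2/1]" count, which says the array expects two devices but only one is active. As a rough sketch, here is how the same check could be done by hand in Python. The function degraded_arrays and both regular expressions are hypothetical helpers of mine, written against the layout in the email above; real /proc/mdstat output varies between kernel versions, so treat it as a starting point rather than a replacement for mdadm's own monitoring.

```python
import re

# Array header, e.g. "md0 : active raid1 sdb sdc(F)"
ARRAY = re.compile(r"^(md\d+)\s*:\s*(.*)$")
# Status, e.g. "488386496 blocks [2/1] [U_]" -> expected=2, active=1
STATUS = re.compile(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]")


def degraded_arrays(mdstat_text):
    """Return {array: [members flagged (F)]} for arrays missing a member."""
    degraded = {}
    current, failed = None, []
    for raw in mdstat_text.splitlines():
        line = raw.strip()
        header = ARRAY.match(line)
        if header:
            current = header.group(1)
            # Strip any "[n]" index and "(F)" flag to leave the bare name.
            failed = [re.sub(r"[\[(].*", "", tok)
                      for tok in header.group(2).split() if "(F)" in tok]
        status = STATUS.search(line)
        if current and status:
            expected, active = int(status.group(1)), int(status.group(2))
            if active < expected:
                degraded[current] = failed
    return degraded


# The mdstat contents quoted in the email above.
SAMPLE_MDSTAT = """\
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md0 : active raid1 sdb sdc(F)
      488386496 blocks [2/1] [U_]

unused devices:
"""

if __name__ == "__main__":
    # On a live system, pass open("/proc/mdstat").read() instead.
    print(degraded_arrays(SAMPLE_MDSTAT))
```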
The failures have been puzzling. I've been watching the S.M.A.R.T. system, and there have been no indications of drive problems.
The first time this happened I ran basic diagnostics on the drive, but no problems were found. I rebooted the system and it saw the drive fine. I told the RAID to rebuild the drive and it did. Everything worked absolutely fine—until a few weeks later when it happened again.
This situation repeated itself a couple of times. I was trying to think of some excuse to send the drive back to the manufacturer when something surprising happened: the other drive in the RAID failed in the same exact way (and rebuilt fine on reboot).
My suspicion has now turned from the drive to the controller (or to the Linux drivers for the controller). Back in December, I gave a glowing review of the SIIG SC-SAT212-S4 SATA controller. No longer. Until this issue is resolved, I'll have to withhold my recommendation for that controller. Mind you, I'm not blaming the controller yet. I'm just saying something ain't right with the setup.