RAID-5 Array Not Responding Due to One Dropped Drive

An interesting thing happened on the way to backing up the VM server this weekend. And by interesting I mean mindbendingly horrible.

So I’m pulling down the array for a backup. Not that odd in and of itself. And then I start getting spurious IRQs (something is spamming the crap out of the line) that knock one of the HDDs out of the array and freeze the machine.

OKAY. That’s why we have a RAID-5 array in the first place.

When I bring the machine back up it chokes with errors on the RAID initialization. Apparently it doesn’t think there are enough drives left in the array to bring up the primary /dev/md1 array that holds all our precious data. Excellent, I love spending a weekend de-mucking dead servers :/.
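If you want to see what md actually thinks at this point, the quickest look is /proc/mdstat and the kernel log. A rough sketch (md1 is just my array name; substitute your own):

    # Show which arrays the kernel knows about and their member status
    cat /proc/mdstat

    # Dig the md errors out of the boot messages
    dmesg | grep md1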

Naturally I don’t want to compound issues, so I pull a backup of each of the bloody 300+ GB drives to a recently verified good (fresh off its third RMA.. hardware incompatibility rather than mechanical flaws) 640GB backup drive. This takes roughly a day.. but it’s worth it if the drives die in the middle of a recovery effort.
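If you’re doing the same thing, a raw dd per member drive is the simple way to go. Something along these lines, with the device name and backup mount as placeholders for whatever yours actually are:

    # Raw image of one array member onto the backup drive; repeat for each member.
    # conv=noerror,sync keeps going past read errors and pads the bad blocks.
    dd if=/dev/sdb of=/mnt/backup/sdb.img bs=1M conv=noerror,sync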

I’ve got everything pretty straight data-wise, no real fear of doing worse damage at this point. Cracking open mdadm to run an --examine on the array members reveals something a bit weird though. The drives, two of them anyway, show that they’re still okay…
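For reference, --examine reads the md superblock off each member drive individually, so it works even when the array itself won’t assemble. A sketch (drive names are placeholders again):

    # Ask each member drive what it thinks the state of the array is
    mdadm --examine /dev/sdb1
    mdadm --examine /dev/sdc1
    mdadm --examine /dev/sdd1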

Turns out that the system hit the third drive in the array first, saw that it reported a failure of the entire array, and went no further. The other two drives in the array report as working fine.. and do. I did an --assemble sans the funky drive and the array came right up for me to pull a quick backup.
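For anyone trying the same trick, assembling from just the good members looks roughly like this (device names are placeholders; a two-of-three RAID-5 comes up degraded, and you may need --run or --force if the failed drive didn’t drop out cleanly):

    # Bring the array up from the two healthy members only, running degraded
    mdadm --assemble /dev/md1 /dev/sdb1 /dev/sdc1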

Now I’ll just re-add the “dead” drive to the array and have it rebuild once the backup is finished.
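The re-add itself is a one-liner, plus keeping an eye on the resync (same placeholder device names as above):

    # Add the "dead" drive back in; md will start rebuilding parity onto it
    mdadm --add /dev/md1 /dev/sdd1

    # Watch the rebuild progress
    cat /proc/mdstat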

So, if you’re staring at an array that won’t come up, take a closer look at the mdadm output to make sure it isn’t just hanging on a single debilitated drive. I’d never seen this happen before, but restoring a single drive sure beats restoring from backup media.