This is one of those lessons that I hate having to relearn, but it’s a valuable one. You can take this one to the bank.
Remember to reseat your system’s cables, drives and any other connections that can be disconnected.
When you’ve checked all your event logs and the technical problem is clearly hardware-related, but you can’t find it… take a few minutes to power down, open it up, then disconnect and reconnect everything.
I was reminded of this lesson yesterday after battling a failing logical drive on the backup system’s array controller (an HP Smart Array 6400).
I’ve used HP array controllers for decades and never run into this particular problem. They just work. And in the rare instances when they fail, I’ve always been able to simply swap in another array controller from the same family and have the system back up in no time.
Now, this particular system had been running just fine for months, until last Wednesday when it started “acting up.” By that I mean that our backup server’s operating system started complaining about array errors, though at that point the system still performed as expected. The controller claimed to have found a bad drive, disabled it, and spun up the hot-spare drive for the mirror. I installed a replacement drive (which I knew to be good) in the array, and after a few moments it too was marked bad. HUH? Even more stunning was the fact that data on the array had actually been corrupted, and the corruption was detected by the operating system, even though the hardware controller still had the logical drive up, running and usable. (Generally speaking, a good array controller like this one will fail the logical drive before data corruption occurs.)
It certainly is possible for two drives in a row to go bad, but it’s highly unlikely. I swapped in yet another drive, and it too was marked bad after a few minutes. That’s no longer just a coincidence! I spent more time debugging: moving data to another drive, recreating logical drives, etc. etc. etc., all of which ended up being useless. The error logs in the operating system and on the controller were misleading; all they reported was that the drives were bad. Finally, on Monday, the server gave up and marked EVERY logical drive on the controller as bad. Whoops! Now what?
Well, it was my colleague who suggested, “maybe it’s time to reseat.”
Fifteen minutes later, after opening the server, reseating the controller, and reseating every hard drive in the external bays, we powered the server back up. Then, after logging in and using HP’s Array Configuration Utility to mark the “bad” logical drives as usable… we were greeted by a happy server with happy logical drives. No data lost.
Friends… save yourself some time! Don’t beat your head against the wall for days while replacing perfectly good hardware. If your system can be powered down for a few minutes, give it a shot. Unplug your system, open it up, and reseat every connection: power cables, data cables, memory modules, cooling fans, I/O boards, hard drives, etc.
Even if the system is bolted to a slab of concrete and seems completely immobile and steady, thermal expansion can cause connections to wiggle free. Even temporary heavy loads on a system can raise temperatures enough to expand components, pulling connectors apart just enough to break a connection or, worse, to leave an indeterminate signal level between components, which can really screw things up.
This story should also serve as a good reminder that it’s always good to let another set of eyes in on your problem, because you may be so focused on a particular solution path that you miss something simple.