Wednesday, May 28, 2008

Data Center Power Downs and DR

If you have been in IT for very long, you have experienced a Data Center power down. Of course, the smaller the shop, the less of a headache it should be. I've been around about four of them; one was when the transformer blew up downtown and took the then "inflow" data center with it. The company I worked at then just had to wait, because the Fire Marshal was not smart enough to understand the complete disconnect between that transformer and the internal systems, so they couldn't power up. So far, every time one happens it's like a "Chinese fire drill" for the IT staff: network guys, systems guys, apps guys (and gals, of course) all running around trying to get things up in an orderly fashion. I've recently had the "opportunity" to go through another one.
One of the most impressive things I've heard came from our CIO, who emphatically said, "email is NOT a mission critical application." CHEERS!!
However, I think it's time every IT shop got off its seat cushions, put a stop to every other project, and got one thing crystal clear: what do I bring up first, and how do I do that? You'd think this would be easy, but it's not. Every app support person thinks, and wants everyone around them to think, that their app is the most important one. Well, frankly, it's not. Every user wants the apps he or she uses to be the ones that come up first. Sorry, but it can't be. The problem is that there are few CIOs with the intestinal fortitude to do what our CIO did recently. Most will cave to the executives who demand their services come back up first. What a CIO should do is demand that the realistically most important systems come up first. There is, in a sense, a natural order of things. First, infrastructure: network switches, phone switches, WAN connections, SAN storage, etc. You know, the kinds of things your app HAS to have to function at all. Second, apps and services that really are mission critical. Third, everything else as you get time.
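To make that order concrete, here's a minimal sketch of what a bring-up plan could look like. The tier contents are hypothetical examples, not anybody's real inventory:

```python
# Hypothetical bring-up plan: tiers are brought online strictly in order.
# The system names are illustrative placeholders, not a real inventory.
BRINGUP_TIERS = [
    ("1 - Infrastructure", ["core switches", "phone switch", "WAN links", "SAN storage"]),
    ("2 - Mission critical", ["surgery system", "patient data system"]),
    ("3 - Everything else", ["email", "intranet", "reporting"]),
]

def bring_up():
    for tier, systems in BRINGUP_TIERS:
        print(f"Tier {tier}")
        for system in systems:
            # In a real plan each of these would be a documented, tested procedure.
            print(f"  bring up and verify: {system}")

if __name__ == "__main__":
    bring_up()
```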
The problem comes in at about step two: what is "mission critical," and what does that mean? Well, here's how I see it. In health care that question is answered by asking one question: what do I need online to protect the health and well-being of existing patients? What systems directly involve patient care? And no, email is NOT one of them. We have surgery systems, patient data systems, etc., and, refreshingly, that is what our CIO insisted come up first. Once the process for getting those online was being followed, all other hands could be involved in getting other services online. The biggest problem I see, even in this last one, is that no one has that laid out in printed form anywhere. Every system should be tagged with a priority level for DR purposes. I mean, if two systems are down at the same time and you've only got one guy who can work on them, which does he tackle first? I was actually pulled off a clinical system by a certain Director once (at the request of the CIO) because email wasn't working. DOH!
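Here's one way that tagging could look in practice, a rough sketch with made-up system names and priorities, just to show that the "which one first" question gets answered by the data, not by whoever yells the loudest:

```python
# Hypothetical inventory: every system carries an explicit DR priority
# (1 = restore first). Names and numbers are made up for illustration.
DR_PRIORITY = {
    "patient data system": 1,
    "surgery scheduling": 1,
    "pharmacy": 2,
    "email": 3,
    "intranet": 4,
}

def next_to_restore(down_systems):
    """Given the systems currently down, return the one to work on first."""
    return min(down_systems, key=lambda name: DR_PRIORITY.get(name, 99))

# One admin, two outages: the tag answers the question, not the loudest caller.
print(next_to_restore(["email", "patient data system"]))  # -> patient data system
```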
DR really is a major pain in everyone's behind, not because it's hard in principle, but because it's hard in practice. In principle it's easy to lay out DR: define the priority of your systems, and document how to recover those systems. Simple, right? Simple quickly goes out the window when you start talking about a hot DR data center, storage replication, backups and recovery, network topology and architecture, etc., etc. Basic DR is simple; real DR is a complicated web of decisions and documentation. Basic DR is when the lights go out at your DC. Real DR is when they're not coming back on because the building is destroyed. The question is how much your business is worth, and whether it's worth putting the time, money, and effort into preserving it.
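And for the "document how to recover those systems" half, even a bare-bones per-system runbook entry beats nothing. A sketch of what the fields might be, with placeholder values:

```python
from dataclasses import dataclass, field

@dataclass
class RunbookEntry:
    """Skeleton of a per-system DR runbook entry; every value below is a placeholder."""
    system: str
    dr_priority: int                     # 1 = bring up first
    depends_on: list = field(default_factory=list)
    recovery_steps: list = field(default_factory=list)
    owner: str = "unassigned"

example = RunbookEntry(
    system="patient data system",
    dr_priority=1,
    depends_on=["SAN storage", "core network"],
    recovery_steps=[
        "Verify SAN volumes are presented and mounted",
        "Start the database and check its restore/replication state",
        "Start application services and run smoke tests",
    ],
    owner="on-call systems admin",
)
print(example)
```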
I hate doing documentation, and some companies go way overboard burdening their staff with it, but done right, it can save your bacon.
