Generally, when someone hears the term “disaster recovery,” the image that comes to mind is a smoking hole in the ground where the data center used to be. While that is certainly a valid concern, disasters of this type do not occur very often. Their importance in the mind of the data center manager should therefore be tempered by thoughts of the more likely disaster scenarios.

What are the more likely disasters that can occur in the data center? In no particular order:

  1. Mission-critical application server suffers a catastrophic failure, rendering it useless for days.
  2. Network failure results in the inability of a business to communicate with the outside world.
  3. Network intrusion results in data loss or application outage.
  4. Disgruntled employee destroys data.
  5. Infrastructure failure results in the loss of systems and/or data.

Any of these can have catastrophic results. Even small, relatively short-term outages can result in significant losses to a business. Coupled with uncertain economic times, some of these could actually result in business failure.

Why are these five scenarios (and many others) problematic for many data center managers? In one word: planning. As in most areas of data center management, good planning results in good outcomes. Generally, though, the first time most businesses think about what to do if they lose their email server for a couple of days is immediately after they lose their email server for a couple of days. Had they thought about what to do before the event, the response would undoubtedly be better.

Let’s think about the process for each of these five scenarios:

1. Email server fails.

Problem: The business is unable to communicate internally and externally. Purchase orders cannot be received or processed, etc. Long outages at key times are detrimental to a business.

Issues: With email dead, how else can the business continue to function? What is its plan? The business has to continue, and it will, but what series of actions must be taken to ensure that business carries on as normally as possible while the outage is repaired? Most plans focus on how to recover the failed server and do not take into account how to function without it. A plan to restore the server on the fly can be easily developed; where most businesses fail is in imagining how to exist without the failed application. Spend time imagining how to operate without it.

Solution: Restore the application as quickly as possible while using alternative email during the outage. No alternative email is available? That is a failure to imagine.

2. Network failure.

Problem: This is very similar to the critical application failure, but is more widespread, i.e., instead of a single application outage, all applications will suffer. With the proliferation of IP-based phone systems, telecommunication may be out as well.

Issues: Clearly, if this persists for long, business will suffer. Again, two things are going on: diligent work to fix the problem while business is also trying to continue. Did the business imagine how to function if nobody could communicate with its critical applications?

Solution: Have mission-critical applications accessible via some other network. Move the server outside the failed network to one that works. Move the application to some other server.

3. Network intrusion.

Problem: A virus has invaded the data center and rendered some applications unusable and others extremely suspect. To contain the problem, the business turns off external communications and begins to restore servers and applications to known good states. From the business's point of view, the result is identical to scenarios 1 and 2 above.

Issues: Same as 1 and 2.

Solution: Create a what-if scenario and imagine how to function.

4. Disgruntled employee.

Problem: This manifests itself in at least two ways: outright, catastrophic destruction of data, or a Trojan horse or other subtle means of data destruction. Simple deletion of data is relatively easy to recover from.

Issues: Similar to 1, above.

Solution: The subtle problems a key disgruntled employee can create are much more difficult to imagine solving. Of all the potential disasters, this one could be the worst. Planning for it requires an especially thorough imagination.

5. Infrastructure failure.

Problem: The air conditioning system has failed on Super Bowl Sunday. Critical servers have overheated and either failed or automatically shut down. Monday dawns and the business is in rebuild mode. This problem is similar to 1, but perhaps more widespread like 2.

Issues: The business must assess whether it can simply open the doors, set up some fans and turn the servers back on, or whether something suffered more serious damage.

Solution: Imagine all the possible scenarios.

All of these issues have the same fundamental solution. Businesses must work on two things:

  1. Solve the problem. Restore the application, network, power, A/C, etc.
  2. Continue to operate while solving the problem. In some cases this means moving to a place without the problem.

Good disaster recovery planning works on both of these problems. They are two sides of the same coin. Both must be addressed, especially the need to continue to operate. It does not really matter what the disaster or failure is. What matters is what is done about it. A structured approach must be used to imagine what failures are likely and what the business's response will be, in both restoration and continuation.

For most data center managers, this is really only one problem with multiple symptoms. All of the scenarios outlined above are quite similar. They result in the inability to conduct business. The solution to each of the problems is similar as well: fix the problem while accomplishing business continuance without the failed system. It is the second part that confounds many data center managers.

Planning is the answer. Good data center management has businesses imagining problems that might occur long before they actually happen and developing a specific plan for each. Every plan must give a thoroughly considered answer to two questions:

  1. What steps must be taken to fix the problem?
  2. What must be done in order to continue business while the problem is being fixed? This part of the plan should assume the worst timeline for step 1.
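One way to keep both questions in front of the planner is to write each plan down as a record and check it for gaps. The following is only a minimal sketch in Python; the `RecoveryPlan` class, its field names and the hour estimates are hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class RecoveryPlan:
    """One plan per imagined failure. All fields are illustrative assumptions."""
    scenario: str
    restore_steps: list[str]              # question 1: how to fix the problem
    continuity_steps: list[str]           # question 2: how to keep operating meanwhile
    worst_case_restore_hours: float       # pessimistic estimate for question 1
    workaround_sustainable_hours: float   # how long the workaround can carry the business

    def gaps(self) -> list[str]:
        """Return planning gaps; an empty list means both questions are answered."""
        problems = []
        if not self.restore_steps:
            problems.append("no restoration steps")
        if not self.continuity_steps:
            problems.append("no continuity steps")
        # The continuity side should assume the worst timeline for the fix.
        if self.workaround_sustainable_hours < self.worst_case_restore_hours:
            problems.append("workaround does not cover worst-case restore time")
        return problems


# A typical plan: restoration is covered, continuation was never imagined.
email_plan = RecoveryPlan(
    scenario="email server fails",
    restore_steps=["rebuild server", "restore mailboxes from backup"],
    continuity_steps=[],
    worst_case_restore_hours=72,
    workaround_sustainable_hours=0,
)
print(email_plan.gaps())
```

A plan that lists only restoration steps, as most do, shows up here with two gaps: no continuity steps, and a workaround that cannot last as long as the worst-case repair.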

Most data center managers are pretty good at fixing the problem, especially if given enough time, and even if they have never done it before. Failure more often comes in quickly deploying the resources and processes needed to continue the business while the problem is being solved.

The solution, then, is to realistically consider what disasters the business is likely to face. Categorize these based on their likelihood and develop a comprehensive plan to recover from each failure, along with a plan to continue doing business during the recovery. The irony is that if you think about the smoking hole, it is clear you must move in order to survive. However, when you have one of the small disasters, moving does not occur to you. It might, though, if you think about what is likely to occur and imagine your response.
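Categorizing by likelihood can be as simple as a ranked list. The sketch below is one possible way to do it in Python, not a prescribed method; the yearly frequency estimates and the category thresholds are invented placeholders, and every business would need to substitute its own.

```python
# Placeholder estimates of how often each scenario might occur per year.
# These numbers are invented for illustration only; replace them with your own.
ESTIMATED_EVENTS_PER_YEAR = {
    "application server failure": 0.5,
    "network failure": 0.3,
    "network intrusion": 0.2,
    "infrastructure failure": 0.1,
    "disgruntled employee": 0.05,
    "smoking hole": 0.001,
}


def categorize(events_per_year: float) -> str:
    """Bucket a scenario by likelihood. The thresholds are arbitrary assumptions."""
    if events_per_year >= 0.2:
        return "likely"
    if events_per_year >= 0.05:
        return "possible"
    return "rare"


def planning_order(estimates: dict[str, float]) -> list[tuple[str, str]]:
    """Return (scenario, category) pairs, most likely scenario first."""
    ranked = sorted(estimates.items(), key=lambda item: item[1], reverse=True)
    return [(name, categorize(rate)) for name, rate in ranked]


for name, category in planning_order(ESTIMATED_EVENTS_PER_YEAR):
    print(f"{category:>8}: {name}")
```

Working down the list from the top keeps planning effort on the failures most likely to happen, with the smoking hole still on the list but at the bottom.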
