This is the first article in a three-part series on downtime.
Meet Bill Schmidt. He's the CEO of a metal-fabricating business in suburban Cleveland. Bill runs a pretty smooth operation: good sales reps, dependable suppliers, dedicated people on the shop floor and friendly customer service. A few years ago, he computerized the operation, and now a capable IT staff keeps a front-end Windows server running his e-business applications while an IBM iSeries server anchors the operation with general ledger, inventory and shop-floor control.
Bill knows the business and can tell you almost anything about his company. When it comes to specific costs of doing business, Bill really knows his stuff. Just ask him about raw-material costs for the last five years, or his monthly utility bill, or how much he spent on the big direct-mail promotion in the first fiscal quarter -- he has the answer. Bill carries a sharp pencil when it comes to everyday costs of business.
Success means things are happening fast for Bill. That investment in computer systems a few years back has really started to pay off. Online inventory, order processing and real-time reporting have made all the difference in winning some pretty big deals.
But the increasing reliance on information has put greater demands on system applications and data; the company can no longer function without its systems. Yet, as with many systems, offline maintenance, upgrades and daily backups require those systems to be unavailable to users. This kind of downtime is painting Bill into a corner: he can't be online and offline at the same time, yet both are required.
Bill understands that system loss, even for short periods of time, is surely costing him something, but despite all his business acumen, he really does not know what. Nor does he have any idea what an unexpected, complete system failure would mean to his business in real dollars. Could a natural disaster put him completely out of business?
Bill's uneasiness about downtime is not unique, but dealing with the issue starts with understanding the issue.
Downtime -- a general condition wherein users cannot use or access computing systems, applications, data or information for a broad variety of reasons -- is a fact of life in computing. Unfortunately, the costs and impact of downtime are generally underestimated and misunderstood because many business executives assume that a relatively high percentage of system reliability makes discussion of downtime a moot point. But system reliability is not information availability, and therein lies the heart of the issue. A computer can be efficiently running an end-of-day batch job or an operating-system upgrade and yet be "down" to users. The machine would be functioning normally, processing information with all disks spinning and CPU cycles running; nevertheless, no users could connect to the system because the CPU and disks would be dedicated exclusively to the process at hand. The system would be reliable, yes. But it would not be available.
There are two types of downtime:
Planned downtime is scheduled, usually duration-fixed loss of computing-system usage due to operations (such as database backups), maintenance (such as database file modifications or application work) and periodic events (such as hardware/software/operating-system upgrades or disaster recovery testing). Planned downtime exceeding scheduled time slots may transition into unplanned downtime.
Unplanned downtime is unanticipated, duration-variable loss of computing-system usage due to natural disasters, power outages, hardware or software failures and human error/intervention.
Planned downtime accounts for more than 80 percent of all system unavailability, but unplanned downtime, at less than 20 percent, has an inherent shock factor that strikes an emotional chord and causes pre-event anxiety. Per incident, unplanned downtime may well be more damaging to an enterprise than planned downtime. But remember: however big its hit, unplanned downtime occurs relatively rarely, while planned downtime recurs constantly. Over the course of a few years, the repetitive expense of planned downtime may outweigh the relatively infrequent expense of unplanned downtime. It all boils down to scenarios that are specific to individual businesses.
Nevertheless, here is a fact of business that should influence all of our thinking:
Planned downtime must occur and unplanned downtime will occur.
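To make the frequency argument concrete, here is a minimal sketch in Python. Every number in it is hypothetical -- event counts, durations and the hourly cost are illustrative assumptions, not figures from any study -- but it shows how frequent, short planned outages can accumulate a larger annual bill than rare, long unplanned ones.

```python
# All figures below are hypothetical, for illustration only.
PLANNED_EVENTS_PER_YEAR = 52      # e.g., a weekly backup window
PLANNED_HOURS_PER_EVENT = 2
UNPLANNED_EVENTS_PER_YEAR = 2     # e.g., two outages a year
UNPLANNED_HOURS_PER_EVENT = 8
COST_PER_HOUR = 10_000            # assumed hourly cost of downtime, in dollars

planned = PLANNED_EVENTS_PER_YEAR * PLANNED_HOURS_PER_EVENT * COST_PER_HOUR
unplanned = UNPLANNED_EVENTS_PER_YEAR * UNPLANNED_HOURS_PER_EVENT * COST_PER_HOUR

print(f"Planned downtime cost:   ${planned:,}")    # $1,040,000
print(f"Unplanned downtime cost: ${unplanned:,}")  # $160,000
```

With these assumed inputs, the routine weekly window costs more than six times what the two dramatic outages do -- the arithmetic behind the "repetitive expense" point above.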
Figure 1: Activity Breakdown of Planned Downtime
Let's look at Figure 1, which shows the activity breakdown of planned downtime. Backup activities, at nearly 60 percent, dominate. Why? Very likely because business growth has pushed many server environments toward 24-hour utilization, shrinking or even eliminating the nightly window IT personnel have typically used to capture daily transaction updates and other valuable data to tape as the basis for a system restore after an unplanned event. Furthermore, the 60 percent slice suggests that many businesses still operate in single-server environments, or lack replication software that maintains a real-time mirrored image of production on a switch-ready backup server or logical partition of the production server. With such a mirror in place, backups can be taken anytime from one server (or partition) while business continues uninterrupted on the other.
Figure 2: Activity Breakdown of Unplanned Downtime
Now, let's also look at Figure 2 on unplanned downtime. There is a surprise. Human error is the single largest reason for unplanned downtime! Trends indicate that employee errors (the human factor), as a percentage of all errors, have been rising steadily over the years. Today, industry analysts estimate that as much as 60 percent of all errors are attributable to human error.
The operations-overruns component of Figure 2 is a measure of jobs or events running outside their planned window. This component has grown as a result of the increased demand for online access and the concomitant growth in the number of transactions processed.
Increased product reliability, and the skill of IT personnel in identifying potential faults before they cause impact, are reflected in the relatively low percentage of unplanned downtime due to hardware failure. Hardware failures have been in steady decline for the last 15 years. Application (software) failure, though also in decline, remains a fairly significant concern in the unplanned-downtime picture, and as viruses have become a relatively common irritant to IT managers, the nature of software errors has changed.
Disasters, on the other hand, contribute only a small percentage to the total problem of unplanned downtime. But we're conditioned by media headlines, so floods, earthquakes, fires and wind storms elicit a strong emotional response. Witness the fact that "unplanned downtime" is most commonly associated with the IT term "disaster recovery." This kind of conditioned thinking tends to draw attention and resources away from the less-sensational planned issues that constitute the bulk of downtime and, in many cases, hit the bottom line harder because of their frequency, as we have previously noted.
Figure 3: Downtime Exposure Based on Percentages of Availability
| Availability Target | Lost Time / Month | Lost Time / Year |
| --- | --- | --- |
| 95% | 36.0 hours | 432.0 hours (18 days) |
| 97% | 21.6 hours | 259.2 hours (10.8 days) |
| 98% | 14.4 hours | 172.8 hours (7.2 days) |
| 99% | 7.2 hours | 86.4 hours (3.6 days) |
| 99.5% | 3.6 hours | 43.2 hours (1.8 days) |
| 99.9% | 43 minutes | 8.6 hours |
| 99.99% | 4.3 minutes | 51.6 minutes |
| 99.999% | 26 seconds | 5.2 minutes |
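The figures above follow directly from the availability percentage. A short Python sketch reproduces them, assuming (as the table does) a 720-hour month of 30 days and an 8,640-hour year:

```python
# Convert an availability target into expected lost time, using the
# 720-hour month (30 days) and 8,640-hour year the table assumes.

def lost_time(availability_pct, period_hours):
    """Hours of downtime implied by an availability percentage."""
    return period_hours * (1 - availability_pct / 100.0)

for target in (95.0, 99.0, 99.9, 99.999):
    monthly = lost_time(target, 720)
    yearly = lost_time(target, 8640)
    print(f"{target}% -> {monthly:.2f} h/month, {yearly:.2f} h/year")
```

For example, 95 percent availability yields 36 lost hours a month, matching the first row, while "five nines" leaves under half a minute a month.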
Figure 4: Hourly Costs of Downtime
Doesn't apply to you? It really applies to all of us, and that is why understanding downtime is half the battle. Contingency Planning Research notes that the average cost of downtime over all industries weighs in at $80,000 per hour, and a leading analyst organization reports that in 2003 all U.S. businesses collectively lost an estimated $9 billion due to downtime. Armed with good working knowledge, however, the business executive and system manager can begin to assess the various options that mitigate system loss in accordance with calculated levels of downtime risk and anticipated ROI.
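Combining the lost-time arithmetic with the $80,000-per-hour cross-industry average cited above gives a back-of-the-envelope annual exposure. This is only a sketch -- the hourly figure is an average, and any real assessment should substitute a business-specific number:

```python
# Rough annual downtime exposure: lost hours per year times an hourly
# cost. $80,000/hour is the cross-industry average cited above;
# substitute your own business-specific estimate.

HOURLY_COST = 80_000  # dollars per hour of downtime

def annual_cost(availability_pct, hours_per_year=8640):
    lost_hours = hours_per_year * (1 - availability_pct / 100.0)
    return lost_hours * HOURLY_COST

# At 99.5% availability, 43.2 lost hours a year cost about $3.5 million.
print(f"${annual_cost(99.5):,.0f}")
```

Even a seemingly respectable 99.5 percent availability target implies millions of dollars of annual exposure at the average hourly rate -- which is why the availability percentages in Figure 3 deserve a hard look rather than a nod.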
Clearly, business and technology have become so intertwined that it is difficult to segregate one from the other. Moreover, the requirements for information as a foundation to competitive advantage have never been more pronounced, only increasing the need for real-time access to data and applications. Preeminent business authority Peter Drucker put it best when he said, "High availability and pervasive computing, once considered a strategic advantage, have become a tactical necessity for companies, compelling them to maintain 99.999 percent uptime." 1
He may be correct, but the fact remains that IT managers are still faced with the inevitable -- downtime is not a matter of if, but when.
In our next Managed Availability Memo, "Downtime: Understanding It is Half the Battle -- Part 2," we will look at the tangible, hard costs of system loss. We will also provide two step-by-step formulas -- with supportive explanation -- that will assist readers in calculating downtime costs associated with lost revenue and idled workers. Do not miss this highly valuable article!
DM Review Online readers who wish to study managed availability issues and technology in greater depth may subscribe to Vision Solution's Business Continuity Solution Series at www.visionsolutions.com/BCSS.
1. From "Zero Tolerance for Downtime." www.assureconsulting.com