Every time an individual comes in contact with an information system, a "digital business event" takes place that creates an electronic record. In fact, much of the annual growth in corporate data volume (estimated by META Group to be 125 percent per year, compounded annually) can be attributed to digital business events created by the proliferation of e-commerce applications, automated business processes based on ERP (enterprise resource planning) or CRM (customer relationship management) systems, and information networks. The additional demands of radio frequency identification (RFID) data, Web clickstream logs, regulatory compliance information and 3G cellular phone/data call records further highlight the efficient management and analysis of this business data as an important competitive advantage.
These digital business events have the following three elements in common:
- They never change once they are created, becoming a permanent history that some type of event has taken place.
- They traditionally occur at high transaction rates and, therefore, generate a large volume of information that needs to be managed.
- They frequently must be saved only for historical reference purposes (e.g., troubleshooting, regulatory or compliance needs), but must be retrieved quickly when needed.
The staggering data management requirements looming on the horizon are put into context when one considers the imminent data volume and data retention demands of major retailers (e.g., Wal-Mart's famous RFID deadlines) and regulatory bodies (e.g., the Securities and Exchange Commission). These demands, multiplied by the sheer volume of digital business events that occur every minute, can quickly drive storage requirements into the petabyte (one thousand terabytes) and even exabyte (one million terabytes) range. At the extreme, Wal-Mart could conceivably generate seven exabytes every day if its inventory were tagged at the per-item level, according to Jim Crawford of management consultancy Retail Forward. Similarly, Sarbanes-Oxley compliance requirements could result in 1.6 exabytes per year per company by 2006.
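To give a feel for how quickly compounding reaches these ranges, the following sketch applies the 125 percent annual growth estimate quoted at the start of this article. It assumes that "125 percent growth" means volume multiplies by 2.25 each year, and the 10-terabyte starting point is purely hypothetical.

```python
import math

GROWTH_RATE = 1.25        # 125 percent annual growth (assumed to mean x2.25 per year)
TB_PER_PB = 1_000         # one petabyte = one thousand terabytes
TB_PER_EB = 1_000_000     # one exabyte = one million terabytes


def years_to_reach(start_tb, target_tb, rate=GROWTH_RATE):
    """Whole years of compounding needed for start_tb to reach target_tb."""
    if start_tb >= target_tb:
        return 0
    return math.ceil(math.log(target_tb / start_tb) / math.log(1 + rate))


START_TB = 10  # hypothetical starting volume, not a figure from the article
print(years_to_reach(START_TB, TB_PER_PB))  # years until the petabyte range
print(years_to_reach(START_TB, TB_PER_EB))  # years until the exabyte range
```

Under these assumptions, a 10-terabyte store crosses the petabyte mark within six years and the exabyte mark within fifteen.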
Over the past fifteen to twenty years, the prevailing information management philosophy has accepted that a relational database management system (RDBMS) is the technology of choice whenever data must be available for rapid retrieval and access. This thinking has long suited the industry, as the RDBMS increasingly became the foundation for capturing, storing and managing complex, frequently changing data that is simultaneously accessed by a large number of concurrent users. Digital business event data, however, doesn't match this information profile in virtually any respect, and so it is now time to revisit whether all business information needs to reside in a relational database.
To be clear, the RDBMS provides a safe, secure and functional repository for the data it stores and has remained the conventional solution for data storage and management. However, these relational databases come with an expansive set of functionality that far exceeds what's needed for simply storing and occasionally accessing write-once/read-rarely digital business event data. (It's akin to owning a state-of-the-art PC and only using it for e-mail.)
Matching the Tool to the Task
An RDBMS is assumed to be essential for fast retrieval of data because of one core service it provides: the indexing of data. With an index, an individual piece or pieces of data can be found and retrieved quickly without having to perform a time-consuming sequential search of the entire database. Moreover, data compression becomes almost mandatory with very large data sets, which makes sequential scanning even more time-consuming: each file has to be decompressed before every speculative touch just to see whether any useful data lies within it.
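The difference between the two access paths can be illustrated with a minimal sketch. It assumes records are stored as JSON in zlib-compressed chunks held in memory; the record layout and the function names are hypothetical and chosen only for illustration. Each lookup reports how many chunks it had to decompress.

```python
import json
import zlib


def build_chunks(records, chunk_size):
    """Compress records in fixed-size chunks and build a key -> chunk index."""
    chunks, index = [], {}
    for start in range(0, len(records), chunk_size):
        batch = records[start:start + chunk_size]
        for rec in batch:
            index[rec["id"]] = len(chunks)  # which chunk holds this record
        chunks.append(zlib.compress(json.dumps(batch).encode("utf-8")))
    return chunks, index


def scan_lookup(chunks, key):
    """Sequential search: decompress chunk after chunk until the key turns up."""
    decompressed = 0
    for blob in chunks:
        decompressed += 1
        for rec in json.loads(zlib.decompress(blob)):
            if rec["id"] == key:
                return rec, decompressed
    return None, decompressed


def indexed_lookup(chunks, index, key):
    """Indexed search: decompress only the chunk the index points at."""
    chunk_no = index.get(key)
    if chunk_no is None:
        return None, 0
    for rec in json.loads(zlib.decompress(chunks[chunk_no])):
        if rec["id"] == key:
            return rec, 1
    return None, 1


records = [{"id": i, "event": "call", "seconds": i % 90} for i in range(10_000)]
chunks, index = build_chunks(records, chunk_size=500)
print(scan_lookup(chunks, 9_999)[1])            # chunks decompressed by the scan
print(indexed_lookup(chunks, index, 9_999)[1])  # chunks decompressed via the index
```

In this toy data set the scan decompresses all 20 chunks to reach the last record, while the indexed lookup decompresses one.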
The immutable nature of digital business event data simply does not need all of the transactional, integrity and backup safeguards used in the typical operational systems for which commercial RDBMS products are optimized. This requirement/solution mismatch comes with a substantial financial cost, because RDBMS licenses can quickly reach seven figures. While some organizations, such as financial institutions, are more able than others to spend their way out of any data management problem, there are few who can argue against saving IT dollars by storing large volumes of digital business event data in the most appropriate place, at the most cost-effective price.
It is a little-understood fact that large databases are not inherently slow because of the amount of data they contain, but rather because the very large index files needed to support fast access require significant effort to build and manage. The ramifications of this indexing performance bottleneck are seen today in virtually all companies with large database applications: the applications are routinely taken offline for business users so that the data can be loaded and the indexes rebuilt on a nightly or weekly basis. This is a symptom of data overload, because loading and indexing large amounts of data while simultaneously supporting business users is simply not practical.
Even if an organization has the money available to pay for the hardware, database and application licenses, and the staff required to store all digital business event data in an RDBMS, it would still encounter the performance bottleneck of trying to index large volumes of digital business event data as it is added. As discussed previously, growing data volumes would lead to increasing amounts of application downtime, the impact of which would be seen in lost revenue, increased costs and unmet customer expectations.
Relational databases also resort to partitioning as an approach to managing very large volumes, whereby large volumes are subdivided into smaller, more manageable subsets. While this can work well at reasonable volumes, the partitions themselves can become prolific at extreme levels and create a problem in their own right at query time. Prolific partitioning works well to speed data acquisition by keeping busy tables and indexes small to avoid any performance degradation, but queries that span partitions have to query multiple, if not all, partitions. This becomes a serious issue when there are thousands of partitions, even if you have hundreds of spindles and a raft of CPUs. Partitioning is a partial solution to the burgeoning volume of data, but has limited scalability that falls short of the amount of digital business event data that is expected in future years.
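The query-time cost of prolific partitioning can be sketched in a few lines. The example assumes a hypothetical scheme of one partition per day and a toy `query` function. It does not model any particular RDBMS, only the point that a query which does not constrain the partition key must probe every partition.

```python
from datetime import date, timedelta


def build_partitions(start, days):
    """One partition per day, keyed by date; each holds that day's rows."""
    return {start + timedelta(days=n): [] for n in range(days)}


def query(partitions, customer_id, first=None, last=None):
    """Return matching rows and the number of partitions probed."""
    probed, hits = 0, []
    for day, rows in partitions.items():
        # Prune only when the query constrains the partition key (the date).
        if first is not None and day < first:
            continue
        if last is not None and day > last:
            continue
        probed += 1
        hits.extend(r for r in rows if r["customer"] == customer_id)
    return hits, probed


# Roughly three years of daily partitions with a single row of interest.
partitions = build_partitions(date(2004, 1, 1), days=1095)
partitions[date(2005, 6, 15)].append({"customer": 42, "amount": 19.99})

# With a date range, only the partitions in that week are probed.
print(query(partitions, 42, date(2005, 6, 10), date(2005, 6, 16))[1])
# Without one, every partition has to be probed to find the same row.
print(query(partitions, 42)[1])
```

Both queries return the same row, but the second touches all 1,095 partitions instead of seven.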
Rediscovering the Flat File
Loading vast quantities of static data into a relational database -- just to support the possibility of future access in response to auditing, regulatory or business process troubleshooting -- does not make much technical or commercial sense when RDBMS licenses are so expensive and the rich functionality provided is so mismatched to the character of digital business event data. Simply getting the large volumes of data into a relational database in a timely manner represents a significant technical challenge, and the high cost of the infrastructure needed to achieve the required performance cannot be justified from a business perspective. Companies need a fresh approach for dealing with the current amount of business-driven event data, and the problem is only going to get worse as data volumes continue to grow.
In an alternative approach, users can:
- Leave the digital business event records in compressed format inside flat files.
- Index the data directly within those flat files.
- Use indexing appropriate for large volumes of data and high acquisition rates.
- Provide SQL access directly into the flat files so existing applications and analytical tools can easily access the information in a familiar format. To the outside world, the data should look just like a table in a relational database.
This leaves the RDBMS for what it does best - handling dynamic data - and does not cripple it with a workload it is not designed to handle.
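The first three steps of this approach can be sketched as follows. This is a minimal illustration under stated assumptions, not a description of any particular product: the `EventFile` class, its JSON record layout and the `ts` field are hypothetical, and timestamps are assumed to be unique and to arrive in ascending order. Because event records never change, the index is sparse (one entry per compressed block) and is only ever appended to, so loading new data never forces a rebuild. The SQL access layer is beyond the scope of this sketch.

```python
import bisect
import gzip
import json
import os
import tempfile


class EventFile:
    """Append-only flat file of gzip blocks plus a sparse in-memory index."""

    def __init__(self, path, block_size=1000):
        self.path = path
        self.block_size = block_size
        self.first_keys = []  # first timestamp in each block, ascending
        self.offsets = []     # (byte offset, byte length) of each block
        self.pending = []     # events buffered until the next block is written
        self.end = 0          # current length of the flat file in bytes
        open(path, "wb").close()  # start with an empty file

    def append(self, event):
        """Buffer one event; events are assumed to arrive in timestamp order."""
        self.pending.append(event)
        if len(self.pending) >= self.block_size:
            self.flush()

    def flush(self):
        """Compress the buffered events into one block and index its first key."""
        if not self.pending:
            return
        blob = gzip.compress(json.dumps(self.pending).encode("utf-8"))
        with open(self.path, "ab") as f:
            f.write(blob)
        self.first_keys.append(self.pending[0]["ts"])
        self.offsets.append((self.end, len(blob)))
        self.end += len(blob)
        self.pending = []

    def find(self, ts):
        """Return flushed events with timestamp ts, decompressing one block."""
        block = bisect.bisect_right(self.first_keys, ts) - 1
        if block < 0:
            return []
        offset, length = self.offsets[block]
        with open(self.path, "rb") as f:
            f.seek(offset)
            events = json.loads(gzip.decompress(f.read(length)))
        return [e for e in events if e["ts"] == ts]


path = os.path.join(tempfile.mkdtemp(), "events.dat")
store = EventFile(path, block_size=1000)
for ts in range(5000):
    store.append({"ts": ts, "tag": f"item-{ts}"})
store.flush()
print(store.find(4321))  # reads and decompresses one block out of five
```

A lookup costs a binary search over the block keys plus one seek and one decompression, regardless of how many blocks the file holds. Only flushed events are searchable in this sketch, so `flush` should be called before reading.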
While many data-driven organizations move archive-bound data to tape, such an approach sacrifices rapid access in favor of the least expensive storage currently possible. Coupled with the ever-decreasing cost of disk-based storage, applying an index to a flat file results in a far more accessible archive that can respond quickly to an enterprise's trend analysis or reporting needs.
Data Center Reexamination Is Long Overdue
Nobody can question the wisdom of using relational databases as the foundation for an enterprise application (such as a call center), supporting thousands of customer service representatives who update tens of thousands of customer records per day. This is a classic use of the power of an RDBMS, and whatever it costs to scale the system to handle the volume of users and transactions, the business value of the information justifies the expense.
For digital business event data, the case is not nearly so clear. This is information that is generated in volumes that dwarf the call center example. (The volume of digital business events for cellular phone carriers is measured in hundreds of millions of transactions per day.) Attempting to scale an RDBMS to manage this information is inconceivable given how little of the data will actually ever be accessed again. For these uses, an RDBMS is simply too much functionality at too high a cost. The flat file is a far better storage vehicle for high volumes of digital business event data, with the key being the ability to rapidly index the information in the flat file for fast retrieval when it is needed.
The growing volume of business data and the high cost of relational database management systems have put both vendors and consumers at a crossroads. The convenience of using a single technology solution (the RDBMS) as the de facto data management solution has had a good run, but this "one size fits all" approach to data management cannot be justified from either a cost or business value perspective. By reevaluating the types of data businesses generate, it becomes clear that the ideal data center topology must be some optimal mix of relational technologies and high-performance flat-file databases directed at handling dynamic and static records, respectively. When record-intensive enterprises begin to assess their data centers in this fashion, cost and operational efficiencies over a pure relational approach will quickly present themselves.