As more companies begin measuring their data stores in petabytes, the realities of data proliferation become very apparent. In fact, the proliferation of the data proliferation topic itself is out of control. It’s ironic that Microsoft Word doesn’t recognize petabyte as a real word, yet the corporation’s data stores have reached petabyte levels. Information is exploding, and it has become virtually impossible for companies to keep up.


Consider Wal-Mart. The chain has more than 6,000 stores, and some have almost a half-million SKUs each. You think your Excel spreadsheets from finance are bad? Wal-Mart’s database tables have literally 100 billion rows. The retailer’s POS systems have to ring up some 276 million items – in one day.1


For D&B, the world’s leading supplier of business information, services and research, data storage and management are priorities. Its database contains statistics on more than 100 million companies in more than 200 countries, including the largest volume of business-credit information in the world.


Experts guess that Google has at least 20 petabytes of data stored on its servers. To put that figure into perspective, Paul Smalera pointed out that the process of downloading one petabyte over your high-speed internet connection would probably be complete somewhere around the year 2514.2


The Information Explosion


Companies managing massive volumes of information face significant challenges today, including some intense acquisition issues in their business intelligence (BI) supply chain. Acquiring new data and data sources is paramount to enriching existing data products and creating new ones. Information companies may regularly acquire data from literally thousands of sources, in hundreds of different formats, at varying frequencies. And the tolerance for latency in new data shrinks by the day.


According to Gavin Whatrup, group IT director at marketing services company Creston, companies today are being buried beneath an avalanche of data. He notes that data capture technologies such as RFID are generating huge amounts of additional data, that regulations dictate it be retained and that competitive pressures demand it be put to use. In his view, data management (including the storage technology behind it) and knowledge management are going to be key technologies in the battle to remain operationally compliant and commercially competitive.3


Hubert M. Yoshida, vice president and chief technology officer of Hitachi Data Systems, predicts an even more dizzying future. Yoshida said, “Currently, companies basically use petabytes of data, but in the next few years, their data will increase to exabytes.”4


Planning for this information onslaught is a formidable challenge. Aside from the issues associated with legacy systems in place, it’s even more challenging to plan for 2x, 4x and 8x scenarios of existing terabyte data stores. Companies are beginning to approach this challenge, but data proliferation often ends up being the material that keeps CIOs and CEOs awake at night.
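
One way to make the 2x, 4x and 8x scenarios concrete is to estimate when each would arrive under an assumed compound growth rate. The sketch below is illustrative only: the 50-terabyte starting size and the 60 percent annual growth rate are hypothetical inputs, not figures from any company mentioned in this article.

```python
import math

def years_to_reach(multiple: float, annual_growth: float) -> float:
    """Years until a data store grows to `multiple` times its current size,
    assuming compound growth at `annual_growth` (0.60 means 60% per year)."""
    if multiple <= 0 or annual_growth <= 0:
        raise ValueError("multiple and annual_growth must be positive")
    return math.log(multiple) / math.log(1.0 + annual_growth)

# Hypothetical inputs for illustration only.
current_tb = 50.0
annual_growth = 0.60

for multiple in (2, 4, 8):
    years = years_to_reach(multiple, annual_growth)
    print(f"{multiple}x ({current_tb * multiple:.0f} TB) in about {years:.1f} years")
```

Rerunning the calculation with a range of growth rates gives a quick sense of how much lead time each scenario leaves for procurement and redesign.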


This article will discuss system recommendations in future planning for a robust, cost-effective and scalable system, including inbound, processing and outbound BI concerns.


Low-Cost Computing Platform


In order to accommodate explosive data growth through new channels and ever-increasing volumes, your company’s computer platform must be cost-effective now and in the future. Grid-based platforms are an excellent choice, offering a low-cost extensible platform. Hardware is available from a variety of vendors, and hosted solutions are also available.


Mature Operating Environment


Ideally, you should rely on a system that has a history of success and predictability, so look for an industry-proven operating environment. And, although you want a state-of-the-art system, be sure that your company will not be the quality assurance (QA) department for a product to be used in a new and different way. Clearly there is a difference between leading edge and bleeding edge. Make sure your company stays on the former, unless your project – and career – has a high tolerance for risk.


To do that, look for an operational environment that has been production tested in similar scenarios to your requirements. Do your homework. Ask tough questions. And ask for proof of concept (POC) when it appears that your scale may be an order of magnitude larger than anything that’s done in production.


Strong and Flexible Data Acquisition Tools


There are several good ETL tools out there. These days, ETL tools can do much more than in the past, but they can also lead you into performance issues if used as the answer for every problem.


If data cleansing rules are complex and change frequently, consider using a commercial rules engine to implement them. Such an engine can store cleansing rules in a pseudo-English format that business analysts can edit, instead of relying solely on programming resources.
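
The idea of keeping cleansing rules as editable data rather than as program code can be sketched in a few lines. This is not how any particular commercial rules engine works; the rule format, field names and operations below are hypothetical, chosen only to show rules living outside the program logic.

```python
# Rules are plain data, so an analyst could maintain them in a file or table.
# The rule vocabulary ("strip", "upper", "default") is an assumption for this sketch.
RULES = [
    {"field": "company_name", "op": "strip"},
    {"field": "country", "op": "upper"},
    {"field": "country", "op": "default", "value": "UNKNOWN"},
]

def apply_rules(record: dict, rules: list) -> dict:
    """Return a cleansed copy of `record` after applying each rule in order."""
    cleaned = dict(record)
    for rule in rules:
        field, op = rule["field"], rule["op"]
        value = cleaned.get(field)
        if op == "strip" and isinstance(value, str):
            cleaned[field] = value.strip()
        elif op == "upper" and isinstance(value, str):
            cleaned[field] = value.upper()
        elif op == "default" and value in (None, ""):
            cleaned[field] = rule["value"]
    return cleaned

print(apply_rules({"company_name": "  Acme Corp ", "country": "us"}, RULES))
print(apply_rules({"company_name": "Widget Co", "country": ""}, RULES))
```

Changing a rule here means editing the list, not the function, which is the property that lets non-programmers own the rules.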


Be certain to consider the total cost involved with your choice. Prices vary widely, and license fees for some ETL tools can have a significant impact on your budget. Also check how each tool is licensed for the machines on which it will run, because costs can climb quickly as you add servers.


High-Performance Analytic Environment


Analytics means different things to different people, and there are many good tools available. In the financial services and risk arenas, analytics usually means scoring based on a set of criteria as part of the “secret sauce” for each institution. These calculations are usually performed before the data is served up for consumption in BI tools. Also, with trend data such as retail sales, data may need to be aggregated, projected to a larger universe, or used to create a number of derived measures prior to BI.
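
As a rough illustration of projecting sample data to a larger universe, the sketch below scales sampled store sales by the ratio of universe stores to sampled stores within each region. The regions, store counts and sales figures are invented for the example, and real projection methods are usually more elaborate.

```python
from collections import defaultdict

def project_to_universe(sample_rows, universe_counts):
    """Aggregate sampled sales by region, then scale each region by
    (stores in universe / stores in sample)."""
    sales = defaultdict(float)
    stores = defaultdict(set)
    for row in sample_rows:
        sales[row["region"]] += row["sales"]
        stores[row["region"]].add(row["store_id"])
    projected = {}
    for region, total in sales.items():
        factor = universe_counts[region] / len(stores[region])
        projected[region] = total * factor
    return projected

# Hypothetical sample: two stores in the East, one in the West.
sample = [
    {"store_id": 1, "region": "East", "sales": 100.0},
    {"store_id": 2, "region": "East", "sales": 300.0},
    {"store_id": 3, "region": "West", "sales": 250.0},
]
universe = {"East": 10, "West": 4}  # assumed store counts in the full universe

print(project_to_universe(sample, universe))
```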


High-Performance BI/Mart Server Engine


Once data is in an operational data store, there are many good engines that can serve it up to the plethora of great BI tools available today. Properly tuned engines from a number of vendors can provide speedy query responses and flexible data extraction for OLAP tools and reporting. These engines work well with a variety of visualization tools to create sophisticated dashboard views and deep drill-down into the data.


A Combination of Flexibility and Value


All three areas – inbound, processing and outbound – must provide adequate flexibility in a cost-effective fashion. You can always throw more hardware and software at a problem, but costs then grow along with the data. Design for the worst case and manage for the best results. Make sure your technical and financial model scales upward with growth.


Data Warehousing and Software Development Best Practices 


Be sure to follow best practices so that business need drives the technology, and ensure that the business sponsor understands the data model, capabilities and limitations of the new system. Consider the following.5


Best Practices in Data Warehouse Design


  • Functional requirements or business processes should drive the logical data modeling, not the end-user criteria or physical implementation.
  • Develop the data model during project definition.
  • Consider data quality as one of the major decision drivers in design.
  • Report data quality issues on an ongoing basis to the business.
  • End-to-end response commitments – and other quality criteria – should be based on business requirements, not on what is technically possible.
  • Get the business project sponsor to understand and sign up to the data model. Involve the sponsor during definition of the business data model, not afterwards.
  • Translate the data model into a list of written rules and obtain business approval.
  • Size the database on the basis of proven rules and criteria.
  • The corporate data architecture team should sign off on project data models.
  • The project data architect should sign off on module specifications.
  • Define source/target mapping at attribute level during logical/physical data modeling.
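
The last practice in the list, attribute-level source/target mapping, lends itself to a simple machine-checkable form. The sketch below is one possible representation, not a standard; every table and column name in it is hypothetical.

```python
from typing import NamedTuple

class Mapping(NamedTuple):
    source: str      # "table.column" in the source system
    target: str      # "table.column" in the warehouse
    transform: str   # short description of the rule applied

# Hypothetical mappings for illustration.
MAPPINGS = [
    Mapping("pos.txn_amt", "fact_sales.amount", "cast to decimal(12,2)"),
    Mapping("pos.store_no", "fact_sales.store_key", "lookup in dim_store"),
]

def unmapped_targets(target_columns, mappings):
    """Return target attributes that no mapping populates, in sorted order."""
    mapped = {m.target for m in mappings}
    return sorted(set(target_columns) - mapped)

targets = ["fact_sales.amount", "fact_sales.store_key", "fact_sales.sale_date"]
print(unmapped_targets(targets, MAPPINGS))
```

A check like this can run during design reviews to flag target attributes that still lack a documented source.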

Avoid Moving Data


Ideally, a multiterabyte warehouse design should avoid moving or copying data unnecessarily. Copying and moving terabytes of data is a slow process, measured in hours and even days, depending on size.
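
A back-of-the-envelope calculation shows why. The sketch below estimates transfer time from data size and sustained throughput. The 10-terabyte size, the link speeds and the 70 percent efficiency factor are assumptions for illustration, and it uses decimal units (1 TB = 10^12 bytes).

```python
def transfer_hours(size_tb: float, link_gbps: float, efficiency: float = 0.7) -> float:
    """Hours to move `size_tb` terabytes over a link of `link_gbps` gigabits
    per second, assuming only `efficiency` of the nominal rate is sustained."""
    if link_gbps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("link speed must be positive and efficiency in (0, 1]")
    bits = size_tb * 1e12 * 8
    seconds = bits / (link_gbps * 1e9 * efficiency)
    return seconds / 3600

for gbps in (0.1, 1.0, 10.0):
    print(f"10 TB at {gbps} Gbps: {transfer_hours(10, gbps):.1f} hours")
```

Network speed is only one factor. Disk throughput, format conversion and validation on either end can stretch these estimates further.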


In some industries, however, many commercial entities must transport large quantities of data on a regular basis. Univa CTO Steve Tuecke noted that companies in the oil and gas, automotive, semiconductor and pharmaceutical industries often depend on accumulating and moving large data sets.6


Benchmark/POC in Selected Environments


To get a better understanding of potential results, test where you can. If there are no similar test results available from a given vendor/platform, set up a test environment on a trial basis with a representative sample. Test or simulate data with business rules representative of your target environment.
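
One way to build a representative sample for such a trial is to sample each category of input separately, so that rare feeds are not lost. The sketch below shows a simple stratified sample; the feed names and volumes are hypothetical, and the 10 percent fraction is an arbitrary choice.

```python
import random
from collections import defaultdict

def stratified_sample(rows, key, fraction, seed=0):
    """Sample roughly `fraction` of rows from each group defined by `key`,
    keeping at least one row per group so rare cases stay represented."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    rng = random.Random(seed)  # fixed seed keeps test runs repeatable
    sample = []
    for members in groups.values():
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample

# Hypothetical feed: 90 daily files and 10 monthly files.
rows = [{"id": i, "feed": "daily"} for i in range(90)]
rows += [{"id": i, "feed": "monthly"} for i in range(90, 100)]

subset = stratified_sample(rows, key="feed", fraction=0.1)
print(len(subset), sorted({r["feed"] for r in subset}))
```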


The Right Partner


Look for a vendor with significant expertise on the tools, platforms and production environment or operation under consideration. Your vendor’s knowledge and guidance will be invaluable as you go through the decision-making process.


Your vendor should also have specific experience in business operations similar to yours. The more experience a vendor has with operations and concerns specific to your business, the better your results. And finally, be certain that you choose a vendor who has proven to be both a good student and a good listener – not one who is just looking to close a deal this quarter.


While no one has a crystal ball detailing exact requirements, data proliferation will certainly continue. Data is exploding, and data volumes will only increase in the future. In order to accommodate, store and utilize massive quantities of data, you need to make sure you go with the best available technology to meet the BI requirements for your clients.


Gather as much information as you can, and conduct thorough research and testing. Talk to others who have gone through the same process. Be cognizant of vendor pricing strategies for tools on various platforms and consider the total cost of ownership.


If data management/data warehousing is not a core competency, outsourcing can be an ideal solution. If outsourcing makes sense for your company, look for a vendor that has significant expertise in the tools, platforms and production environment under consideration. The more experienced the vendor, the more valuable the guidance.


Set up operational service level agreements. If the system is brand new, wait until it has run for a few months before setting them, so the agreements reflect observed performance.
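
One way to ground a service level agreement in observed behavior is to take a high percentile of measurements from the trial period and add headroom. The sketch below uses a nearest-rank percentile; the load durations and the 20 percent headroom are hypothetical values, not recommendations.

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile: the smallest observation such that at least
    `pct` percent of observations are less than or equal to it."""
    if not values or not 0 < pct <= 100:
        raise ValueError("need at least one value and 0 < pct <= 100")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]

# Hypothetical nightly load durations in minutes from a trial period.
load_minutes = [42, 45, 44, 51, 47, 43, 60, 46, 44, 48]

baseline = percentile(load_minutes, 90)
proposed_sla = baseline * 1.2  # 20% headroom is an arbitrary assumption
print(f"90th percentile: {baseline} min, proposed SLA: {proposed_sla:.0f} min")
```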


With clear objectives, good planning, careful preparation and knowledgeable guidance, your company can develop a flexible, cost-effective solution to handle your data needs now and in the future.



  1. Evan Schuman. “Wal-Mart’s Plans for its 4-Petabyte Database.” StorefrontBacktalk, August 3, 2007.
  2. Paul Smalera. “Google’s Secret Formula.” Portfolio, September 2007.
  3. Andy McCue. “CIO Jury: Data Overload Drowning Businesses.” November 23, 2006.
  4. Izwan Ismail. “The Perks of Eco-Friendly Data Storage.” Data Storage Today, July 12, 2007. 
  5. Shimant Das. “Data Warehouse Design Best Practices, Part 1.” DM Review, September 14, 2004.
  6. Steve Tuecke. “Enterprise Taking Notice of Grid's Data Management Capabilities.” Globus Consortium Journal, February 2006.
