Introducing the Data Warehouse Appliance, Part 1

Published
  • March 01 2005, 1:00am EST

This column is excerpted from the white paper, "Introducing the Data Warehouse Appliance."

ap·pli·ance n., 1: a device designed to perform a specific task. 2: the application of something.

How do appliances fit into our business intelligence world? Products adopting the appliance nomenclature - as in "data warehouse appliances" - have emerged as viable short-list candidates for new data warehouse efforts and for refurbishing existing ones.

These solutions are devices designed to perform a specific task. Vendors such as Datallegro and Netezza now dot the landscape (with representation in analyst "quadrants" and "spectrums" and close to $100M in venture backing), and they require understanding and attention. It's important to note that no two vendors are alike in approach, offerings or target market. Some vendors, for example, are strictly software providers, while others, such as Datallegro and Netezza, include the platform.

Many of the arguments for data warehouse appliances are similar to the ones we've heard from the enterprise resource planning (ERP) and prepackaged software markets. Just as those products are designed to save customers time by jump-starting configuration with features of supposedly widespread applicability, the data warehouse appliance is a hardware/software/OS/DBMS/storage preconfiguration for the back-end data warehousing requirements that have been honed over the last 12 years of practice.

If you believe, as I do, that information is the contemporary competitive landscape, you will want to do everything you possibly can to exploit that asset. You don't want restrictions on bringing in new data, manipulating that data or accessing that data. Today, typical data warehouse environments carry several restrictions that we've "learned to live with": continual upgrade cycles, batch windows, limited usage, hardware complexity, long-running queries, the need to summarize data, the need to "age off" older data and the number of vendors to deal with. There's a point at which overcoming these challenges ceases to be "easy" and affordable.

Perhaps more important are the restrictions in client mind-sets on the possibilities of information exploitation. Once the culture has accepted restrictions for multiple years, it can tend to quit dreaming, quit making revolutionary leaps and limit its query sessions to three levels deep instead of 10. It can stop the data access rollout at half the knowledge-workers instead of all of the knowledge-workers, customers, supply chain partners and broader potential consumers of the data. It can deal with day-old instead of seconds-old data. It can remove year-old POS, RFID, clickstream or CDR data from the warehouse even though the data still has utility. It can ignore complex data types such as flat files, XML, graphics and spreadsheets.

The promise of the appliance is the alleviation of these restrictions. It's not so much that traditional massively parallel processing (MPP) architectures are failing as that their price points and complexity have left them vulnerable.

The data warehouse machine preconfiguration concept is not new; Teradata, Britton Lee and Sequent are obvious examples of the approach. Hardware and software vendors have commonly preconfigured hardware, operating systems, DBMSs and storage to relieve the client of those tedious, commodity tasks (as well as the risk of not getting them right). Well-worn combinations, such as those put forward for TPC benchmarks, are available from either the hardware or the software vendor (to "ready prompt") in most cases. What is new about the modern data warehouse appliance, however, is the use of commodity components and open source DBMSs (or, in the vendors' terms, DBMS alternatives) for a low total cost of ownership (TCO). These open source DBMSs provide a starting point for basic database functionality, and the appliance vendors focus on data warehouse-related enhancements.

The current parallel RDBMS developments had their origins in 1980s university research on data partitioning and join performance. The resultant data movement or duplication needed to bring result sets together can be problematic, especially when gigabytes of data must be scanned.
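
To make the partitioning point concrete, here is a minimal, hypothetical sketch (in Python, with an invented node count and toy table names, not drawn from the column or from any vendor's product) of hash-partitioning two tables on a join key so that each node can join only its own slice. When one table is not already partitioned on the join key, it must first be redistributed, which is the data movement described above.

```python
# Hypothetical sketch: hash-partition two toy tables across "nodes" and
# repartition one of them so the join can run locally on each node.
NUM_NODES = 4  # assumed node count, for illustration only

def hash_partition(rows, key, num_nodes=NUM_NODES):
    """Assign each row to a node by hashing its partitioning key."""
    partitions = [[] for _ in range(num_nodes)]
    for row in rows:
        partitions[hash(row[key]) % num_nodes].append(row)
    return partitions

# Toy data: orders join to customers on customer_id.
customers = [{"customer_id": c, "region": "EU" if c % 2 else "US"} for c in range(8)]
orders = [{"order_id": o, "customer_id": o % 8, "amount": 10 * o} for o in range(20)]

# Partition both inputs on the join key; for orders this is the
# redistribution ("shuffle") step that moves data between nodes.
customer_parts = hash_partition(customers, "customer_id")
order_parts = hash_partition(orders, "customer_id")

# With both inputs co-partitioned on customer_id, each node joins its own slice.
for node in range(NUM_NODES):
    lookup = {c["customer_id"]: c for c in customer_parts[node]}
    joined = [(o["order_id"], lookup[o["customer_id"]]["region"])
              for o in order_parts[node] if o["customer_id"] in lookup]
    print(f"node {node}: {joined}")
```

In a real shared-nothing system that redistribution happens over the cluster interconnect, which is why joins that force large intermediate results to move, across gigabytes of scanned data, are the problematic case.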

To be continued next month.
