Large CRM-supporting data warehouses continue to attract industry attention because of the sheer challenge of coping with their size. While terabyte-sized data warehouses are becoming commonplace, the supporting software and hardware industries are trying to adapt to a model in which top performance is achieved without major customization. However, good performance and throughput for CRM applications using the data warehouse cannot be attained without a design that makes performance the top consideration.
The largest five percent of databases have always had a challenging time delivering performance without proprietary, extraordinary means. Ultimately, these databases help shape the tools and best practices for the rest of the industry. These days, the majority of the databases shaping the direction of the database industry are data warehouses containing detailed, historical customer information designed to enhance customer intimacy.
Questions of data warehouse size range from how much total disk space is available to how much raw, unexpanded data there is. Total disk size includes system overhead, catalogs, metadata, page overhead, temp space, summary tables, indexes, log space, and mirroring and RAID requirements across all the systems; it typically runs four to six times the raw data size. For purposes of discussion, I will use total disk size as the barometer as I highlight architectural responses to the CRM data warehouse size challenge. If your data warehouse is destined to scale from under a terabyte to two-plus terabytes online within the next two quarters, you need a plan now to address acquisition, query performance and other challenges. This is the land of custom extraction, transformation and loading.
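The four-to-six-times expansion factor can be sketched as simple arithmetic. The function name and the specific factors below are illustrative assumptions, not a vendor sizing formula:

```python
# Rough sizing sketch: total disk typically runs 4-6x the raw data size
# once catalogs, metadata, temp space, summaries, indexes, logs and
# mirroring/RAID are included. Factors here are illustrative defaults.

def total_disk_tb(raw_tb, expansion_low=4.0, expansion_high=6.0):
    """Return the (low, high) total-disk estimate for a given raw data size."""
    return raw_tb * expansion_low, raw_tb * expansion_high

# Half a terabyte of raw data implies roughly 2-3 TB of total disk.
low, high = total_disk_tb(0.5)
print(f"{low:.1f}-{high:.1f} TB total disk")  # 2.0-3.0 TB
```

By this measure, a warehouse holding only a few hundred gigabytes of raw data already sits in multi-terabyte territory on disk.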
Keep in mind the other business drivers that shape the overall configuration response for the data warehouse: query service levels, the number of concurrent users and acceptable downtime. The growth of these factors arguably affects overall architecture as much as or more than data size does.
Terabytes add up when each customer touchpoint becomes important enough to collect. Another source of growth is the distribution of this data to CRM-application or user-specific data marts, without the concerns we formerly had about data duplication and storage costs. There is always value in historical data, and until that value is captured, aging it off is not possible.
Business change and the desire for a common view of the customer via the data warehouse environment can attract all kinds of data to the data warehouse, most of it rightfully so. Perhaps the most important benefit the data warehouse can give a company is the sheer efficiency of collecting data remotely and storing it centrally. Apply the principle of triage while you're there on a source system: get all the data you might reasonably need now and in the future.
The Design Response
Numerous summary tables, which in aggregate often take up more space than the base fact data, also grow the warehouse. The value proposition is the diversion of more than 80 percent of user queries off of the base fact tables and onto smaller tables. Extending the maintenance cycle to support the summary tables is a necessary design response to the challenge of large data. I will focus on this aspect of CRM-ready data warehouse performance in a future column.
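The summary-table idea can be illustrated in miniature. This is a minimal sketch, assuming in-memory rows and invented column names; in a real warehouse the detail rows would live in a large partitioned fact table and the summary would be built during the maintenance cycle:

```python
from collections import defaultdict

# Detail-level "fact" rows: (customer_id, date, amount).
# Data and names are hypothetical, for illustration only.
fact_rows = [
    ("c1", "2002-01-01", 100.0),
    ("c1", "2002-01-01", 50.0),
    ("c2", "2002-01-01", 75.0),
    ("c1", "2002-01-02", 25.0),
]

# Build a daily summary once, during the maintenance cycle, so that
# common aggregate queries never scan the base fact rows at all.
daily_summary = defaultdict(float)
for cust, day, amount in fact_rows:
    daily_summary[(cust, day)] += amount

# An aggregate query answered entirely from the (much smaller) summary:
print(daily_summary[("c1", "2002-01-01")])  # 150.0
```

The trade is explicit: extra load-cycle work and disk space in exchange for taking the bulk of query traffic off the fact tables.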
It used to be that partitioning was reserved for very large tables and parallelism was too complex to consider. These strategies are now commonplace. Indeed, for CRM support, many queries perform full scans of sizable tables, mining the total history of the customer's interaction with the company and the customer's associations with the dimensions. Load activity can run to hundreds of megabytes nightly as part of complex "n source/n target" acquisition strategies. These scans and loads are the data warehouse processes resulting from CRM activity that are the prime considerations for a partitioning strategy.
Many CRM-ready data warehouses deploy hash partitioning to ensure even distribution of data across the partitions and collocation of joins. Other implementations manage space effectively with key range partitioning. Database management is much simplified with this approach because the table partitions become the units of load, indexing, reorganization, delete for aging, backup and query.
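The two partitioning schemes can be sketched side by side. This is an illustrative toy, not any particular database's implementation; the partition count, naming convention and date format are assumptions:

```python
import hashlib

NUM_PARTITIONS = 8  # illustrative partition count

def hash_partition(key, n=NUM_PARTITIONS):
    """Hash partitioning: spreads keys evenly across n partitions, and
    equal join keys always land in the same partition (collocated joins)."""
    digest = hashlib.md5(str(key).encode()).hexdigest()
    return int(digest, 16) % n

def range_partition(event_date):
    """Key range partitioning by month: each partition becomes a unit of
    load, indexing, backup and delete-for-aging. Naming is hypothetical."""
    return "p_" + event_date[:7].replace("-", "_")

# The same customer key always hashes to one partition; a row's date
# determines its range partition, e.g. 'p_2002_03' for March 2002.
print(hash_partition("c1"), range_partition("2002-03-15"))
```

Note the difference in what each scheme buys: hash placement serves query balance and join collocation, while range placement serves manageability, since aging off a month means dropping one partition rather than deleting rows.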
Given the large amount of data in a CRM-ready data warehouse, performance is the primary consideration for just about every design decision. From a technology perspective, the temptation is always there to pursue the latest and greatest as a primary consideration. Resist the urge. A performance-based design based on partitioning and summarization coupled with compatible, scalable, parallel technology components will yield far superior performance to dropped-in data models and architectures not designed for performance.