Speed is a five-letter word that confronts the database world. Database, system and network administrators have long been trying to find the perfect balance between performance, bandwidth and uptime. While business intelligence applications mandate a quick response time and near realtime application availability, C-level executives question total cost of ownership and the return on investment. How do you cope with the ever-increasing demand for faster performance without spending many thousands of dollars every time?
Technology strides in data warehouse architecture and hardware have added the data warehouse appliance, a nice architecture component that can be implemented within the existing infrastructure to complement the performance and availability needs for BI solutions. During the past three years, there has been a move to embrace the appliance even more within the IT organization. This article shares a case study demonstrating how the appliance can be integrated into the existing architecture to achieve optimal performance and scalability while keeping costs minimal.
What is the biggest issue that keeps you awake at night? I pose this question to IT and C-level executives alike when invited to discuss a problem. Depending on who you are talking to, typical answers range from My query that runs for 10 minutes in a 100,000 records database runs for 10 hours in a 100,0000 records database, My users keep telling me that extract, transform and load (ETL) processes failed due to large data volumes in the data warehouse, My CFO wants to know why we need another $500,000 for infrastructure when we invested a similar amount just six months ago, or My database admin team is exhausted trying to keep the tuning process online for the databases to achieve speed. The consistent theme that has been expressed is about the need to keep speed of response up while not allowing for extra spending.
The key factors that are driving performance in a data warehouse include:
Data loading Data loading into the data warehouse is one of the longest processes. The reasons for this are manifold and have been discussed in depth in various ETL articles and case studies. To summarize, the process of extracting various data feeds, processing them through data quality and data profiling and loading them with or without transformations to a final destination is time-consuming. This is especially true when the input volumes are low, small bursts of data and the speed is impacted by the volume of data in the data warehouse.
Data availability Data availability service level agreements (SLAs) have a profound impact on the need to have a highperformance environment. For data to be pristine, integrated and available for downstream applications such as reporting and analytics, end user needs must be clearly documented. Additional aspects that organizations often fall short of are data growth projections, data demand projections, data retention cycles and associated SLAs that have not been documented.
Data volumes Data volumes in the data warehouses within organizations have been exploding by gigabytes every day. Growth rates for transactional data at its granular level have increased by 50 percent in the past three years. Reasons for this data volume explosion are compliance, legal mandates and business users.
Sarbanes-Oxley (SOX), the Health Insurance Portability and Accountability Act (HIPAA), the Gramm-Leach-Bliley Act (GLBA), and payment card industry (PCI) regulations have mandated that structured data must be retained online long enough to meet e-discovery and compliance requirements. Unfortunately, the overall performance and response time of current major relational database management system (RDBMS) platforms suffers as the database size increases. This is very much a fact in databases over 1TB, which is now rapidly becoming the starting size of data warehouses with two to three years of data.
Recent class-action lawsuits in the pharmaceuticals, manufacturing and tobacco industries have left companies feeling the legal need to retain data for a longer period. Although all the data does not have to be online, the data structures and associated metadata must be maintained whether online or offline to satisfy these needs.
Business leaders are increasingly seeing the value of information gathered from exercises in data mining and exploration, historical analysis and predictive intelligence. Being able to see the history and predict a potential value proposition in the future on similar activities in the business, including promotions, campaign and other similar revenue generation opportunities results in a demand for storing data longer.
Storage performance Disk and storage systems have been consistently improving over the years in terms of speed and performance, while costs have been relatively stable if not less expensive. Architecturally, storage is shared across all areas of a data warehouse, making it a highly constrained area in terms of availability and performance. ETL and BI queries are not small traffic in nature and consume a lot of space and network bandwidth. If multiple queries are fired off on this shared storage architecture, even best-in-class hardware and disk are not going to enable faster query processing and a lightning response time for results. If we add mixed query workloads on the storage architecture, we start to see slower and slower performance cycles resulting in a poor query performance and highly constrained network in terms of bandwidth. Even continued strides in improving the overall storage performance will not make it more optimal for data warehousing. Faster is better does not fit in this space.
Operational costs The operational cost of running and maintaining a data warehouse has been monumental in many organizations. Especially with the granularity of the data growing deeper and the need to store history growing longer, a two-way explosion has resulted in an unmanageable amount of information to be handled by the data warehouse. Additionally, multiple kinds of related activities like data mining and predictive and heuristic analysis have placed a heavy demand on resources both in hardware and in IT administration (such as the database administrator [DBA], system administrator and network administrator roles). The overall cost of running and maintaining the data warehouse has left IT feeling numb and cold rather than energized and happy.
I have so far outlined the various pain points, issues and challenges that are faced in managing and maintaining a high-performance data warehouse. Is there a remedy in sight? Can we have our cake and eat it too? The short answer to these questions is yes, but the real answer is explained in the following sections of this article.
The Data Warehouse Appliance
A data warehouse appliance is an integrated set of servers, storage, operating system, database and interconnect specifi cally preconfigured and tuned for the rigors of data warehousing. Data warehouse appliances were introduced in late 2001 by Netezza. Today the marketplace sees solutions from Sun, HP, DATAllegro (now Microsoft), Aster Data, ParAccel, Kickfire, Kognitio, Greenplum, Dataupia and others contributing to the appliance market. Data warehouse appliances offer an attractive price/performance value proposition and are frequently a fraction of the cost of traditional data warehouse solutions.
Appliances have been built on the following basic principles and all contain:
- Shared nothing architecture.
- Commodity hardware and storage.
- Open source RDBMS platforms; ANSI SQL compliant.
- Open source operating system.
- Massively parallel processing (MPP) engines.
- Data loading tools.
- System management and monitoring tools.
- Commodity hardware and open source operating system.
- Intelligent storage with MPP built in for query optimization.
- Minimal DBA requirements for performance tuning and optimization.
- Built-in failover and fault tolerance.
- System administration requires minimal resources.
- Queries will perform faster (50 percent to 1,000 percent) on average, compared to the traditional RDBMS platforms.
- Command line options are available to adjust any data skew in the different partitions.
- A minimal indexing requirement allows data loading to occur more quickly.
- Resource allocation and marshalling on shared nothing architecture provides quicker query response.
- License costs are at least 8 to 10 times lower than the leading RDBMS.
- No additional cost for MPP engine.
- Minimal DBA requirements for performance tuning and optimization.
- Built-in failover and fault tolerance reduces outages.
- Modular scalability; build as you go keeps costs in control.
Note: As with any system, you will still need to deploy a disaster recovery strategy and backup procedures for the appliance. ETL processes will need to be developed to move data from the data warehouse to the appliance if the appliance is not the data warehouse.
Fictional Case Study
A large multinational company has an enterprise data warehouse custom built on a leading RDBMS platform. It started as a 250GB database four years ago and has grown to 3TB in size. That is almost 100 percent growth each year, but in reality has been 150 percent in the last two years. With this explosion of data volume, a number of users have been added to the system steadily over the last three years. A 20-user initial deployment has grown to a 200-user environment. This has contributed to the workload becoming a mixed environment with all types of queries being run from the data warehouse. How do we implement a data warehouse architecture that is going to be scalable and can guarantee performance and lower the overall cost of ownership, while being a strategic asset of the business?
There are four critical dashboards that will need to be refreshed on a daily basis and show a weekly tracking.
From the above problem statement, the following points have been established:
- The data warehouse has four years of history.
- The data warehouse growth is at a record pace, and the need to have both granular and summarized data is critical.
- User volumes have grown from 20 to 2000.
- The query workload is growing to a mixed workload.
- Executive dashboards need to be refreshed everyday.
In this scenario, the internal decision has been made to expand the current data warehouse and build data marts to satisfy the user needs for detail and summary data.
There are a number of issues here that will prevent this growth based on traditional solutions, including cost of RDBMS licenses, cost of storage, data footprints, maintenance issues and SLAs. How do you justify the addition of a data warehouse appliance? Begin by considering the business case and fi nancial justification.
The business case for buying the appliance has four main points. A performance issue exists with the current reports and their availability. The executive dashboard is not available with most updated information. The fi nance and accounting departments request multiple summary tables, and their refresh is consuming all the critical time. The auditing department wants the daily records loaded, rejected and reasons report by 10 a.m.
The financial justification for buying the appliance includes the costs of licensing, storage and administration. The current database license will not allow addition of users. The additional cost of $50,000 is required to add more users. Second, to satisfy the requirements of the fi nance and accounting departments, $500,000 is required to buy SAN storage.
Additionally, another DBA will need to be hired to maintain and manage the database. This will be a burdened cost around $100,000 per annum.
Based on the points above, it is clear that from a tactical perspective, the appliance could answer the performance and availability questions very precisely. From a strategic perspective, the overall cost of implementation and ownership definitely looks tangible, and the costs must be understood in order to justify the investments into the new technology.
There are different costs to be considered while deploying the appliance. First is the cost of hardware and software. Typically, the appliance starts with a terabyte range for deployment. This will include hardware, software and operating system costs and should be approximately $100,000. Next is the training costs. Typically the vendors offer training for DBA and developer resources at nominal costs on site. This should be factored into the appliance pricing, if possible. Third is the ETL development cost. While this cost will be the same whether you deploy the appliance or expand the current system and add storage space, there will be an initial cost to add the ETL to the appliance due to the nuances in the SQL definitions.
Apart from these costs, the standard cost of IT floor space and power will need to be factored in. This should be consistent whether you add more disks to the current database or deploy the appliance.
Based on these points, a business case and justification for investment in the appliance technology has been covered. Now we move on to the integration architecture and start examining the integration touchpoints for the appliance deployment.
Figure 1 shows the appliance integration architecture. The different areas in which you can integrate an appliance are across all layers except the data source layer. Even in this layer, the appliance can be implemented as a pure storage mechanism in an augmentation mode. There are several factors to consider before you augment the appliance:
- What data will you move to this layer?
- What are the security and access requirements the data from this layer?
- What are the compliance standards for storage that need to be considered?
- What is the lifecycle of the data within the appliance?
- What are the data latency requirements for loading and querying the data from this layer?
Follow these five steps to plan the integration:
- Define the user requirements.
- Define the data model.
- Define the data movement process.
- Define the physical architecture of the data mart or data store.
- Implement the data warehouse appliance.
The data warehouse appliance can be integrated in two ways. The first is the augmentation architecture. With this technique, the appliance is an augmentation to the existing RDBMS platform, as shown in Figure 2. The pros of the augmentation architecture are that the appliance can be architected to minimize the data workload and the footprint simultaneously, and that the data within the appliance can be configured and laid out to support a specific set of needs because the base data is available in the appliance itself. The con is that because the appliance is in a slave mode under the RDBMS, a minimal latency exists in the data traffic which cannot be completely avoided.
The second architecture for integration is the standalone architecture. With this technique, the appliance is a replacement to the data warehouse, as shown in Figure 3.
The pros of this architecture are that the data within the appliance can be configured and laid out to support the data warehouse and that the performance and scalability increase. The cons are that you will probably spend more on redevelopment costs for loading and querying the data, critical applications that run from the data warehouse outputs will require planning and a well thought-out sunset strategy is needed in this approach.
Best Practices for Implementing a Data Warehouse Appliance in your Organization
1. Determine your needs. The first action item of implementing a data warehouse appliance is to
determine your requirements. Be sure to consider:
- Data availability current and historical.
- Analytical requirements.
- Reporting requirements.
- Query performance SLAs.
- Data loading SLAs.
2. Analyze the architecture options. If you are an Oracle or SQL Server shop, you can implement an augmentation architecture, if you are a DB2 or Teradata shop, you will be better off with the standalone architecture.
3. Plan to minimize the data footprint. By implementing a data warehouse appliance, reduce the data footprint in your infrastructure. This will require a data consolidation project to be implemented along with the data warehouse appliance implementation.
4. Make the most of the solution architecture. Plan to optimize the different layers in your solution architecture when implementing an appliance. For example, in implementing a standalone solution with Netezza, you will plan for an optimization of queries. Similarly, when you implement Aster Data, you will plan for optimizing ETL, querying or both. The underlying architecture of an appliance will determine the ideal optimization that can be achieved.
5. Consider scalability. All appliances have a Lego-like scalability. Plan to add the additional space as required. This kind of planning will help you save initial and ongoing costs.
The appliance can provide an unmatched scalability at an affordable price and, in certain organizations, can leverage the existing investments too. To make the most of your investment, you need to wisely choose what problem you want to address when you implement an appliance.
Register or login for access to this item and much more
All Information Management content is archived after seven days.
Community members receive:
- All recent and archived articles
- Conference offers and updates
- A full menu of enewsletter options
- Web seminars, white papers, ebooks
Already have an account? Log In
Don't have an account? Register for Free Unlimited Access