Best Practices for Data Warehouse Database Developers
By nature, data warehouse projects are costly endeavors where many resources are consumed both hardware and software. Although, experienced database developers may use 90 percent of the same skills on their last projects, they may find themselves approaching development issues differently!
While the project team as a whole usually understands the risks and benefits of building a data warehouse, I find that not enough IT managers are initially aware of the unique challenges a data warehouse effort poses for database developers. Operating in a very large database (VLDB) multigigabyte to terabyte decision support system (DSS) data environment will often require unique approaches for database developers, where strategic value can be added to the warehouse development cycle. Understanding and anticipating the kinds of challenges developers will face in a DSS database is essential. This article will address a few of these challenges.
1. Make sure you are provided with a usable data dictionary before starting heavy-duty development. Many data warehouse projects suffer from time constraints, so it is not uncommon for some area of construction on the system database and corresponding development tasks to commence while other tasks in the analysis domain business user interviews, requirements gathering, source to target analysis, etc. are still being conducted in parallel. On these types of data warehouse initiatives, the developers seem to be perpetually playing detective iteratively asking questions about data mapping, validation ranges, aggregation and related semantics during their coding of procedures, triggers, queries, application programming interfaces (APIs), ETL (extract, transform and load) scripts, and so on. While there may certainly be crossover between the gathering of systems requirements for a data warehouse and the construction of a data dictionary, some sort of data dictionary should be in place before any critical coding or database development takes place. As this lexicon of corporate data meanings and semantics grows, the corporate data steward should see to it that things such as rules, validations and domain ranges are added, giving rise to a true enterprise dictionary. The data dictionary should be stored on the corporate intranet and available to both business users and developers alike.
Warehouses that are built without a useful data dictionary will often result in physical functional areas sharing common data elements, duplication of coding effort, increased redundant data and confusion and communication problems for the developers. The data dictionary should be stored in a meta data repository database, and a concerted effort should be made to merge and tie in the information with your ETL tool's meta data (for example, source and target mappings). Developers will be glad they have one place to find mappings, data meanings, validations, domains, aggregation rules, etc. Without an industry standard on ETL meta data, this may be easier said than done; nevertheless, the days of keeping the data dictionary solely in a spreadsheet on a file server should be over!
2. Save query plans, run times and performance benchmarks in the database. Storing processing performance information and benchmarking data in the database can be done quite easily, although it is often an afterthought in many data warehouses. For example, recording a process start time and end time for every critical batch or processing task in the warehouse can easily be implemented via such things as stored procedures, shell scripts or ETL tool tasks that serve as wrapper or control objects. These process control components become responsible for recording execution and completion statistics as they execute the critical processes in the data warehouse. Why keep benchmarks? Saving benchmarking data in the database helps pinpoint performance problems by establishing foundations of mean/median run times. This helps the team focus on tuning opportunities and gives direction on things such as hardware load balancing, troubleshooting, SLA agreement expectations and facilitates better practices on the maintenance of your system.
Keeping process benchmarks as part of your meta data is a logical extension of a robust meta data repository, providing information about your warehouse processes job sequence, parameters, run-times in one physical place. Remember that meta data should not just be data about your business- oriented data, source target mappings, etc.; it is also data about your warehouse processes. You could ameliorate this approach to track user activity, identifying bottlenecks and most-used queries by grabbing statistics on query start and end times, most-used queries, number of reads on the database, number of rows returned per query and more.
3. Save ETL, validation and processing errors in shared database tables. Similar to the previous approach is the practice of properly trapping all data warehouse processing errors in database tables. Nobody should have to wade through error logs and error tables marooned in multiple environments. All errors should be trapped, consolidated and sent to one place your meta data repository. This means that any errors that occur in the domain of the ETL tool are logged with any errors encountered in the post-ETL tool load process, whether it be from things such as loading the operational data store (ODS) or building the online analytical processing (OLAP) cube. It is important to establish error thresholds for each process in the data warehouse as well as what actions to take when those error thresholds are encountered. This is usually one area where requirements gathering falls short; nevertheless, veteran data warehouse developers will want answers about this information fairly early in the development process. E-mail notification of any errors that exceed predetermined thresholds should be the goal of any robust data warehouse.
4. Avoid long-running transactions. In your online transaction processing (OLTP) applications, you did not have to worry so much about long- running transactions. However, now those data manipulation language (DML) operations on millions of rows may fill up the database's transaction log, bringing your development or batch processing to a standstill. If you are writing stored procedures, keep them modular with respect to each unit of work, and break your transactions into more granular operations. This will also give you more leverage over error failure as you will have less to roll back when an error condition strikes, and you can isolate your errors more easily. Also, remember that you are dealing with millions of rows. All those long-running transactions may hold locks on precious data, slowing a parallel load of your database to a crawl.
5. Use referential integrity carefully. Beware of the pitfalls of using all the of referential integrity (RI) bells and whistles of your relational database management system (RDBMS); always know the performance tradeoffs with RI. While foreign key constraints help data integrity, they have an associated cost on all insert, update and delete statements. Give careful attention to the use of constraints in your warehouse or ODS when you wish to ensure data integrity and validation. Also consider the advantages of implementing certain types of validations and check constraints in your ETL tool or data staging area. While triggers are a godsend in OLTP, they may slow mass inserts into your VLDB considerably, as every row inserted will fire its corresponding trigger once.
6. Learn to recognize when the law of diminishing returns is in effect. Sometimes "good enough" performance is acceptable. Avoid the urge to perform endless incremental improvements in the optimization of your database code. Many times as a matter of pride or competition, developers try to keep tuning structured query language (SQL) or other code when, in fact, the run times of the current batch processes fit comfortably into existing batch windows. Although, this may be the simplest concept in the article, it remains very difficult for many developers to grasp. Information technology exists to support the business and its processes in a constrained time arena; know the service level agreements you have with your business users and exactly what types of improvements will help you meet or keep your acceptable levels of service.
7. Always understand your database's optimizer and query plans. Everybody knows that random-access memory (RAM) access/logical reads are always cheaper than physical disk access, yet I am always amazed at the lack of understanding and attention given to such things as query plans and I/O statistics analysis. All developers writing SQL operations against a VLDB should know how to create and decipher a database's query plan and be able to tune all data manipulation statements for best possible performance. When I encounter a data warehouse schema for the first time and I want to issue a SQL statement, I always try to find out as much as I can about the nature (business meanings, storage, indexes, etc.) of the data. Before I execute any queries against the data warehouse, I first compile them and then run them (non-exec mode) with the query plan in effect. Only when I am comfortable that I am covering indexes, issuing the correct joins and getting good I/O statistics, will I execute the query. If I am just trying to get "acquainted" with the data, I will limit my result sets so that only enough rows are returned as to provide me with some clues about the nature of what the data means, in the real world empirical sense. This approach has saved me many trips to the DBA on duty to ask him or her to kindly kill my runaway processes or Cartesian product of the day.
Be aware that some of those DML operations in your repertoire that may have been fine on an OLTP order-entry system may not work in a huge, historically archived database. For instance, if you are now inserting 6 million rows en masse from an ETL tool, you should be aware of the repercussions that clustered indexes may have on your operation the possibility that your load methodology will require the database optimizer to reorder/split some of your physical data on each insert. Even worse, updating field values that participate in a clustered index may take forever, as each updated row must be physically moved so that its location conforms to the order specified by the index.
8. Know the limitations of your ETL tool. Before you begin serious development with your ETL tool of choice, be aware of all of its limitations and how to work around them. To give an example, many ETL tools require advanced coding practices to go from long flat file structures to various types of normalized RDBMS table structures. Therefore, you may have to output DML from your ETL tool into a SQL-esque log file, parse the log file and then use the parsed file to perform inserts into your warehouse database.
Also keep in mind that many ETL tools robust as they are do not have a meta data repository that integrates easily with your enterprise repository, making it hard to change tools in midstream. Never underestimate the integration challenges that may arise when tackling your meta data requirements.
9. Be involved in planning physical environments for testing, QA and migration. Fundamentally speaking, version control and change management practices for a data warehouse are virtually identical to a normal non-DSS environment. Developer access should be restricted to the production database as database code, scripts and objects should be checked from a repository not just grabbed from production. A much more daunting task is deciding how to re-create the physical data warehouse environment so that developers get a true test and quality assurance (QA) environment separate from production. Given the huge volume of data that a warehouse contains, as well as all the sundry applications and pieces that make up its architecture, this may prove too costly to do, resulting in shortcuts or sharing architectural components between QA, test and production environments. In this case, even more thought should be given to where exactly the developer will be able to develop, perform QA and migrate new code or bug fixes. It is not uncommon for a warehouse project to be very far along before serious thought is given to migration processes and environments in which developers will conduct the maintenance and test of code because the focus tends to be on the production environment. The opposite approach of "cutting over" from development to production can be just as bad, not to mention risky. A savvy developer will start raising questions concerning the need for multiple physical environments early in the project. After all, he or she will be working every day with the physical setting provided.
Bear in mind that I have only scratched the surface of best practices for data warehouse developers. IT managers, project leaders and developers who are involved with their companies' or clients' warehousing efforts should become acquainted with these issues and sundry related considerations. While every subtopic listed could warrant its own in-depth article, an understanding of these topics will go a long way to ensure success for database developers in a data warehouse environment.