Over the past 10 years, the business intelligence (BI) community has been bombarded with an ever-increasing number of emerging vendors and products, each touting newer and better functionality with more buzzwords and acronyms than even the most savvy IT professional can keep up with. The world of reporting, database, and extract, transform and load (ETL) tools as separate commodities is fading as vendors offer one-stop shopping for all BI needs. In particular, many database and reporting tool vendors now provide ETL functionality as part of their BI suites, either through enhanced functionality or through the acquisition of third-party software. Yet even with the tremendous range of software and functionality on the market, there is still no shortcut to building a quality BI solution. In fact, the task often becomes more difficult because enterprise solutions are frequently built from best-of-breed technologies supplied by different database, ETL and BI/reporting tool vendors. Developing a quality solution requires quality within and across all components of the BI architecture.
With the front-end focus of many BI projects (after all, this is what sells BI engagements and what the business users see at the end of the day), it is not surprising that organizations often lose sight of the workhorse of the BI effort - the back end of the production line where data is acquired, cleansed and transformed into useful information for downstream consumption. This effort often accounts for at least 70 percent of the overall development effort of a typical BI project and, if not architected correctly, renders the presentation of data useless. Garbage in, garbage out - whether the data is coming from the source systems or an enterprise data warehouse.
Typical high-level enterprise data warehouse architectures contain several recognizable key components. We've all seen them, and most are fairly cookie cutter in nature, as illustrated in Figure 1. However, the most important component is often discussed only briefly prior to the beginning of an engagement and given the lowest priority once development has started. I am talking about quality control - a component that is not solely concerned with the quality of the data as it moves through the different layers of the architecture (as important as that is), but one that also focuses on the quality of the ETL deliverables and processes. Implementing a quality solution from a back-end perspective is not an easy task. It should be no less complex than implementing a quality OLTP or ERP solution; in many cases, it may be more complex. However, it is not an impossible task, and when architected correctly, it can be leveraged for future phases and/or future BI efforts.
Figure 1: EDW Architecture
The key to building a quality ETL solution is to define and develop standard processes and templates throughout the architecture, design and development phases of the project and ensure that those standards are leveraged correctly. These standards encompass more than just corporate naming standards for database development; they also include process standards, standards for all ETL processes and data movement activities across the entire data architecture. Developing a consistent set of standards ensures conformity across the components, reduces development and testing times, simplifies maintenance and reduces total cost of ownership. Also, because many BI efforts are now developed using a combination of local and contractor resources, including offshore development and support models, it is increasingly important that these solutions are designed and built with solid common standards and processes in place that have been tried and tested.
As different as most BI projects are, from an enterprise perspective the architectures are very similar, and standards can be developed for many of the components of the back-end architecture. The minimum standards that should be defined include:
- Database naming standards for the enterprise data warehouse (EDW) and/or data marts, including intermediate staging areas;
- ETL standards for all the ETL components, whether using an ETL tool or other methods for ETL; and
- Operating system and security standards, where appropriate.
These standards, however, only scrape the surface when it comes to developing a quality solution that should also include:
- Defining and/or developing processes for all common components of the architecture;
- Defining and implementing procedures and templates for ensuring that standards are followed and implemented correctly; and
- Defining appropriate levels of testing and oversight to ensure that all deliverables have been tested thoroughly and implemented consistently.
Common Process Standards
Standards should be applied throughout the back-end architecture, leveraging existing internal and vendor standards where appropriate. However, to be truly effective it is important to standardize all back-end processes. Defining standards for these processes up front is an investment that will certainly pay off during other development initiatives, as future phases and projects can be implemented more quickly and new resources can get up to speed in a shorter time frame. From an enterprise perspective, activities that can and should be standardized include:
- Data acquisition and staging activities,
- Model and ETL standards for maintaining history in the different layers of the architecture,
- Data quality and exception handling activities,
- Conformed dimension strategies and standard methodologies for maintaining dimension and fact tables with considerations for maintenance and performance, and
- Metadata management and integration with ETL metadata.
Data Acquisition and Staging Standards
Data acquisition and staging processes should be defined with considerations for efficient data extraction and to minimize intrusion against upstream source systems. Activities that can be standardized include:
- Defining the formats and content for source system extracts and acquisition methods (including file naming standards, formats and delivery mechanisms);
- Defining additional staging metadata (used to support common downstream processes such as auditing); and
- Defining common processes for all data staging activities.
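One way to picture the staging metadata mentioned above is a small routine that stamps every staged record with standard audit columns. This is a minimal sketch, not a prescribed implementation; the column names (batch_id, source_file, load_ts) and the file format are assumptions for illustration.

```python
import csv
import io
from datetime import datetime, timezone

def stage_rows(reader, batch_id, source_file):
    """Yield source rows enriched with standard staging audit metadata
    so downstream processes can trace each row back to its extract."""
    load_ts = datetime.now(timezone.utc).isoformat()
    for row in reader:
        row.update({
            "batch_id": batch_id,
            "source_file": source_file,
            "load_ts": load_ts,
        })
        yield row

# Usage with an in-memory extract standing in for a delivered flat file.
extract = io.StringIO("customer_id,name\n42,Acme\n")
rows = list(stage_rows(csv.DictReader(extract), batch_id=1001,
                       source_file="cust_20240101.csv"))
```

Because every staging process applies the same audit columns, downstream auditing and reconciliation routines can be written once and reused.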
Data Quality and Process Standards
Developing a robust data quality solution is a key component of any enterprise BI architecture that should include:
- Defining how data quality issues will be addressed and how the ETL processes should handle the different exceptions, and
- Defining how and where data quality metadata is maintained and how the metadata is leveraged by the processes that use that information.
The level of effort during this phase is dependent on the organization's priority for data quality. However, a robust data quality solution must be flexible enough to change as the priorities for the treatment of data quality change.
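The exception-handling approach described above can be sketched as a rule-driven filter: each rule carries a reason code, and rows that fail are routed to a reject queue rather than aborting the load. The rule names and row structure here are hypothetical; in practice the rules themselves would typically be driven from data quality metadata so they can change without code changes.

```python
# Illustrative rules: each pairs a reason code with a validation check.
RULES = [
    ("MISSING_KEY", lambda r: r.get("customer_id") not in (None, "")),
    ("NEG_AMOUNT", lambda r: r.get("amount", 0) >= 0),
]

def apply_rules(rows):
    """Split rows into clean records and rejects tagged with reason codes."""
    clean, rejected = [], []
    for row in rows:
        failures = [code for code, check in RULES if not check(row)]
        if failures:
            rejected.append({"row": row, "reasons": failures})
        else:
            clean.append(row)
    return clean, rejected

clean, rejected = apply_rules([
    {"customer_id": "42", "amount": 10.0},
    {"customer_id": "", "amount": -5.0},
])
```

Keeping the rules as data (rather than hard-coded branches in each ETL job) is what lets the solution adapt as data quality priorities shift.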
Standards for Implementing Conformed Dimensions
An enterprise BI solution typically separates the integration area of the data from the downstream data marts (hub-and-spoke architecture). These components may exist in the same database instance or multiple databases but are often in different instances and reside on different database platforms. In order to reuse common dimensions, it is important to develop standards for a conformed dimension strategy that should include at a minimum the following:
- Defining the dimensions that should be conformed;
- Defining how the conformed dimensions should be built (Type I or II, etc.) and maintained; and
- Defining how downstream data marts should implement the conformed dimensions.
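As a concrete illustration of the Type II maintenance decision above, here is a simplified sketch of a slowly changing dimension update: when a tracked attribute changes, the current row is expired and a new version is inserted. The column names (customer_id, city, valid_from, valid_to, is_current) are assumptions for illustration, and a real implementation would run as set-based SQL rather than row-at-a-time Python.

```python
from datetime import date

HIGH_DATE = date(9999, 12, 31)  # conventional "open-ended" end date

def apply_scd2(dimension, incoming, as_of):
    """Apply Type II changes: expire the current row and add a new version."""
    current = {r["customer_id"]: r for r in dimension if r["is_current"]}
    for row in incoming:
        existing = current.get(row["customer_id"])
        if existing is None:
            dimension.append({**row, "valid_from": as_of,
                              "valid_to": HIGH_DATE, "is_current": True})
        elif existing["city"] != row["city"]:
            existing["valid_to"] = as_of
            existing["is_current"] = False
            dimension.append({**row, "valid_from": as_of,
                              "valid_to": HIGH_DATE, "is_current": True})
    return dimension

dim = [{"customer_id": 42, "city": "Austin", "valid_from": date(2020, 1, 1),
        "valid_to": HIGH_DATE, "is_current": True}]
dim = apply_scd2(dim, [{"customer_id": 42, "city": "Dallas"}], date(2021, 6, 1))
```

Standardizing this pattern once means every conformed dimension is versioned the same way, which is what allows downstream data marts to share it safely.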
Standards for Fact Table Builds
The biggest issue when building fact tables (or any large table in general) is typically related to the performance of the build process, especially as data volumes increase over time. Fact tables grow quickly, and rebuilding a fact table completely during the load process is not an efficient design. Standards should be developed that determine how fact tables are built, and in many cases, common processes can be developed to support this effort. This can often be achieved by creating a set of processes and standards for fact table builds that can be leveraged across many of the ETL components. This not only allows applications to scale over time but reduces future development times for similar activities.
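One common pattern for avoiding full rebuilds, sketched below, is to replace only the partitions touched by the incoming batch. Partitioning by transaction date is an assumption here, and the dictionary-of-lists structure simply stands in for database table partitions.

```python
def incremental_load(fact_partitions, new_rows, partition_key="txn_date"):
    """Replace only the partitions present in the incoming batch,
    leaving untouched history in place."""
    touched = {row[partition_key] for row in new_rows}
    for part in touched:
        fact_partitions[part] = []  # drop only the partitions being reloaded
    for row in new_rows:
        fact_partitions[row[partition_key]].append(row)
    return fact_partitions

# Two existing daily partitions; the new batch restates only 2024-01-02.
fact = {"2024-01-01": [{"txn_date": "2024-01-01", "amount": 10}],
        "2024-01-02": [{"txn_date": "2024-01-02", "amount": 20}]}
fact = incremental_load(fact, [{"txn_date": "2024-01-02", "amount": 25}])
```

Because the load cost is proportional to the size of the batch rather than the size of the table, this approach keeps build times stable as the fact table grows.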
Developing Common Processes and Templates
Developing common processes before or at the start of the development phase will simplify the development process and ensure that all ETL activities are built in a consistent manner. This will speed up the testing process (because most of the common activities should be designed, developed and tested up front) as well as code reviews because most processes should be built with a common look and feel.
Common processes and templates that should be developed include:
- Documentation that illustrates the standards just described as well as templates that illustrate how the different standards can be adopted throughout the different ETL processes.
- Building common processes and scripts that can be leveraged across all of the ETL components. This includes metadata models and processes for auditing and exception handling, reusable scripts for file and database logging, and reusable templates (for example, UNIX scripts and functions) that can be leveraged across the ETL processes.
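The reusable logging and auditing scripts described above can take many forms; one minimal sketch in Python is a step wrapper that gives every ETL task a common logging footprint, standing in for the shared shell functions the text mentions. The step and function names are hypothetical.

```python
import logging
import time
from functools import wraps

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("etl")

def etl_step(name):
    """Wrap an ETL task with standard start/finish/failure logging."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.time()
            log.info("step %s started", name)
            try:
                result = fn(*args, **kwargs)
                log.info("step %s finished in %.2fs", name,
                         time.time() - start)
                return result
            except Exception:
                log.error("step %s failed", name)
                raise
        return wrapper
    return decorator

@etl_step("load_customers")
def load_customers(rows):
    return len(rows)
```

Every process wrapped this way produces the same log shape, so operational monitoring and code reviews see one consistent look and feel.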
In general, any activity that is common across the ETL processes should be standardized. Many of these standards can be implemented across the different layers of the architecture, including the EDW and data marts, and leveraged across future BI efforts.
By developing standards and processes throughout the back end of the BI architecture, the design, development, testing and implementation activities can be realized in a shorter amount of time and in a consistent manner. Much of the ETL code has a similar look and feel, and can be tested and reviewed quickly. By creating common processes and procedures, many of the complexities associated with a robust back-end architecture can be developed and tested independently of the processes that will use those components. Developers can be productive when a large percentage of the development effort has been standardized, and the client can be assured of a quality solution when deliverables are created in a consistent manner.