The expectations from a data warehouse implementation are growing at an exponential rate without compromising the ever-increasing data volumes and shrinking latency. If we think about it, we really should not be surprised at all! Organizations today demand a high degree of access to accurate and timely information so that decision making is fast and effective. At the same time, volume of data is growing exponentially every single day. In addition to this, increasing number of source systems, different data formats and system complexities are directly impacting the system resources and the ROI of the data warehouse.
One can choose industry-standard ETL tools, design robust and scalable ETL architectures, budget considerable amount of time and money for performance enhancements, still the key to a successful data warehouse implementation lies in defining and implementing an optimal data acquisition strategy.
What is Data Acquisition?

Figure 1
Data acquisition can be defined as a strategy - a data-centric processes for assembling data from disparate sources and systems throughout an enterprise or between enterprises and enriching that data in a manner that creates valuable and reusable information.
Invariably the data warehousing vision of the enterprise and the business drivers determine which data to source. At the same time, the requirement gathering process triggers the business justification of each of the source systems identified and the type of business data required. In this phase, business analysts and functional analysts do not pay much attention to the low level details of data transfer technical requirements. The focus lies in identifying and getting the right business data into the data warehouse environment. However, data acquisition as a strategy should not be only limited to identifying and getting right business data into the data warehouse environment; it should also focus on the following:
- Efficiently and effectively extract data from the source systems;
- Identify and document service level agreements with the source systems;
- Provide guidelines on the data transfer mechanism, escalation procedures in case of failures during data transmission;
- Supplement/enrich the data from the source systems for easy updates and changes as applicable in the data warehouse environment;
- Provide a framework to permit successful reconciliation strategies both at the source end as well as the target end.
Having said that, let us take a closer look at the operational challenges that we face in the data warehouse environment:
- Data volume growth - Over the years the size of the source data grows resulting in decreasing load window.
- Business drivers change - Business focus changes demanding in more in-time data resulting in a renewed focus to reduce latency.
- Cost containment - Competitive factors pose a sever challenge on infrastructure cost. Once the data warehouse has become operational, business focus shifts to analytics and reporting.
Most of these challenges are related to size of that data, the data refreshment policy and the data retention policy.
Data Capture Strategy and CDC
Once the data warehouse is built, attention shifts from the building of the data warehouse to its day-to-day operations. Inevitably, the discovery is made that the cost of operating and maintaining a data warehouse is high, and the volume of data in the warehouse is growing faster than anyone had predicted. The widespread and unpredictable usage of the data warehouse by the end-user DSS analyst causes contention on the server managing the warehouse. Yet the largest unexpected expense associated with the operation of the data warehouse is the periodic refreshment of data. What starts as an almost incidental expense quickly turns very significant.
Data Capture Options
Traditional data capture methods have relied heavily on the following options:
Extract all the data to flat files - These are usually from the legacy source systems, where in a batch job extracts all the data and puts in a COBOL copybook like structure.
- Cumbersome process
- Prone to errors hence confidence level on data quality is low
- Results in large OS files
- Data transfer time is more and needs robust network infrastructure and bandwidth
Triggers - This option warrants that the source system also has a DBMS as the database. Triggers are attached to each and every business events that capture the information and record those.
- Introduces point of failure concept
- Have to maintain multiple objects on source systems
- Involves a lot of triggers, capture tables, database links and custom codes
Merge-Purge Option - This option does a lot of verification and reconciliation processes with the source system data.
- Extract entire information
- See if records exist at the target system, compare and then process
- Large amount of data extracted every time
- Very complex compare codes
The first step most organizations take in the refreshment of data warehouse data is to read the old legacy databases. For some kinds of processing and under certain circumstances, directly reading the older legacy files is the only way refreshment can be achieved, for instance, when data must be read from different legacy sources to form a single unit that is to go into the data warehouse. In addition, when a transaction has caused the simultaneous update of multiple legacy files, a direct read of the legacy data may be the only way to refresh the warehouse.
As a general-purpose strategy, however, repeated and direct reads of the legacy data are a very costly. The expense of direct legacy database reads mounts in two ways. First, the legacy DBMS must be online and active during the read process. The window of opportunity for lengthy sequential processing for the legacy environment is always limited. Stretching the window to refresh the data warehouse is never welcome. Second, the same legacy data is needlessly passed many times. The refreshment scan must process 100 percent of a legacy file when only one or two percent of the legacy file is actually needed. This gross waste of resources occurs each time the refreshment process is done. Because of these inefficiencies, repeatedly and directly reading the legacy data for refreshment is a strategy that has limited usefulness and applicability.









