Many companies, particularly those undertaking business intelligence (BI) initiatives, struggle with data issues. The problem is rarely a lack of data, but rather an overabundance. To make matters worse, despite decades of experience in using databases to manage information, disparate data is often still spread across the enterprise, which poses serious challenges. Thus, the difficulty business analysts struggle with when looking for the few gems of valuable information that will help propel the organization toward its goals is beyond finding a needle in a haystack. It is more akin to finding the head of a pin in a haystack and the shaft in a wheat bale and then assembling them into a completed pin.
Strategies that Create Disparate Data
The problem has many constituent parts that result from both legacy systems and current actions. Considering the legacy issues first, organizations still suffer the negative impacts of past efforts to decentralize processing. When low-cost servers and client/server computing entered the scene, many companies delegated some or, in a few cases, all IT decisions to user departments. This strategy had advantages. User departments were more likely to choose applications that actually fulfilled their true business needs, rather than the needs perceived by the IT department.
One disadvantage of a decentralized approach is that there often aren't enterprise-wide standards. The customer_number field in one department's database might be cust_id in another. One department might use Oracle, while another uses DB2. The engineering department and all European subsidiaries might use metric measurements, while the American marketing department uses U.S. measures. One department might store dates in a yyyymmdd format, while another uses dd/mm/yy. The list of possible disparities is almost endless.
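Reconciling even one of these disparities requires an explicit conversion rule. As a minimal sketch, assuming hypothetical format strings for the two departments' date conventions, a normalization step in Python might look like this:

```python
from datetime import datetime

def normalize_date(value: str, source_format: str) -> str:
    """Convert a date string from a department's local format to ISO yyyy-mm-dd."""
    return datetime.strptime(value, source_format).strftime("%Y-%m-%d")

# One department stores yyyymmdd, another dd/mm/yy; both normalize identically.
print(normalize_date("20240315", "%Y%m%d"))    # -> 2024-03-15
print(normalize_date("15/03/24", "%d/%m/%y"))  # -> 2024-03-15
```

The same idea extends to field names and units: each disparity gets one declared rule, applied consistently wherever the data crosses a departmental boundary.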
Decentralized computing isn't the sole cause of disparate data. For example, a company that adopts a new enterprise-wide database standard typically can't replace all of its legacy databases immediately. Or, when mergers and acquisitions unite companies with different applications, those applications may need to be maintained until all of the systems are migrated, and a few applications from each predecessor company may be retained beyond that point. In addition, some organizations tolerate disparate data because they believe that the benefits of using best-of-breed applications, rather than a comprehensive, integrated application suite, exceed the costs of dealing with dissimilar databases.
The Need for Data Integration
Data format and platform inconsistencies are not issues as long as data use is isolated within individual applications. However, data needs to be aggregated, and discrepancies must be reconciled for enterprise-wide query, reporting and analysis. Furthermore, some operational applications require the integration of data from a variety of diverse sources. For example, an effective customer relationship management (CRM) application aggregates data from sales, accounts receivable, customer service and other databases to provide users with 360-degree views of customers and prospects. When integrating data, any incongruities need to be reconciled by transforming the data coming from one or more of the source databases to conform to the data schema and definitions used in the merged or target database.
Even in the absence of inconsistencies, data transformations are typically required to support BI initiatives. For example, in the name of efficiency, data may be denormalized in a data warehouse to avoid the frequent, massive table joins that business analysis tasks would require if they were run against databases structured to support transaction processing. Other transformations, which are common when moving data from production databases to data warehouses, include changing codes to meaningful labels to avoid the need for frequent table lookups and deriving search keys that will be useful to business analysts.
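A code-to-label transformation of the kind described above can be sketched as follows; the status codes, labels and column names here are hypothetical:

```python
# Hypothetical status codes used in the production database. The labels are
# materialized in the warehouse so analysts avoid a lookup-table join.
STATUS_LABELS = {"A": "Active", "I": "Inactive", "P": "Pending"}

def decode_row(row: dict) -> dict:
    """Replace a coded status value with its human-readable label."""
    out = dict(row)
    out["status"] = STATUS_LABELS.get(row["status"], "Unknown")
    return out

print(decode_row({"cust_id": 42, "status": "A"}))
# -> {'cust_id': 42, 'status': 'Active'}
```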
It's important to note that even in the absence of a need for data transformations, when an organization decides to undertake ongoing, comprehensive BI initiatives, a data warehouse that is isolated from the production databases is essential. Otherwise, the strain that business analysis places on the databases would cause transaction processing to slow unacceptably. Stocking this separate data warehouse is the job of extract, transform and load (ETL) or data replication software, which is also the most appropriate component of an information infrastructure to host data transformation functionality.
Keeping It Real Time
Any avid reader of the popular business press will no doubt have seen a number of clichés concerning the pace of change in business today. The words may be overused, but business does move much faster than it did in the past, and the IT field bears much responsibility for that. Marketing lead times and the time-to-market for new products and services, among other business measures, can now be much shorter than before, thanks to the use of IT. Businesses have also taken advantage of this speed to get a jump on the competition in increasingly aggressive, globalized markets.
In this speed-obsessed environment, business executives, managers and analysts demand to know what is happening right now, rather than what happened a month ago, last week or even just yesterday. Consequently, many companies are scrapping the snapshot extracts with which they used to stock their data warehouses monthly, weekly or daily. Instead, they are implementing data replicators that capture production database updates, transform that data and copy it into the data warehouse in near real time.
Not all companies need near real-time capabilities today. Nonetheless, because of the trend toward shortened product cycles and more fast-paced competitive environments, companies that don't need it now will likely need it soon. Implementing real-time replication and transformation software now ensures that this capability will be available when it is required, without the need for a costly, disruptive software replacement.
Data Transformation Requirements
Data transformation functionality is not all created equal. Look for the following features when evaluating solutions:
Filtering. A data warehouse often does not require all of the data in the production databases. In today's strict regulatory environment, many companies cannot delete any data, ever, yet some data is simply not relevant to business analysts. For instance, a manufacturer of consumer packaged goods that customers buy frequently may decide that data more than three years old is irrelevant when studying market trends because conditions in its markets change rapidly. In addition, quantitative business analysis doesn't employ unstructured data such as product pictures and descriptive text. Moving such data into a data warehouse wastes storage space and network bandwidth, and the data's presence can bog down analytical processing.
Likewise, when using replication to share data among applications, the target application often does not require some of the data collected by the source. Storing that data in the target application's database simply because it exists in the source would require altering the target schema. In the case of a purchased application, this may complicate the upgrade process when the vendor delivers a new version of the software and may even void the support contract with the vendor.
The data replicator or ETL product should, therefore, be able to select only the required table rows and columns when delivering data to the target.
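Row and column filtering can be sketched as a simple generator; the column names and the three-year cutoff below are illustrative assumptions, not from any particular product:

```python
from datetime import date, timedelta

# Hypothetical projection list and retention rule for the warehouse feed.
KEEP_COLUMNS = ("cust_id", "order_date", "amount")   # columns the warehouse needs
CUTOFF = date.today() - timedelta(days=3 * 365)      # roughly three years of history

def filter_rows(rows):
    """Yield only recent rows, projected down to the required columns."""
    for row in rows:
        if row["order_date"] >= CUTOFF:
            yield {col: row[col] for col in KEEP_COLUMNS}
```

Unstructured columns such as product pictures simply never appear in the projection list, so they never cross the wire.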
Schema mapping. Because source and target database schemas invariably differ, the product should be able to map between the schemas. This can be a simple process or quite complex, requiring such functions as lookups to retrieve common key values.
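At the simple end of that spectrum, a name-for-name mapping (with no lookups) might be sketched as follows, using hypothetical column names:

```python
# Illustrative source-to-target column-name mapping; unmapped names pass through.
COLUMN_MAP = {"customer_number": "cust_id", "tel": "phone_number"}

def map_schema(row: dict) -> dict:
    """Rename a row's columns from the source schema to the target schema."""
    return {COLUMN_MAP.get(col, col): val for col, val in row.items()}

print(map_schema({"customer_number": 7, "tel": "555-0100", "city": "Lyon"}))
# -> {'cust_id': 7, 'phone_number': '555-0100', 'city': 'Lyon'}
```

The complex end adds lookup steps, where a value in the source row keys a query against a reference table to retrieve the target's surrogate key.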
Column merging. A single logical unit of data might be contained in a different number of columns in the source and target databases. For example, a date may be stored as separate year, month and day columns in one database, but as a single date column in another. Or an address may be separated into street number and street name columns in one database, but not in the other. To reconcile such differences, the transformation engine should be able to merge or concatenate columns.
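The year/month/day example above can be sketched as a single concatenation step; the column names are assumptions:

```python
def merge_date(row: dict) -> str:
    """Concatenate separate year/month/day columns into one ISO date column."""
    return f"{row['year']:04d}-{row['month']:02d}-{row['day']:02d}"

print(merge_date({"year": 2024, "month": 3, "day": 5}))  # -> 2024-03-05
```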
Column splitting. Column splitting is the mirror image of column merging and is required for the same reason. To split a column, the transformation engine extracts from the source value only the substring to be loaded into each target column.
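For example, splitting an address into street-number and street-name columns might be sketched as follows; the split-at-first-space rule is a simplifying assumption that a real engine would make configurable:

```python
def split_address(address: str) -> tuple[str, str]:
    """Split '221B Baker Street' into street-number and street-name values."""
    number, _, name = address.partition(" ")  # split at the first space only
    return number, name

print(split_address("221B Baker Street"))  # -> ('221B', 'Baker Street')
```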
Column derivation. It is often necessary to derive columns when transforming data. For example, the sales department may assign territories based on country and state boundaries, but business analysts may want to regularly look at the data based on territories that are defined in another way. If so, to improve the efficiency of BI operations, as data is copied into the data warehouse, it should be possible to derive an indexed territory column from, for example, the country and postal code.
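A sketch of such a derivation follows, using an invented territory rule based on country and the leading postal-code digit; the rule and names are hypothetical:

```python
# Invented rule: analysts' territories depend on country plus the leading
# postal-code digit, not on the sales department's regions.
def derive_territory(country: str, postal_code: str) -> str:
    """Derive an analyst-facing territory column at load time."""
    if country == "US":
        return "US-East" if postal_code[0] in "01234" else "US-West"
    return "International"

print(derive_territory("US", "90210"))  # -> US-West
```

Deriving and indexing the column once at load time means analysts never pay for the computation at query time.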
Data conversion. The product should include a large set of common built-in transformations such as conversions between U.S. and metric measurements and between different date formats.
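Such a conversion set might be organized as a table keyed by (source, target) unit pairs; the two entries below are merely illustrative:

```python
# Illustrative built-in conversions, keyed by (source_unit, target_unit).
CONVERSIONS = {
    ("mi", "km"): lambda v: v * 1.609344,    # miles to kilometers
    ("lb", "kg"): lambda v: v * 0.45359237,  # pounds to kilograms
}

def convert(value: float, src: str, dst: str) -> float:
    """Apply the built-in conversion for the given unit pair."""
    return CONVERSIONS[(src, dst)](value)

print(round(convert(10, "mi", "km"), 3))  # -> 16.093
```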
Extensibility. No software vendor can anticipate every data transformation that every company will require. The software should, therefore, allow companies to code their own transformations.
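One common way to support site-specific transformations is a plug-in registry; this sketch assumes (hypothetically) an engine that invokes registered functions by name:

```python
# Minimal plug-in registry: sites register custom transformations by name,
# and the engine looks them up when executing a mapping.
CUSTOM_TRANSFORMS = {}

def transform(name):
    """Decorator that registers a custom transformation under a name."""
    def register(fn):
        CUSTOM_TRANSFORMS[name] = fn
        return fn
    return register

@transform("strip_dashes")
def strip_dashes(value: str) -> str:
    return value.replace("-", "")

print(CUSTOM_TRANSFORMS["strip_dashes"]("123-45-6789"))  # -> 123456789
```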
Heterogeneous support. Because the source and target databases will often reside on different platforms, the software should support all major hardware, operating systems and database management system (DBMS) platforms, as well as versions for both the source and target.
Planning Is Key
Regardless of why data needs to be transformed, planning is critical. Failure to do so will almost certainly result in the future need for costly database conversions and possibly the need to recode applications and transformation logic. The questions that need to be answered before beginning any data replication and transformation project include the following:
- What information is needed by which systems and individuals?
- Will there be a need for bidirectional data sharing and therefore the ability to apply mirror image transformations depending on the data flow direction?
- Is near real-time replication required, or is it good enough to refresh the data nightly or on a less frequent schedule?
- How much data will be transferred?
- Do the data source and target servers have sufficient processing capacity to capture the information, transform it as required and load it on the target?
- Do the servers have sufficient storage capacity?
- Does the network have enough capacity to transfer the data with adequate speed?
- If the target server does not currently exist (e.g., when creating a new data warehouse rather than sharing data between existing applications), what are the required target server specifications?
- What transformations are required when moving data from the source to the target? When sharing data between two existing applications, the answer is obvious, as it is defined by the source and target applications. On the other hand, new data warehouses and/or marts need to be architected from scratch to best serve their purposes.
- Are there any regulatory issues that must be addressed? In a number of industries, privacy regulations dictate that some data cannot flow freely across the wire. Will it be necessary to encrypt data sent on a local network and/or external network?
- What are the auditing requirements? Even in the absence of government regulations, auditors may insist on accurate records being kept of all transfers and transformations of certain types of data.
Helping business analysts find the golden needles scattered across the metaphorical haystacks, and helping other users optimize the return on their information assets by better sharing data among disparate applications, can increase a company's profitability significantly. And it doesn't have to be difficult. Comprehensive data replication and transformation software is the key to meeting the challenge.