Most enterprise-wide data warehouses are built by integrating data from the sources within the enterprise. Sometimes reference data from an external system is stored within the EDW. In recent years, because outsourcing has become the norm for cost-effective operations, integration of data from third-party suppliers has become essential for seamless operational efficiency. The wealth of information available from research firms and marketplace transactions is becoming integral to businesses’ strategic decision-making, and as a result, this data finds its place in enterprise data warehousing. In simple terms, the boundary of an enterprise has widened as organizations have become increasingly information driven.
Integrating data from external sources, especially transactional data, is full of challenges. When a third party participates in business operations, such as shipment of products or settlement of credits, the EDW will be expected to contain the details of the third-party transactions partially because data warehousing shifted gears from being just strategic to an active operational system in recent years. Also, there could be instances that require integrating real-time transactions in a specific market for analysis to strategically position the enterprise ahead of the competition.
Typical Data Integration Challenges
While EDWs appear rather generic and accommodating from a business perspective, the features of an EDW become challenges from a data integration perspective. Three key data integration challenges with respect to an EDW are:
- Completeness of information that deals with adequacy,
- Correctness of information that deals with accuracy and
- Criticality of information that deals with on-time availability.
These challenges become more prominent with the expectation that EDWs are complete and the requirement that multiple views of data should not lead to inaccuracy while blending them with data from external sources (unlike the data mart approach). In order to address these issues, big vendors in the market started promoting industry-specific data models that can be implemented with some customization to fit the needs of the enterprise. However the challenges specific to external data sources go beyond these generic data models.
Categories of Issues with External Data Sources
Issues that may pop up while integrating data from an external source can be categorized into four main areas.
1. Compliance of external data with in-house data.
This challenge will be prominent when you try to integrate data from an external source into an existing data model. The distinction between subject areas might fade when different businesses use different terms for similar entities. For example, a customer in an EDW could be an account in the external source. Though the designers may want to decipher the true meaning of data in the external system, the challenge is that the entity in the external source may behave exactly in the same way a similar (but not the same) entity behaves in the EDW. To be more specific, the external source may call the account inactive if there is no transaction with that customer for more than 180 days. But the EDW may never call a customer inactive unless there is an explicit transaction that requests the business to close all the related accounts. Whatever the intelligent transformation rule is to handle such anomalies, the amount of effort and time required to make that transformation will far exceed the value it can bring.
Most of the time, information or documentation on third-party interfaces is very minimal, and one may never be able to concretely suggest a transformation. Sometimes, due to inadequate understanding of external data, an EDW may absorb the external data as such, containing incoherent data from multiple sources that are forced fit into the same table.
It may be worth storing such noncompliant data in standalone tables, in whichever flavor we understand the data. The integration can then be accomplished through integrated views that are supplied only to the specific users who need that third-party data. After a period of stabilization, these structures can be physically merged for performance reasons into the same tables, as the behavior of the data has been studied consistently over a period of time. There must be periodic revision of interfaces by meticulous data analysis if this data is critical for strategic decisions. Sometimes, just because the source system changed something, the EDW may start receiving garbage that will be noticed during month-end reporting.
2. Completeness of external data.
Typically for every entity, we consider some of the data quite essential, and this data cannot be missing. The data we get from an external system may or may not have all these essential elements. For example, it may be possible to receive data from the market on the transactions done by the competition; however, this information will not be complete due to anonymity of data in the interest of the customers. That said, the available pieces of information might be useful to make a few strategic decisions for the enterprise. If the anonymous details are forced to fit into the same structures containing the transactional details of the enterprise, the heterogeneity (due to missing information) will spoil the completeness of data in the EDW. Though defaulting can handle this issue technically, defaulting most of the columns except for primary keys will lead to undesirable impacts, such data losing credibility in the eyes of the users. In such cases, the design team can get this data in a look-alike entity trimmed down to fit the data, as incomplete as it is. The users may not understand the impact of populating defaults or NULLs excessively if the information is missing or incorrect. The result may be that some unnecessarily question the reliability of the data warehouse, while the fact remains that the source is simply incomplete.











Be the first to comment on this post using the section below.