Enterprises are increasingly information centric, and recent trends reveal that most competitive businesses require external data in their enterprise data warehouse to strategically position themselves in the market. This article touches upon a few critical challenges specific to integrating data from external systems as well as best practices and considerations to do so successfully.

Most enterprise-wide data warehouses are built by integrating data from the sources within the enterprise. Sometimes reference data from an external system is stored within the EDW. In recent years, because outsourcing has become the norm for cost-effective operations, integration of data from third-party suppliers has become essential for seamless operational efficiency. The wealth of information available from research firms and marketplace transactions is becoming integral to businesses’ strategic decision-making, and as a result, this data finds its place in enterprise data warehousing. In simple terms, the boundary of an enterprise has widened as organizations have become increasingly information driven.

Integrating data from external sources, especially transactional data, is full of challenges. When a third party participates in business operations, such as shipment of products or settlement of credits, the EDW will be expected to contain the details of the third-party transactions partially because data warehousing shifted gears from being just strategic to an active operational system in recent years. Also, there could be instances that require integrating real-time transactions in a specific market for analysis to strategically position the enterprise ahead of the competition.

Typical Data Integration Challenges

While EDWs appear rather generic and accommodating from a business perspective, the features of an EDW become challenges from a data integration perspective. Three key data integration challenges with respect to an EDW are:

  • Completeness of information that deals with adequacy,
  • Correctness of information that deals with accuracy and
  • Criticality of information that deals with on-time availability.

These challenges become more prominent with the expectation that EDWs are complete and the requirement that multiple views of data should not lead to inaccuracy while blending them with data from external sources (unlike the data mart approach). In order to address these issues, big vendors in the market started promoting industry-specific data models that can be implemented with some customization to fit the needs of the enterprise. However the challenges specific to external data sources go beyond these generic data models.

Categories of Issues with External Data Sources

Issues that may pop up while integrating data from an external source can be categorized into four main areas.

1. Compliance of external data with in-house data.

This challenge will be prominent when you try to integrate data from an external source into an existing data model. The distinction between subject areas might fade when different businesses use different terms for similar entities. For example, a customer in an EDW could be an account in the external source. Though the designers may want to decipher the true meaning of data in the external system, the challenge is that the entity in the external source may behave exactly in the same way a similar (but not the same) entity behaves in the EDW. To be more specific, the external source may call the account inactive if there is no transaction with that customer for more than 180 days. But the EDW may never call a customer inactive unless there is an explicit transaction that requests the business to close all the related accounts. Whatever the intelligent transformation rule is to handle such anomalies, the amount of effort and time required to make that transformation will far exceed the value it can bring.

Most of the time, information or documentation on third-party interfaces is very minimal, and one may never be able to concretely suggest a transformation. Sometimes, due to inadequate understanding of external data, an EDW may absorb the external data as such, containing incoherent data from multiple sources that are forced fit into the same table.

It may be worth storing such noncompliant data in standalone tables, in whichever flavor we understand the data. The integration can then be accomplished through integrated views that are supplied only to the specific users who need that third-party data. After a period of stabilization, these structures can be physically merged for performance reasons into the same tables, as the behavior of the data has been studied consistently over a period of time. There must be periodic revision of interfaces by meticulous data analysis if this data is critical for strategic decisions. Sometimes, just because the source system changed something, the EDW may start receiving garbage that will be noticed during month-end reporting.

2. Completeness of external data.

Typically for every entity, we consider some of the data quite essential, and this data cannot be missing. The data we get from an external system may or may not have all these essential elements. For example, it may be possible to receive data from the market on the transactions done by the competition; however, this information will not be complete due to anonymity of data in the interest of the customers. That said, the available pieces of information might be useful to make a few strategic decisions for the enterprise. If the anonymous details are forced to fit into the same structures containing the transactional details of the enterprise, the heterogeneity (due to missing information) will spoil the completeness of data in the EDW. Though defaulting can handle this issue technically, defaulting most of the columns except for primary keys will lead to undesirable impacts, such data losing credibility in the eyes of the users. In such cases, the design team can get this data in a look-alike entity trimmed down to fit the data, as incomplete as it is. The users may not understand the impact of populating defaults or NULLs excessively if the information is missing or incorrect. The result may be that some unnecessarily question the reliability of the data warehouse, while the fact remains that the source is simply incomplete.

3. Agreement between external source and EDW governance.

Though the volume of the external data may not be as much as the in-house data most of the time, the small volume of external data can take away the trust the users have in the data warehouse if the data is not predictable and understandable. It is critical that the changes in sources are periodically reviewed and that the EDW is refreshed accordingly with new transformations as required. If the data is from a supplier, it is advisable that this data expectation is contracted along with the commercial contract for the operations.

While outsourcing the operations to a third party, most enterprises do not foresee a necessity to make the data or reporting expectations part of the contract. By contracting this upfront, however, they will have a mandate on the suppliers and can even have a subsequent penalty if service level expectations around data arrival and quality are challenged. If the external source is a standard source that is available on subscription-like market research data, the subscribers are normally sent the details of changes in interfaces. The challenge could be that the interface documentation from such a standard external source is quite extensive, and dedicated analysts will periodically be required to go through it to judge the impact in the EDW.

4. Adherence to operational standards.

While analysts may thoroughly investigate the quality of data and compliance, the ignored parts are the operational aspects of transferring the data from an external system into the EDW. Though this will get prioritized while beginning the development, considering it as part of the initial design itself will make an immense improvement in schedules. If the third party does not have a provision to encrypt the data before transmitting through the network, it might mean additional cost at the last moment to enable encryption. If the two ends of the data transmission do not use similar encryption and decryption techniques, no data transmission will be possible. Some of the enterprises never allow a push operation for data files from their network, but they might allow a pull, allowing an external job into their network through specific port after appropriately tunneling the firewalls. Almost every enterprise has its own network and data transmission standards. If these standards are reviewed right at the early stages and a common mode of transfer is accepted by both the parties right at the beginning, delays during development cycle could be reduced. There have been instances when, due to non-matching data transfer standards, enterprises have lost access to external data.

While challenges in data integration are common, the scale of the issues could be beyond predictions when considering external data. By carefully addressing the challenges described in this article, either by using the techniques suggested here or by adopting other convenient approaches, an EDW can be maintained with integrity and reliability. Utmost care is required if the external data is added to an existing industry-specific data model. The minimum volume of external data that comes from external sources can corrupt the credibility of the information if due care is not given.