Service-oriented architectures (SOAs) and, in particular Web services, trend away from centralized monolithic applications and centralized data storage toward distributed applications as a service and distributed data as a service. In many cases, moving data to a remote location raises data ownership and storage issues of responsibility, accountability, security, privacy and legal restrictions. In addition, Sarbanes-Oxley regulations include the legal requirement to store and access original data or copies of original data, not the modified data that typically resides in data warehouses.
A recent DM Review article discussed major database products that offer universal data integration, including structured and unstructured data, and binary large object files (BLOBs) such as images, audio and video.1 However, when the large database vendors refer to data integration, they are generally talking about moving and storing all data in a single database - a data warehouse. Other data integration vendors leave data at its source and deal with it there, as with federated database systems, which fit better with the increasingly favored SOA approach than data warehouses. In fact, data warehouses seem to be in conflict with global 24x7, real-time interactions and the SOA approach.
Data Integration, Sharing and Interoperability Options
Some of the reasons that data warehouses are chosen over federated database systems in many situations are to have:
- Clean and transformed (usable) data;
- Indexes that support a wide range of queries, including fuzzy matching;
- Name variation, nearest neighbor, etc.;
- Complex queries without shutting down data source systems;
- Aggregations and other calculations; and
- Support for business and other intelligence applications.
Federated database systems typically cannot support any of these operations, besides fetching data for applications and populating associated databases to perform these operations. Enterprise search is an alternative approach, but generally cannot cope with the more structured data requirements of business monitoring, analysis and reporting. Therefore, the data integration, sharing and interoperability options facing organizations are usually a difficult choice between a data warehouse or a federated database system, as enterprise search is generally considered a distant third choice.
This article discusses a fourth choice: a dataless hybrid of data warehouses, federated database systems and enterprise search.
Problems with Universal Access to Structured Data
Many of the problems in achieving universal access to structured data involve having to deal with multiple variations in:
- Data source schemas;
- Data types and lengths;
- Data content distribution (domain values);
- Nonstandard or unclean data, e.g., typos;
- Combined and/or separated data fields, e.g., full name to last name, first name, and middle name/initial and vice versa;
- Indexes and types;
- Query processing support;
- Connections to data sources;
- Support for query standards such as SQL;
- ID authentication methods;
- Access security methods; and
- Languages.
The seemingly simplest solution to overcoming these variations is to build a data warehouse where 100 percent control can be enforced on all of the above. Unfortunately, accessing, cleansing, transforming and moving data in bulk from multiple disparate data sources and storing it all in a one-size-fits-all data warehouse has its own problems. These include time and effort, and therefore cost, and the data ownership and storage issues mentioned previously.
Problems with Universal Access to Unstructured Data
Problems with universal access to unstructured data are that there is little to no sense-making required to index unstructured data, query success can be low due to the ambiguity of words and human review is generally required to make sense of results. Some search engines are now incorporating knowledge management tools to bring more sense and structure to searches on unstructured data.
Using Unstructured Text Search on Structured Data
Another option would be to use an unstructured text search engine on structured data, as offered by some search engines. However, this only works to a limited extent for simple cases, because the following problems typically remain:
- In the case of relational databases, only records in tables with search terms are returned; no JOINs between tables are made and no validation that tables are connected in any way.
- No range queries;
- No nearest neighbor matching;
- No allowance for data transforms;
- Only return results where an exact match is found, e.g., for name, unless some form of fuzzy match index is used; and
- Records found regardless of where a search term occurs, e.g., a street address name could be found instead of a person's name, unless some form of column-level constraints are predefined.
Advantages and Disadvantages of Data Warehouses Advantages
Data warehouses tend to have a high query success, because they have complete control over the main related areas of data management systems:
- Clean data,
- Indexes - multiple types, including text, and
- Query processing: multiple options.
Other advantages include:
- High performance,
- No index or query load on data source systems,
- Almost any data source,
- Preaggregated and precalculated fields,
- Archive,
- Denormalized views,
- Data mining/link analysis options,
- Minimum existing system interference,
- Data sources not aware of queries, and
- Security access managed in data warehouse (schema, row, column and data element).
Disadvantages
There are considerable disadvantages involved in moving data from multiple, often highly disparate data sources to a one-size-fits-all data warehouse. It translates into a long implementation time, high cost, lack of flexibility, dated information and limited capabilities.
- Major data schema transforms from each of the data sources to one schema in the data warehouse can represent more than 50 percent of the total data warehouse effort.
- Data owners lose control over their data, raising responsibility, accountability, security, privacy and legal issues.
- Long implementation time and associated high cost.
- Adding new data sources takes time, with associated cost.
- Limited flexibility of use and types of users - requires multiple, separate data marts for multiple, different uses and users.
- Typically, data is static and dated.
- Typically, there are no data drill-down capabilities.
- Typically, it cannot actively monitor changes in data sources.
- Additional cost and maintenance of the data warehouse infrastructure.








