Sometimes I get the feeling we are all ostriches, burying our heads in the data so we can hide from the approaching information age. Maybe thinking "database" limits our potential to design the "distributed information architectures" that our organizations need to remain competitive. Perhaps creating a broadly distributed information architecture requires substantially more than multi-point access to reorganized data. We all talk about wanting to enter the information age, so isn't it time to replace our data-based dependency with technology designed to uniquely represent constructs of information?
Data Is Not Information
We may lump all computing under the broad title of "information systems," but, in truth, most computing to date has been "data processing," the evolution of which has been facilitated by the evolution of both computing platforms and data structures. Process-driven transaction systems drove the ascendance of the large, centralized mainframe supporting large flat files. Departmental computing (largely OLTP) gave birth to relationally powered client/server architectures. Now, the demands of analytical processing drive us toward a highly networked model: fully distributed architectures and the Internet. So what is the appropriate parallel evolution in "data structure" for OLAP?
Maybe it's not a "data structure." Some technology professionals are beginning to understand that the key to unlocking the door to the information age involves recognizing that data and information are inherently different things. In fact, data is a byproduct of transactions and data processing. Most current information systems, including business intelligence (BI), are really just data processing systems that regurgitate data compiled from different sources. Data is a building block of information but not, in itself, information.
Information results when data is delivered in such a manner that an analyst can see the answer inherent in the relationships that exist within the data. Enhancing raw data through aggregation, transformation, combination and a radically different delivery method forms information.
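To make the distinction concrete, here is a minimal sketch of the "enhancement" step described above. The sample rows, the field layout and the function name summarize_by_region are all hypothetical, invented only for illustration. The sketch aggregates raw transaction rows and adds a derived share-of-total figure, so the relationship between regions is visible at a glance instead of being buried in the rows.

```python
from collections import defaultdict

# Hypothetical transaction-style rows: (region, product, amount).
transactions = [
    ("East", "widgets", 120.0),
    ("East", "gadgets", 80.0),
    ("West", "widgets", 200.0),
    ("East", "widgets", 30.0),
    ("West", "gadgets", 50.0),
]

def summarize_by_region(rows):
    """Aggregate raw rows into a total and a share of the grand total per region."""
    totals = defaultdict(float)
    for region, _product, amount in rows:
        totals[region] += amount
    grand_total = sum(totals.values())
    return {
        region: {"total": total, "share": total / grand_total}
        for region, total in totals.items()
    }

summary = summarize_by_region(transactions)
for region, stats in sorted(summary.items()):
    print(f"{region}: total={stats['total']:.2f} share={stats['share']:.1%}")
```

Five rows collapse into two summary entries. The totals are plain aggregation; the share is the kind of derived relationship an analyst actually wants to see.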
Understanding the definition of information is perhaps best accomplished by examining its attributes within an efficient information system or, better yet, within an "information warehouse."
An information warehouse must be able to:
- Distill relevant information from massive quantities of transaction-style data into significantly smaller information structures.
- Provide fast and consistent response times to user queries, regardless of the complexity of the query.
- Allow easy distribution of information content to other sites and to other processes while being able to track and adjust the content as the information changes.
- Support information structures in a polymorphic manner. This includes modifying the query behavior to accommodate the needs of query and exploration tools and a variety of conceptual structures. Some examples of structures that must be supported include limited dimensionality (3-6 dimensions) with a heavy hierarchical model; high dimensionality (6-14 dimensions) with a flat hierarchical model; and both time-series and static data sets.
- Maintain direct links back to source data from the warehouse or OLTP system so that knowledge gained from exploration of the information warehouse can be brought into direct action without violating individual privacy.
Simple examination of this list would indicate that corporations from the Fortune 500 on down are already adopting an information warehouse approach in their data warehouse projects. Yet, for the most part, they continue to process data on servers and distribute "answers" over either an Internet or client/server structure. A true information warehouse system is able to store and distribute the actual information, not just source data or "answers."
The values delivered by such an "information warehouse" include:
Responsiveness: Users have always been impatient for answers, and Internet users are doubly impatient. The Internet provides direct and immediate feedback to their requests. As a result, users are being conditioned to expect and demand one- or two-second response times to their queries. When users hit a site or a page that is slower to respond than the preceding ones, they'll quickly hit the stop button on their browser. In the same manner, users have become impatient with environments where one information query might take two seconds while another, seemingly similar query, takes 10 minutes.
Consistency: Overall, the Internet provides consistent response times to users' queries. While exact response time can depend on the modem connection used, once connected Internet response tends to be quite consistent.
Convenience: It is impossible to be more than a few minutes from the Internet, no matter where you are located. From corporate Internet gateways to airport kiosks, users can get on-line with unprecedented ease. Thus, users are no longer content to limit their business intelligence activities to just their desk.
Accessibility: Information must be able to reside on heterogeneous systems but still be able to reassemble as needed at the point of analysis. While a number of technologies attempted to build distributed data environments several years ago, they were, for the most part, unsuccessful. One of the predominant reasons for this was the sheer mass of the data to be distributed. However, a properly designed information system can dramatically reduce the mass of data to the point where a true distributed system is practical.
The Information Delivery Stack
In its simplest form, the storage and delivery of information can be viewed as a four-part stack, as shown in Figure 1.
At the top of the stack is some form of user interface or display context. This area of the information delivery stack has traditionally received the lion's share of attention, yet it is the part most dependent on its foundations.
Next on the stack is the calculation engine. A calculation engine understands business rules and calculations and is able to form a request to the query optimizer for the required information needed to perform these calculations. Some calculation engines are designed for general-purpose business use while others have been carefully crafted and tuned to handle the specific applications needs of a business discipline.
The query optimizer understands the underlying file system and its index scheme and is able to perform information extracts in response to requests from another process. It is the duty of the optimizer to ensure that information is returned in the most expeditious manner possible.
At the lowest level exists some form of a file system. The file system holds the actual information as well as some form of index scheme. Historically, there are two primary approaches to storing and delivering query-optimized information: multidimensional OLAP and relational OLAP.
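The four layers can be sketched in a few lines of code. This is a toy illustration, not a description of any product: the class names FileStore, QueryOptimizer and CalculationEngine, the display function, and the single "average amount" business rule are all assumptions made for the example. The point is only the direction of the dependencies. Each layer talks to the one directly beneath it.

```python
class FileStore:
    """Lowest layer: holds the rows plus a simple index on one key."""

    def __init__(self, rows, key):
        self.rows = rows
        self.index = {}
        for position, row in enumerate(rows):
            self.index.setdefault(row[key], []).append(position)


class QueryOptimizer:
    """Understands the store's index and uses it instead of scanning every row."""

    def __init__(self, store):
        self.store = store

    def fetch(self, value):
        positions = self.store.index.get(value, [])
        return [self.store.rows[p] for p in positions]


class CalculationEngine:
    """Knows a business rule and asks the optimizer for the inputs it needs."""

    def __init__(self, optimizer):
        self.optimizer = optimizer

    def average_amount(self, region):
        rows = self.optimizer.fetch(region)
        if not rows:
            return None
        return sum(row["amount"] for row in rows) / len(rows)


def display(region, value):
    """Top layer: formats a result for the user."""
    return f"{region}: n/a" if value is None else f"{region}: {value:.2f}"


rows = [
    {"region": "East", "amount": 100.0},
    {"region": "West", "amount": 40.0},
    {"region": "East", "amount": 50.0},
]
engine = CalculationEngine(QueryOptimizer(FileStore(rows, "region")))
print(display("East", engine.average_amount("East")))
```

Notice that the display layer can do nothing the lower layers cannot support, which is why the top of the stack is the part most dependent on its foundations.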
ROLAP Is Just Data with a Pretty Face
Relational databases use index structures to map what data exists but do not track the full domain of the possible. These types of index approaches work well when the task at hand is to retrieve a row (or rows) of data, but they are fundamentally unable to express the relationships between two rows of data. To provide the "information" about the relationships between rows of data, they either bolt on advanced calculation engines to handle the mapping of the possible or add proprietary extensions to an existing RDBMS. This offering, often referred to as ROLAP, is still limited by the abilities of the relational database and the degree of intercommunication between the data store and the calculation engine. It also depends on the SQL language with its rowset focus, rather than an analytical language.
First Generation OLAP Does Not Scale: It Explodes!
Before you believe OLAP alone is the solution to creating a truly distributed information architecture, step back to see the big screen. The developers of OLAP systems recognized the limitations of the relational model and tried to overcome them by creating proprietary storage systems. Unfortunately, most of these storage systems were only partially successful, as they are still, at their core, data management systems. Yet, it is important to acknowledge the impact of these systems. Even if they are not the optimal solutions, by providing analytical engines coupled to data stores, a substantial number of users within the enterprise have some way to acquire significant business intelligence.
Ironically, the close tie between the analytical engines and their data storage mechanisms, the source of their current success, is also a major impediment to building a truly distributed information delivery system. This is due to the fact that data accumulates rapidly; thus the amount of stored data can quickly become untenable. A multidimensional storage system must fully map the domain of the possible. This provides a data structure that can identify the relationships between data points but, in doing so, ensures a tremendous overhead penalty for creation and storage. To manage this data explosion, corporations are compelled to invest in massive hardware systems or to force users to compromise their information needs.
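A little arithmetic shows why fully mapping the domain of the possible is so expensive. The dimension sizes and the populated-row count below are hypothetical figures chosen for illustration, not measurements from any system. The number of cells a fully mapped cube must account for is the product of its dimension cardinalities, while the number of cells that actually hold data is usually a tiny fraction of that.

```python
from math import prod

def cube_cells(cardinalities):
    """Cells in a fully mapped cube: the product of the dimension sizes."""
    return prod(cardinalities)

def density(populated_cells, cardinalities):
    """Fraction of the possible cells that actually contain data."""
    return populated_cells / cube_cells(cardinalities)

# Hypothetical dimension sizes for a small retail cube.
dims = {"product": 1000, "store": 200, "day": 365, "customer_segment": 10}
populated = 2_000_000  # assumed number of cells that hold real data

cells = cube_cells(dims.values())
print(f"possible cells: {cells:,}")
print(f"density: {density(populated, dims.values()):.4%}")

# Adding one more dimension multiplies the possible cells by its size.
dims_with_month_type = {**dims, "promotion": 12}
print(f"with one more dimension: {cube_cells(dims_with_month_type.values()):,}")
```

Under these assumptions the cube has 730 million possible cells, fewer than one percent of which are populated, and a single additional 12-member dimension pushes the total past 8 billion. The data did not grow; only the space that must be mapped did.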
Next Generation OLAP Delivers Information
Analytical repository offerings of the new generation, such as Microsoft PLATO, Sand Technology Nucleus and QueryObject System, are directly addressing the weaknesses of the previous generation using technologies such as hybrid relational/multidimensional, token database structures and mathematical representations of information that avoid sparse data problems.
Microsoft PLATO, due to Microsoft's tremendous marketing machine, is probably the best known. Its main strengths lie in being the first OLE/DB for OLAP server. The power of the OLE/DB for OLAP should not be underestimated. Despite the best efforts of the OLAP Council, this is likely to be the first widely accepted analytical interface that is not constrained by SQL. However, PLATO is clearly targeted at smaller information domains and is somewhat limited in its first release by its direct tie to SQL Server and NT. Key contribution: Introduces and enforces a new data access method specifically for analytical queries.
Sand's Nucleus is available on more scalable platforms and is also able to work with more data types than PLATO. Nucleus is interesting in that it creates a database structure using a token-based, bit-vectored representation of the source data that results in quite rapid query response. This structure introduces computing economy at many levels from disk storage to CPU efficiency. Key contribution: Fast build, rapid query response, and schema flexibility compared to most OLAP technologies.
The QueryObject System differentiates itself by allowing users to rapidly access very large volumes of complex data transformed into compact, highly distributable objects that can be analyzed using industry-standard tools and techniques. QueryObject's focus on representing the information about the data relationships mathematically avoids the "data explosion" problem, allowing for information repositories that are both faster to build than traditional tools and also significantly smaller in footprint. Advanced indexing against these "answer sets" ensures users of quick and consistent response to queries. Key contribution: Fast build and consistent query performance without data explosion.
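For readers unfamiliar with the general idea of a token-based, bit-vectored structure, the following is a generic bitmap-index sketch. It is not Nucleus's actual design, which is not described here; the function names build_bitmaps and rows_of and the sample columns are hypothetical. Each distinct column value (a token) is stored once and paired with a bit vector in which bit i is set when row i holds that value. A multi-condition query then becomes a bitwise AND.

```python
def build_bitmaps(values):
    """Map each distinct value to an int whose bit i is set when row i holds it."""
    bitmaps = {}
    for row_id, value in enumerate(values):
        bitmaps[value] = bitmaps.get(value, 0) | (1 << row_id)
    return bitmaps

def rows_of(bits):
    """Decode a bit vector back into a list of row ids."""
    result = []
    row_id = 0
    while bits:
        if bits & 1:
            result.append(row_id)
        bits >>= 1
        row_id += 1
    return result

# Hypothetical source columns, one entry per row.
region = ["East", "West", "East", "East", "West"]
product = ["widgets", "widgets", "gadgets", "widgets", "gadgets"]

region_index = build_bitmaps(region)
product_index = build_bitmaps(product)

# "East AND widgets" is a single bitwise operation over the two vectors.
matches = region_index["East"] & product_index["widgets"]
print(rows_of(matches))  # prints [0, 3]
```

The economy comes from two places: repeated values are stored as one token instead of once per row, and combining conditions is cheap machine-level bit arithmetic.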
Information Bases for the Information Age
Client/server OLAP, or BI, is the last "data processing" step before truly entering the information age. At best, it advances data from simple points on a curve or cells in a spreadsheet to seminal elements of distributable information. But, until BI sheds its legacy obsession of managing data instead of working with information system architecture, its progress toward becoming a truly distributed information system will be decisively limited. That's why leading-edge corporations are looking to newer technologies that address the issues of the information age as the core of their information delivery system.
And what about moving beyond BI into the world of "knowledge management?" Well, that's another story. And you can bet it won't be written in SQL!
Matthew Doering, vice president of product management for QueryObject Systems Corporation, has over fifteen years of experience in the BI industry on both the end-user and product development sides. Doering will entertain rebuttals from those who think they can ride the SQL horse into the information age at email@example.com.