Analytical processing is the multidimensional analysis of an organization's historical data. It lets users across the organization access and analyze the analytical database: data gathered from disparate, heterogeneous transactional systems into a structure that allows fast access and multidimensional analysis, enabling informed decisions. An analytical system is designed to get information out for the purpose of multidimensional analysis, whereas a transactional system is designed to get data in. Data in the corporate data warehouse is analyzed and transformed into the information on which executives base their decisions. A transactional system exists primarily for business record keeping: it stores the data generated by day-to-day transactions conducted across the organization. The two systems have contrasting purposes and designs. This article distinguishes them with respect to the nature of their data and processing. (Other differences exist but are beyond the scope of this article.)
To begin, let us consider the following two statements:
- Data characteristic: Data in an analytical system is molecular; data in a transactional system is atomic.
- Analytical processing: An information consumer who is interested in a particular piece of information is also interested in other related information.
First, I’ll address the distinction in terms of data characteristics, and then in terms of (analytical) processing. Atomicity characterizes a transactional system; molecular data characterizes an analytical system. Atomicity means that each transaction is treated as a single, all-or-nothing unit of work: either every step is carried out or none is. If any step fails, the entire process is aborted and rolled back; only if all steps succeed is the transaction committed. In an analytical system, descriptive data is grouped into hierarchies at different levels, retaining the characteristics of the data while giving the metrics a meaningful context. By analogy with the physical world, data in an analytical system can be characterized as molecular because it is composed of different data items drawn from the transactional system. In any dimension, data items are grouped to reflect the organization's view of its business. This gives meaning to the business questions that users seek to answer; the questions asked are predefined. Grouping provides perspective on the data in an analytical system, and the processing of aggregation queries is what characterizes analytical data as molecular.
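The all-or-nothing behavior described above can be sketched with Python's built-in sqlite3 module. The account table and the deliberate mid-transaction failure are invented for illustration; the point is that a failed step rolls back the whole unit of work.

```python
import sqlite3

# A minimal sketch of transactional atomicity using sqlite3 (stdlib).
# Both steps of a transfer must succeed; otherwise the whole unit of
# work is rolled back and the database is left unchanged.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO account VALUES ('A', 100), ('B', 0)")
conn.commit()

try:
    with conn:  # opens a transaction; commits on success, rolls back on error
        conn.execute("UPDATE account SET balance = balance - 30 WHERE name = 'A'")
        raise RuntimeError("failure mid-transaction")  # second step never runs
except RuntimeError:
    pass

# The partial debit was rolled back: A still holds 100.
balance = conn.execute("SELECT balance FROM account WHERE name = 'A'").fetchone()[0]
print(balance)  # 100
```

Using the connection as a context manager is what gives the commit-or-rollback semantics here.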
Imagine a user asking: What was revenue? That question does not make sense on its own. What was revenue for the year 2001? – that’s better. What was revenue for the year 2001 for the product milk? – that’s better still. What was revenue for the year 2001 for the product milk in the state of Jharkhand? (See Figure 1.) – better yet. Adding dimensions to business questions provides perspective on the way we view our business.
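The narrowing effect of each added dimension can be sketched over a tiny invented fact table of (year, product, state, revenue) rows; each extra criterion restricts the aggregate further.

```python
# A sketch of how each added dimension narrows the question.
# The fact rows below are invented sample data.
facts = [
    (2001, "milk",  "Jharkhand", 120),
    (2001, "milk",  "Bihar",      80),
    (2001, "bread", "Jharkhand",  50),
    (2002, "milk",  "Jharkhand",  90),
]

def revenue(**criteria):
    """Sum revenue over fact rows matching every given dimension value."""
    cols = {"year": 0, "product": 1, "state": 2}
    return sum(row[3] for row in facts
               if all(row[cols[k]] == v for k, v in criteria.items()))

print(revenue(year=2001))                                     # 250
print(revenue(year=2001, product="milk"))                     # 200
print(revenue(year=2001, product="milk", state="Jharkhand"))  # 120
```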
Figure 1: Taxonomy of Products
The most commonly used dimension in an OLAP application is the Time dimension. Let us look at the taxonomy of the Time dimension and the way columns in the underlying relational data are associated for grouping within it. The transactional database will hold only a date field. To create the Time dimension, that field must be split into three fields, Year, Month and Day, plus a key field (formed by concatenating the split fields) that associates the dimension with the metric being analyzed. This supports multidimensional modeling by arranging these columns of the underlying database into levels of a hierarchy. Dimensional modeling using a star schema is what renders data in an analytical system molecular. What a user views in an OLAP application is aggregate data: the result of manipulating base-level data that meets specific criteria. The taxonomy categorizes the levels of abstraction of the questions that commonly occur in a business organization; those questions exploit precalculated aggregations stored in a multidimensional structure.
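The split-and-concatenate step above can be sketched in a few lines. The column names and the key format (YYYYMMDD) are illustrative choices, not a prescribed standard.

```python
from datetime import date

# A sketch of deriving Time-dimension columns from a transactional
# date field: split into Year/Month/Day levels plus a surrogate key
# built by concatenating the split fields.
def time_dimension_row(d: date) -> dict:
    return {
        "year": d.year,
        "month": d.month,
        "day": d.day,
        # key formed by concatenating the split fields, e.g. 20010315
        "time_key": int(f"{d.year:04d}{d.month:02d}{d.day:02d}"),
    }

row = time_dimension_row(date(2001, 3, 15))
print(row)  # {'year': 2001, 'month': 3, 'day': 15, 'time_key': 20010315}
```

The time_key would join the dimension table to the fact table holding the metric being analyzed.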
The benefit of aggregated data is improved performance. Users typically look first at high-level values, such as overall sales, which the OLAP engine precalculates rather than computing on request. As a result, the OLAP engine gives an almost instantaneous response even to complex queries.
Why does precalculating and storing aggregates ensure a consistent, fast response? In Figure 2, we can see that to answer a sophisticated business query such as “What is the total of all types of expenses in Ranchi?”, only one cell needs to be read:
Figure 2: Table for Storing Aggregates
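The one-cell lookup can be sketched with a dictionary standing in for the aggregate table of Figure 2. The expense types, city and totals are invented; "ALL" marks the rollup stored at load time.

```python
# A sketch of why precalculated aggregates answer queries by reading
# a single cell. The cube is hypothetical: totals keyed by
# (expense_type, city), with "ALL" marking a precomputed rollup.
cube = {
    ("travel", "Ranchi"): 40,
    ("salary", "Ranchi"): 60,
    ("ALL",    "Ranchi"): 100,  # stored at load time, not computed per query
}

# "What is the total of all types of expenses in Ranchi?" -> one lookup.
total = cube[("ALL", "Ranchi")]
print(total)  # 100
```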
Figure 2 illustrates how such decompositions exist at the transactional level and are then composed in a certain fashion, such as a star schema, that renders data in an analytical system molecular.
Aggregate data is the result of manipulating microdata: totaling the number of cases meeting specific criteria, summing microdata variables for specific subpopulations, listing cases that meet specific criteria, and so on – a result set derived through manipulation that has a specific relationship to other result sets derived during the same process.
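The three manipulations just named can be sketched over invented rows of (customer, region, amount) microdata:

```python
# A sketch of the three manipulations of microdata named above,
# over invented sample rows.
micro = [
    ("c1", "east", 10), ("c2", "east", 25),
    ("c3", "west", 40), ("c4", "east",  5),
]

# 1. Total the number of cases meeting a criterion.
east_count = sum(1 for _, region, _ in micro if region == "east")
# 2. Sum a variable for a specific subpopulation.
east_total = sum(amt for _, region, amt in micro if region == "east")
# 3. List the cases that meet a criterion.
big_cases = [cust for cust, _, amt in micro if amt >= 25]

print(east_count, east_total, big_cases)  # 3 40 ['c2', 'c3']
```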
It is an n-dimensional structure displayed in a one- or two-dimensional format. The additivity of such data imparts meaning to the questions posed by business users. There are two major forms:
- Online Analytical Processing (OLAP) – Human browsing of data summaries
- Data Mining – Using computers to apply complex analytical formulas
The taxonomy provides a useful structure in which to categorize questions, since business users will characteristically ask questions within particular levels; if you can anticipate the levels of questions users will ask, you can prepare appropriate aggregations. A business user initially will not know what questions he or she will ask. Browsing through the data will prompt related questions.
The following taxonomy of metric components helps with metric specification:
- Metric subjects are the things in the business to which we assign numerical values – the quantities of the business. Typical subjects include cost, profit, duration, value, share, rate, ratio, etc.
- Metric objects are the things in the business that we need to quantify – the things we want to count, value, prorate, etc.; the entities of the business. Typical objects include products, materials, customers, etc.
- Intersecting subjects and objects yields classes of metric, such as product cost, customer value and process duration.
- Applying strata to metric classes results in a metric syntax. A stratum is any criterion by which we wish to layer, partition, select, aggregate or otherwise manipulate the metric information. Common strata include time, geographic and organizational aspects of the business. Examples of stratified metrics are monthly employee costs and annual employee costs. Further stratification may distinguish between monthly cost per employee and monthly employee cost per department.
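The taxonomy of metric components above can be sketched by composing the three layers mechanically; the subject, object and stratum names are illustrative.

```python
from itertools import product

# A sketch of the metric taxonomy: intersecting subjects and objects
# yields metric classes; applying a stratum yields a stratified metric.
subjects = ["cost", "value"]
objects_ = ["product", "customer"]
strata   = ["monthly", "annual"]

metric_classes = [f"{obj} {subj}" for subj, obj in product(subjects, objects_)]
stratified = [f"{stratum} {cls}" for stratum in strata for cls in metric_classes]

print(metric_classes)  # ['product cost', 'customer cost', 'product value', 'customer value']
print(stratified[0])   # 'monthly product cost'
```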
A transaction may occur at the micro level of seconds, while data in an analytical system is needed on a daily or, in some cases, monthly basis. The granularity of data in an analytical system is dictated by the business requirements of the department or enterprise.
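Coarsening the grain from seconds to days can be sketched as a simple rollup; the timestamps and amounts are invented.

```python
from collections import defaultdict
from datetime import datetime

# A sketch of coarsening granularity: second-level transactions
# rolled up to the daily grain an analytical system typically needs.
transactions = [
    (datetime(2001, 3, 15, 9, 30, 12), 10),
    (datetime(2001, 3, 15, 14, 2, 47), 25),
    (datetime(2001, 3, 16, 11, 5,  3),  7),
]

daily = defaultdict(int)
for ts, amount in transactions:
    daily[ts.date().isoformat()] += amount

print(dict(daily))  # {'2001-03-15': 35, '2001-03-16': 7}
```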
The information necessary to report the result of a measurement depends on its intended use. The guiding principles are:
- Present sufficient information to allow the result to be reevaluated if new information or data becomes available
- When in doubt, provide too much information rather than too little.
How a business user processes information is based on his or her experience over several interactions with a data warehouse. This involves logical reasoning, attention, the organization of information, conceptualization and integration. As a cognitive process, analysis typically follows the route below, depending on the cognitive level of the user:
- Taking ideas apart (divide, classify, take apart),
- Learning apart (uncover, simplify, discover),
- Separating into component parts (compare, examine),
- Seeing relationships (search, analyze, break down),
- Finding unique characteristics (inspect, contrast, put into categories, sort).
To understand business intelligence from a cognitive process perspective, it is necessary to first understand intelligence. Intelligence is the ability to acquire knowledge and apply that knowledge to take effective actions.
Knowledge acquisition involves:
- Receiving data (either descriptive or quantitative) through sensory or other input streams.
- Assembling information by correlating, classifying and filtering data. Information is classified as factual (independently known and recorded items) or metric (indicative conclusions from many facts). Typical technology applications work with factual information.
- Retaining information as knowledge. Knowledge is remembered, observed and believed. Database systems serve as a business memory, capturing remembered knowledge. Every organizational culture has a structure of believed knowledge, which will influence how information is applied.
Data warehouse users work in a discovery mind-set: during the initial sessions, the user doesn’t know what he or she wants. The user begins with a rudimentary question and refines it through further interaction with the data warehouse. The initial input from the user to the analytical system is one logical unit of work, which the system processes based on precalculated aggregations. At an advanced stage, the output may itself become input to the analytical system. In theory, this process may continue indefinitely, which adds to the complexity of the cognitive process given the limits of the human mind. There are two things to explore:
- If a user initiates a session with the repository with a particular question, what are the related questions he or she wants answered?
- What is the optimum number of steps before the user either is satisfied or, the set of generated questions having become irrelevant, abandons the session?
As an example, consider a hypothetical business organization with two departments, D1 and D2. D1 is responsible for procuring raw materials in the market, and D2 is responsible for selling the finished product in the market. The term “market” means different “Objects” to these two departments of the same organization: their respective views of the market, and the way each market is organized, are completely different.
In the course of conducting their respective businesses, these two departments generate a set of subjects over a set of objects. They meticulously maintain these subjects and objects, which in turn go through an ETL process to be loaded into the data warehouse, ensuring the availability of the analytical data the two departments may need for analysis.
Now, let us consider Price and Market from the perspective of users in the two departments. To a D1 user, Price means what the department pays a supplier to procure raw material from the market; to a D2 user, it means what the department is paid for selling finished products.
When either of these two users initiates a session with the repository, they may be prompted for which Price they are interested in. Depending on the selection, the user is automatically guided to the choice of the related object. The system architecture has to be that intelligent.
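Such guidance can be sketched as a mapping from (subject, department) to the related objects to offer next. The mapping entries are hypothetical, following the D1/D2 example above.

```python
# A sketch of guiding a user to related objects once a subject is
# chosen. The mapping is hypothetical: D1 buys raw material, D2
# sells finished product, so "Price" points to different objects.
related_objects = {
    ("Price", "D1"): ["raw material", "supplier"],
    ("Price", "D2"): ["finished product", "customer"],
}

def guide(subject: str, department: str) -> list:
    """Return the objects a user should be offered next."""
    return related_objects.get((subject, department), [])

print(guide("Price", "D1"))  # ['raw material', 'supplier']
```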
A user connects to the repository with a preformulated business question in mind. On top of the repository sits the structure for navigation, or taxonomy. See Figure 3.
Figure 3: Cognitive Process
There are now two possibilities:
- The question gets answered with the report sent to the user’s screen
- Some related business questions are generated, which necessitates that other related reports must be generated
For example, if the user’s interest lies in the subject Profit, he or she may want to break that subject down into cost and revenue with the help of a formula that incorporates a business rule. That case is straightforward. Elsewhere in the enterprise-wide cognitive process, decomposing a subject may relate it to other subjects, and a user may not be satisfied looking only at the higher-level subject; that depends on the profile of the user. Moreover, the user may change the plane itself by shifting from one object to another, drilling down with the navigation structure at his or her disposal.
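The Profit drill-down can be sketched as a business rule applied to lower-level measures; the rule and the figures are invented for illustration.

```python
# A sketch of decomposing the subject Profit into Revenue and Cost
# via a business-rule formula, as in the drill-down described above.
rules = {"profit": lambda m: m["revenue"] - m["cost"]}
measures = {"revenue": 500, "cost": 320}  # invented figures

profit = rules["profit"](measures)
print(profit)  # 180
```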
The user goes on asking meaningful related questions – and sometimes meaningless ones, however remotely related they might be. There comes a point when the user exits the analytical system, either satisfied or unsatisfied.
There is yet another unanswered problem: finding the optimum number of steps, jumps or sessions a person needs before exiting the system. This assumes importance in the face of the cost incurred while a user explores analytical data in session. Answering it requires an empirical study profiling users across the organization and the time they spend in sessions.
Alok Kumar is an associate with Cognizant Technology Solutions at its Kolkata (India) center. He has more than four years of experience in data warehousing in the areas of ETL, database reporting, OLAP applications, visualization, data mining and statistical analysis. The author acknowledges inspiration from Dr. Mukund Mohan, BI Practice Director iGATE Global Solutions (formerly Mascot Systems Limited), who now works with IBM at the Bangalore center. You can contact Kumar at firstname.lastname@example.org.