Part I of this series (August 2003 issue of DM Review) reviewed the work in network analysis of complex systems, particularly the recent research into the small-world (SW) property, aristocratic-egalitarian (A-E) distinction and tipping points. Insights are popping up in strange places. Scientists from anthropology to zoology are making significant progress in understanding their particular complex systems by applying these concepts. These systems have a surprising efficiency interconnecting elements, along with unanticipated tipping points caused by minor stimuli. Understanding the underlying structure of interactions among numerous elements is the determining factor to explaining and even predicting system behavior.
This installment concentrates on the implications to the business intelligence (BI) and data warehousing (DW) fields. In particular, how can the concepts of the SW property help us understand the data in our enterprise data warehouse? Additionally, how can we leverage those insights to solve practical problems?
To answer these questions, this article suggests a methodology called Associative Link Analysis (ALA). Enterprise data richly describes many complex systems essential to our business. These systems are composed of a loosely coupled network of many interacting elements such as customers, suppliers and distributors. The intent is that, by applying ALA to our enterprise data, we can derive insights into a specific business problem and apply actionable information to alleviate that problem.
The ALA methodology is a four- stage process, as follows:
- Problem Definition: A specific business problem must be the guiding focus of the methodology from start to end. The key business entities and the business processes acting on those entities should be clearly derived from the problem statement.
- Association Specification: Mapping the entities and processes identified to the warehouse schema is the second step. Determination of the extent and quality of available data must be objectively resolved. The result of this step is to define the procedure for generating the adjacency matrices of the associative graphs to be studied. The nodes are the unit of analysis, whose instances may number from 10^{3} to 10^{9}. The links are the associations among node instances.
- Graph Analysis: Once a graph structure has been defined, various metrics can be applied to investigate average distance among nodes, degree distribution of nodes, clustering and betweenness. Some metrics focus locally on individual nodes or groupings, while other metrics focus globally on characterizing the entire structure. The process mechanism by which nodes and links are created and deleted is a key aspect to study. Additionally, ways of visualizing the important features of the graph analysis are imperative.
- Problem Intervention: Based upon the graph analysis, the last step is to do something. This may consist of: policy changes, process reengineering, alerting application, dashboard performance indicator or simply contacting a few customers. The general strategy is to foster or retard the SW property from engaging within the graph structure. A variety of intervention tactics to manipulate nodes or links should be defined based on the specific problem to be solved.
Let's examine a critical part of this methodology: translating the warehouse schema into an associative graph.
ER Schema to Associative Graph
The challenge of the association specification step is to abstract, from the overwhelming complexity within the enterprise data warehouse, those aspects giving us the greatest insights into associations among customers, products, etc. In particular, we need to distill from entity-relationship (ER) schema those key associations into a graph structure consisting of nodes (representing the entity instances as the unit of analysis) connected by links (representing the association among entity instances).^{1 }
Analysis of a graph structure requires an adjacency matrix.^{2} This matrix has an equal number of rows and columns corresponding to each of the nodes. For a directed graph with N nodes and M links, the adjacency matrix G is defined as: g_{ij} = 1 where i != j and i = 1,2...N(1) where a link from node i to node j exists; otherwise, g_{ij} is zero. The diagonal (where i = j) is usually undefined because it implies a node with a link back to itself. For undirected graphs (whose links have no direction), matrix G is symmetrical so that g_{ij} is always equal to g_{ji}.
When the strength of a link is important, the adjacency matrix may also be valued, so that g_{ij} may be any nonzero number to indicate the link's strength. A valued adjacency matrix is usually normalized if there is a specific maximum for the strength.
In this section, we will examine various schemas (recursive, many-to- many and star) and illustrate the generation of associative graphs.
Recursive Schema
Graphs are explicitly represented in relational databases through the recursive schema. A schema is recursive when it relates an entity to itself, such as supervisor to employees or bill-of- materials assemblies to parts.
The simplest example is the self- referent schema shown in Figure 1. The table EMPLOYEE has the primary key of emp-ID and a foreign key attribute of supervisor-ID, which points back to EMPLOYEE.
Figure 1: Self-Referent Schema
An instance table shows some typical data, which directly generates the corresponding graph and adjacency matrix, as shown in Figure 2. The three links in the graph are represented in the adjacency matrix with a "1" for n_{1}*n_{2}, n_{1}*n_{3}, and n_{3}*n_{4}.
Figure 2: Self-Referent Recursive Schema
Graphs generated from a self-referent schema are often uninteresting because they are limited to hierarchical structures (i.e., every node will have a maximum of one link into it).
A more interesting recursive schema is the classic bill-of-materials (BOM) schema as used in manufacturing systems. A variation that could be used by a database supporting a customer referral program is shown in Figure 3. An existing customer submits another person's name for some promotion and receives a reward. The program tracks the productivity of this referral by the units purchased. The table CUSTOMER has the primary key cust-ID and some attribute ADDRESS. An instance of table REFERRAL is created every time the first person refers the name of another person.
Figure 3: Recursive Schema and Sample Instances
Note that two one-to-many relationships tie the two tables together. Going counterclockwise (down REFERS and up REFERRED BY), a forward explosion of "who referred whom" would be listed. Going clockwise, a reverse explosion of "who was referred by whom" would be listed.
From the instance table, the graph and adjacency matrix can be directly generated, as shown in Figure 4. The graph is directed because it is important to indicate the direction of the referral.
Figure 4: Recursive Graph and Adjacency Matrices
Although it may not be permitted in an actual referral program, customer #1 refers customer #2, and sometime later customer #2 refers customer #1. This generates a bidirectional link between the two customers. Also, note that customer #3 is referred by more than one person.
The adjacency matrix in the center is not valued so that all links are of equal strength. However, the adjacency matrix on the bottom is valued based on the number of units purchased. Note that person #2 has the best performance because this person has referred the most people and the most units purchased.
Many-to-Many Schema
A more common pattern in warehouse schemas is the V-shaped many-to- many relationship between two entities. The theme is "guilty by association." The instances of one entity type are associated together based how they occur in relationship with another entity.
For example, the table CUSTOMER is related to table PRODUCT through intersection table ORDER, as shown in Figure 5.
Figure 5: Many-to-Many Schema and Sample Instances
The instance table generates a different type of graph than before, as shown in Figure 6. First, there are two types of nodes one for customers (circles) and another for products (triangles). Second, the links are not directed because the direction between customers and products is not meaningful. This is a bimodal graph a graph having two types of nodes.^{3} It often appears in social networks where people have specific affiliations with each other (such as friendships, club memberships and board of directors).
Figure 6: Bipartite Graph and Affiliation Matrices
In the center, the matrix is an affiliation matrix, which is different than an adjacency matrix. Note that it is a 4x3 matrix whose rows represent the four customers and whose columns represent the three products. Like an adjacency matrix, there is a "1" when a link is present and "0" otherwise. Like a valued adjacency matrix, the affiliation matrix can also be valued to indicate the strength to a link. In this example, the sum of a specific product ordered by a specific customer is indicated as the strength of that link.
Directly analyzing the affiliation matrix is difficult. The usual approach is to project the affiliation matrix into two adjacency matrices, one for each entity involved. The dualism with a bimodal graph results in two projections onto normal (unimodal) graphs.^{4}
Figure 7: Products Associated by Customer Orders
Consider the association among products based on the orders by customer, as shown in Figure 7. From the instance table, note that Customer 1 purchased both Product a and Product c and thus associated the two products. That resulted in the link between the two triangles and the n_{a}*n_{c} and n_{c}*n_{a} items in the adjacency matrices. The valued adjacency matrix indicates "14" as the sum of "12+2." Likewise, Customer 2 purchased both Product b and Product c.
Consider the association among customer based on the orders for products. It is like looking at the same coin but from a reversed perspective. Figure 8 shows the instance table ordered on product ID instead of customer ID.
Figure 8: Instance Table Resorted to Show Associations Among Customers
Note that Product a was purchased by both Customer 1 and Customer 4, thus associating the two customers. That resulted in the link between the two circles and the n_{1}*n_{4} and n_{4}*n_{1} items in the adjacency matrices of Figure 9. The valued adjacency matrix indicates "11" as the sum of "2+9."
In general, the adjacency matrix for the first entity is derived from the (non-valued) affiliation matrix A by: g_{ij} = (sum) a_{ik}*a_{jk} for k over columns of the second entity; while the adjacency matrix for the second entity is: g_{ij} = (sum) a_{ki}*a_{kj} for k over rows of the first entity. This projection technique identifies the cliques (i.e., associations among nodes that are fully linked), first within products and then within customers. The important aspect is the overlap among cliques, identifying the similarities among products and among customers. The nodes that are common across cliques play a critical role.
Figure 9: Customers Associated by Products Ordered
Star Schema
The final schema is the star schema, ever present in data warehousing. The focus is on a set of dimensions associated with a key event in a business process. Figure 10 lists some common examples of star schema patterns.^{5}
Figure 10: Examples of Star Schemas
Figure 11: Generalized Star Schema
Figure 11 shows a generalized star schema, consisting of a fact table F and N dimension tables D_{1}, D_{2}, ... D_{n}.
Note that each pair of dimensions is like the many-to- many relationship discussed previously. Each pair can be considered a special association and converted to a bimodal graph with its affiliation matrix and dual adjacency matrices. There is a possibility of N(N- 1)/2 pairs and hence N (N-1) associative graphs for analysis. Considering the simple example of the retail star schema in Figure 10, its five dimensions (time, product, customer, store and promotion) yield 10 affiliation matrices and 20 adjacency matrices.
How should we choose which association of a star schema to study? First, the problem definition should imply which association is likely to give us meaningful insights and effective intervention for solving the problem. Second, the two dimensions involved should have a high cardinality and rich distribution of intersection instances resulting in overlapping cliques. For example, a bad dimension would be "sex" which usually has only two instances and allows for no overlap.
Part 3 of this series will cover: the metrics (e.g., distance, degree, clustering and betweenness) used to analyze graph structures; definition for the SW property; strategies and tactics to foster or deter the SW property within a system; the issues of blanacing size, scope and heterogeneity; the implementation considerations for the ALA methodology; and the similarity to data mining within data warehousing.
References:
1. The terminology follows general graph theory. Nodes are also called vertices, actors or sites. Links are also called edges, ties or bonds.
2. See Chapter 4 of Wasserman & Faust, Social Network Analysis: Methods and Applications, Cambridge University, November 1994, for a full explanation.
3. A bimodal graph is also called a bipartite graph. Ibid., Chapter 8.
4. There is a loss of information because the projection cannot be reversed. In other words, the two unimodal graphs will not uniquely produce the original bimodal graph. Hence, techniques that analyze the affiliation matrix directly are preferred.
5. Adapted from Richard Hackathorn, "Farming Web Resources for the Data Warehouse," DM Review, June 1999.
Dr. Richard Hackathorn is president and founder of Bolder Technology, Inc., a twelve-year-old consultancy in Boulder, Colorado. Richard has more than 30 years of experience in the IT industry and is a well- known technology innovator and international educator, conducting professional seminars in 18 countries. He has written three textbooks entitled Enterprise Database Connectivity, Using the Data Warehouse (with W.H. Inmon) and Web Farming for the Data Warehouse. He earned his B.S. degree from the California Institute of Technology and his M.S. and Ph.D. degrees from the University of California, Irvine. To contact, send e-mail to richardh@bolder.com.
Register or login for access to this item and much more
All Information Management content is archived after seven days.
Community members receive:
- All recent and archived articles
- Conference offers and updates
- A full menu of enewsletter options
- Web seminars, white papers, ebooks
Already have an account? Log In
Don't have an account? Register for Free Unlimited Access