Part 1 of this article (August 2003 issue of DM Review) reviews the work in network analysis of complex systems, particularly the recent research into the small-world (SW) property, aristocratic-egalitarian (A-E) distinction and tipping points. Part 2 (September 2003 issue of DM Review) applies these concepts to the business intelligence (BI) and data warehousing (DW) fields with a new methodology called Associative Link Analysis (ALA) by discussing the translation of typical warehouse schema into an associative graph form. This article, the final in the series, describes several metrics for analyzing graphs, strategies and tactics based on the SW property, and implementation issues.
The next step is the analysis of the graph structures generated from the warehouse schema. Many of the metrics (or analytics) for characterizing graph structures have emerged from work in the social network analysis (SNA) field, which studies interactions among people in terms of friendship, marriages, authority, citations and the like.1 Within the context of any business system, associations will have a social basis even though the unit of analysis may not be a person but an object (e.g., product) or event (e.g., sales transaction). Following are several metrics that can give practical insights into the nature of these graphs.2
Distance (or path length) between two nodes is the number of links in the shortest (or geodesic) path between them. For the entire graph, the average distance is computed over all pairs of nodes.3 This average distance is the "degree of separation" that is the primary characteristic of the SW property, in which any node seems close to all other nodes. Studies indicate that the average distance in real systems varies from two to eight, even when the number of links runs into the thousands or millions.4
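As a rough sketch (assuming the graph is held in memory as a Python adjacency dictionary; all function names here are illustrative, not part of any standard tool), average distance can be computed with breadth-first search from every node:

```python
from collections import deque

def bfs_distances(graph, start):
    """Distance (in links) from start to every reachable node, via breadth-first search."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist

def average_distance(graph):
    """Average geodesic distance over all ordered pairs of connected nodes."""
    total, pairs = 0, 0
    for node in graph:
        for other, d in bfs_distances(graph, node).items():
            if other != node:
                total += d
                pairs += 1
    return total / pairs

# A tiny hub-and-spoke graph: node 0 links to all others.
star = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}
print(average_distance(star))  # 1.5 -- each spoke is 1 from the hub, 2 from other spokes
```

Note how the hub pulls the average down: even in a much larger star, no node is more than two links from any other, which is the SW effect in miniature.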
Degree of a node is the number of links connected to it.5 By computing the frequency distribution on degree, we can see how many nodes have one link, two links, and so on. In Part 1, we discussed hubs in scale-free networks as nodes having a very high degree, far greater than the average degree. Hubs play a key role in enabling the SW property in aristocratic systems. The degree distribution is usually compared with a power-law curve and characterized by the exponent of that curve. The difficulty with this metric lies in estimating the tail of the distribution because of the wide variability among high-degree nodes.
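A minimal sketch of the degree distribution (same illustrative adjacency-dictionary representation as above) is a one-line frequency count:

```python
from collections import Counter

def degree_distribution(graph):
    """Frequency distribution over degree: how many nodes have k links each."""
    return Counter(len(neighbors) for neighbors in graph.values())

graph = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}
print(degree_distribution(graph))  # three nodes of degree 1, one hub of degree 3
```

Plotting these counts on log-log axes and fitting a straight line to the slope is the usual (if tail-sensitive) way to estimate the power-law exponent.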
Clustering is a simple concept with multiple interpretations. In social systems, clustering is the probability that a friend of your friend is also a friend of yours. Consider three nodes connected by at least two links. Clustering is the probability that the third link is also present. To compute it, find all such connected triples in the graph and calculate the percentage that close into complete triangles. Clustering is important because it indicates the formation of cliques that can interact efficiently.
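The triple-counting procedure above can be sketched directly (again assuming an illustrative adjacency dictionary with links listed in both directions):

```python
from itertools import combinations

def clustering(graph):
    """Fraction of connected triples (two links meeting at a node) closed into triangles."""
    triples = closed = 0
    for node, neighbors in graph.items():
        for a, b in combinations(neighbors, 2):  # each pair of this node's "friends"
            triples += 1
            if b in graph[a]:                    # is the third link present?
                closed += 1
    return closed / triples if triples else 0.0

triangle = {0: [1, 2], 1: [0, 2], 2: [0, 1]}   # a complete clique
path = {0: [1], 1: [0, 2], 2: [1]}             # friend-of-a-friend, no third link
print(clustering(triangle), clustering(path))  # 1.0 0.0
```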
Betweenness of a node is the number of shortest paths between pairs of other nodes that pass through this node. It can be viewed as a measure of resilience, indicating the impact that the removal of a node could have on the graph. Studies show that the betweenness distribution follows a power-law curve like the degree distribution; hence, this metric may lead to a different interpretation of the SW property. In other words, a node may not be a hub, but it can still be the Grand Central Station for many other nodes.
PageRank measures the authority (i.e., relevance or importance) of a node by summing the "votes" of nodes pointing to it. Based on the work by Larry Page and others, this metric has become famous through its prominent use by Google.6 The algorithm is iterative: an initial guess is repeatedly refined until it converges on the correct value. Although applied extensively to the Web, its usage in other domains is gaining interest.
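The iterative refinement can be sketched as a power iteration (this is the textbook formulation, not necessarily what Google runs; the damping factor of 0.85 is the conventional choice):

```python
def pagerank(links, damping=0.85, iterations=50):
    """Iterative PageRank: start from a uniform guess and refine until it settles."""
    nodes = list(links)
    rank = {v: 1.0 / len(nodes) for v in nodes}
    for _ in range(iterations):
        new = {v: (1 - damping) / len(nodes) for v in nodes}
        for v, outgoing in links.items():
            if outgoing:
                share = damping * rank[v] / len(outgoing)
                for w in outgoing:
                    new[w] += share          # each node "votes" for nodes it points to
            else:
                for w in nodes:              # dangling node: spread its rank everywhere
                    new[w] += damping * rank[v] / len(nodes)
        rank = new
    return rank

# Two nodes pointing at a third: the third accumulates the authority.
ranks = pagerank({"a": ["c"], "b": ["c"], "c": []})
print(max(ranks, key=ranks.get))  # "c" ends up with the highest rank
```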
Component is a subgraph in which there is a path between every pair of its nodes and no path to any node outside it. Simply stated, picking up one node of a component will pick up all the other nodes of that component and no others. Each component is an island unto itself. There are no links between components; thus, there is no way for nodes in different components to interact. Analyzing the nature and distribution of components in large systems is important.
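The "pick up one node and everything attached comes along" idea translates directly into a graph traversal (same illustrative adjacency-dictionary form as the earlier sketches):

```python
def components(graph):
    """Partition the graph into its connected components ("islands")."""
    seen = set()
    result = []
    for start in graph:
        if start in seen:
            continue
        # "Pick up" one node; everything reachable from it comes along.
        island = set()
        stack = [start]
        while stack:
            node = stack.pop()
            if node in island:
                continue
            island.add(node)
            stack.extend(graph[node])
        seen |= island
        result.append(island)
    return result

two_pairs_and_a_loner = {0: [1], 1: [0], 2: [3], 3: [2], 4: []}
print(len(components(two_pairs_and_a_loner)))  # 3 islands
```

A typical finding in large real systems is one "giant" component holding most nodes, surrounded by many small islands; the size distribution of those islands is itself worth analyzing.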
Other metrics abound. The social network analysis community has a rich heritage and an active research agenda. Centrality, reciprocity, reachability, prestige, balance, cohesiveness, equivalence, social capital and assortative mixing are examples. The challenge is to assess whether a metric produces practical insights into business problems as reflected by enterprise data.
Analysis of the process dynamics behind the creation and deletion of nodes and links is an emerging area for studying the SW property.7 An example of such an analysis is the SIENA model.8 Most analyses are static in that only one snapshot of the system is captured. Because data about social systems is often difficult to collect (via questionnaires or direct observation), longitudinal studies with multiple snapshots are rare. However, the situation is different with the enterprise data warehouse. Data is constantly changing from the effects of thousands and even millions of transactions per day. Instead of studying changes on a monthly basis, certain business problems, such as telephone activity or equity trading, may warrant analysis on a daily or even hourly basis.
Small-World Property Defined
Throughout this series, our focus has been the SW property; yet, we have not defined the term except to paint its effect at social gatherings. The technical definition is a subject of continuing debate among researchers. The general consensus is: a system (as abstracted into a graph structure) exhibits the SW property when it has a low average distance and a high average clustering. The comparison (of "low" and "high") is to a "random" graph (i.e., links inserted between randomly selected pairs of nodes) of equal size. Within SW systems, there is an efficiency that maximizes connectivity while minimizing redundancy.
Strategies and Tactics
Based on the graph analysis, we must plan the strategies and tactics needed to take action. The approach is to focus on the SW property, looking for indications of tipping points within the systems studied. The strategy is to intervene by fostering or deterring the emergence of the SW property. In particular, there are two issues: determine whether the system currently exhibits the SW property, and decide whether the system should exhibit the SW property in the future. Figure 1 shows the four resulting strategies.
Figure 1: Strategies Involving Small-World Systems
- Destroy (SW --> nonSW): A system currently exhibits the SW property, which we wish to eradicate. In this situation, the SW property is working against the interests of our company. For example, the SW property has created an irrational urge within the market to purchase a competitor's product.
- Prevent (nonSW --> nonSW): A system does not currently exhibit the SW property, which we wish to avert. In this situation, the SW property is again not in our interests. However, we are in a reactive stance (i.e., keeping things the way they are), rather than proactive. For example, a viral epidemic among a population will spread rapidly once it achieves the SW property.
- Defend (SW --> SW): A system currently exhibits the SW property, which we wish to retain. In this situation, the SW property is in our interests. For example, a system that is the word-of-mouth networking among our customers should retain the SW property.
- Achieve (nonSW --> SW): A system does not currently exhibit the SW property, which we wish to attain. In this situation, the SW property is in our interests. For example, the SW property would be highly beneficial for creating the buzz about our product among the targeted market segment.
To intervene with these strategies, we can employ the following tactics:
- Hub-centric: Identify the hubs (nodes having an unusually high degree) and assess their role in the overall system. Match the degree distribution to the power-law curve to determine whether the system takes an aristocratic approach to the SW property. If so, focus efforts on the hubs.
- Cluster-centric: Identify the strong clusters (or cliques). Then, identify the links (i.e., bridges) or nodes (i.e., cutpoints) connecting dissimilar clusters and assess their role. Focus efforts on those bridges/cutpoints.
- Resiliency-centric: Identify the nodes that have a high "betweenness" factor and assess their role of providing resiliency to the system. Focus efforts on those nodes.
- Authority-centric: Identify the nodes that have a high PageRank and assess their role of authority or importance in the system. Focus efforts on those nodes.
Using these tactics, take action depending on the strategy, as shown in Figure 2. Figure 2 is a simple paradigm for translating graph analysis into problem intervention with the objective of resolving the targeted business problem.
Figure 2: Tactics to Implement the Strategies
Size, Scope and Heterogeneity
When analyzing graph structures and planning intervention strategies, analysts must deal with three delicate issues: size, scope and heterogeneity.
Size lures analysts into a false sense of simplicity. When dealing with one hundred nodes or fewer, graph structures seem appealing, displaying unusual textures that tempt interpretation. When dealing with one thousand nodes or more, graph structures seem unapproachable, concealing their mysteries in dense clouds of lines and dots. Being human, we scale the solution to fit our abilities and deal with only one hundred nodes or fewer. However, the reality of enterprise data just begins to be captured at one thousand nodes. Further, the SW property becomes interesting and, I would argue, increasingly useful at this level of complexity. We define this level as very large graph structures (VLGSs).
Scope compels analysts to be bipolar with VLGSs. Analysts are torn between local metrics (characterizing individual nodes or links) and global metrics (characterizing the entire graph). Most analyses have dealt with small graphs, in which focusing on the role of individual nodes is relatively close to focusing on the behavior of the entire system. With VLGSs, this is not the case. Analysts are either lost in the forest using local metrics or lose sight of the trees using global metrics. Both extremes are undesirable, leaving no middle ground.
Heterogeneity hides important variations from analysts. There are significant hills and valleys in VLGSs. For example, the elevation of Boulder, Colorado, is 5,430 feet, and the average elevation of the State of Colorado is 6,800 feet. Those statements hide the fact that Colorado is divided roughly in half, along a north-south line, into flat eastern plains and high western mountains. Analyzing the whole of Colorado with the same global metrics produces misleading results.
These issues (size, scope and heterogeneity) present several dilemmas in which analysts must strike the right balance.
The first dilemma for analysts is the lack of adequate VLGS tools. These analyses consume much computing power. Many graph algorithms (such as finding the shortest path between two nodes) take on the order of N² time. This implies that an algorithm that takes one second for a thousand nodes will take roughly twelve days for a million nodes and some 317 centuries for a billion nodes. In addition, VLGS visualization has inherent limitations. A person is an excellent pattern analyzer for graphs up to several hundred nodes. However, a VLGS is impossible to visualize in any meaningful sense using traditional techniques.9
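The arithmetic behind those estimates is worth making explicit. For an O(N²) algorithm, runtime grows with the square of the node-count ratio:

```python
def scaled_runtime(base_nodes, base_seconds, target_nodes, exponent=2):
    """Extrapolate an O(N**exponent) algorithm's runtime to a larger graph."""
    return base_seconds * (target_nodes / base_nodes) ** exponent

million = scaled_runtime(1_000, 1.0, 1_000_000)       # 1e6 seconds
billion = scaled_runtime(1_000, 1.0, 1_000_000_000)   # 1e12 seconds
print(million / 86_400)                  # about 11.6 days
print(billion / (86_400 * 365 * 100))    # about 317 centuries
```

The million-fold increase from a thousand to a billion nodes becomes a trillion-fold increase in work, which is why faster algorithms and sampling, not faster hardware, are the realistic way forward.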
Practical solutions require improvements in the current generation of graph analysis tools.10 These improvements consist of: faster algorithms, sampling techniques, detail aggregation (zoom-out function), subgraph selection and process dynamics. More importantly, there is a critical need for an integrated analysis environment. The current generation is characterized as "no more than a grab-bag of miscellaneous and largely unrelated tools."11 The next generation of VLGS tools should take the cue from current integrated development environments, such as the open source Eclipse or Microsoft Visual Studio.
The second dilemma is to leverage the strengths of the enterprise data warehouse environment to manage VLGSs. A graph structure with one million nodes and roughly the same number of links will require approximately 500MB for data and indexing, assuming a recursive bill-of- materials design and efficient storage mechanism for sparse matrices. The capability of the database engine should also be leveraged, performing computation as close to the data as possible. SQL statements, views and temporary tables should be used, along with embedding graph algorithms into SQL functions so that selected portions of an adjacency matrix would be directly materialized from the database engine.
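As a rough sketch of pushing computation toward the engine (using the Python standard library's sqlite3 module as a stand-in for the warehouse database; the table and column names are illustrative), even a metric like degree can be computed entirely in SQL from an edge-list table, without extracting the graph:

```python
import sqlite3

# In-memory stand-in for the warehouse: a simple edge-list (adjacency) table.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE link (src INTEGER, dst INTEGER)")
db.executemany("INSERT INTO link VALUES (?, ?)",
               [(0, 1), (0, 2), (0, 3), (1, 2)])

# Degree per node, computed inside the engine: count each endpoint of each link.
degree = dict(db.execute("""
    SELECT node, COUNT(*) FROM (
        SELECT src AS node FROM link
        UNION ALL
        SELECT dst AS node FROM link
    ) GROUP BY node
"""))
print(degree)  # node 0 is the hub with degree 3
```

The same edge-list design stores only the links that exist, which is what makes the sparse-matrix storage estimate above feasible; materializing selected rows of the adjacency matrix is then just another query.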
The third dilemma is the ability to make the analysis actionable. This implies that once we understand the nature and dynamics of associations within enterprise data, managers (working closely with analysts) are able to take quick action to resolve the business problem. In particular, the BI infrastructure around the data warehouse should be leveraged to enable VLGS analyses to drill down to the atomic entities and their detailed data, trigger alerting applications for abnormal business conditions, and publish dashboard indicators monitoring key performance variables.
The story about applying the SW property to analyze enterprise data has just begun!
The situation is similar to that of data mining seven years ago. Data mining was a separate discipline from data warehousing, evolving from fifty years of statistical research. The early data mining applications required the presence of at least one doctorate in statistics. Large extracts were performed upon the data warehouse. Labor-intensive preparation of the data as large sequential files was needed before analysis could begin. Computation was performed in batch runs, generating reams of printed output. Weeks and even months elapsed before any useful results emerged from the effort.
In contrast, data mining today has been integrated into most DW architectures and products. Its usage is another click on the analysis profile, with the appropriate checks and explanations embedded within the tool. Practical applications of data-mining techniques are an expected daily part of many DW operations.
In fact, data mining is synergistic to ALA in several ways. Data mining can statistically uncover the latent associations within the data. Then, ALA can explore the structural implications of those associations. In addition, the infrastructure for data mining within the enterprise data warehouse is similar to that required for the ALA methodology.
By applying the SW concepts to data warehousing via the ALA methodology, we can reap additional benefits from our investment in the enterprise data warehouse. This data is rich in describing the detailed activities of our business. So far, we have not tapped into an understanding of the structural interactions within those activities.
Business dynamics are changing. In the future, competitive advantage may rely more on avoiding disruptions in critical business processes than on unleashing new capabilities. For instance, the reliability and security of accepting an order from a customer may be more important to that customer than some fancy "one-click" feature on the Web site. ALA can surface critical clues as to when our complex business systems may become unstable, responding in unexpected ways to slight fluctuations in market or world dynamics.
1. See http://www.bolder.com/ALA/ for additional references about social network analysis.
2. Newman, M.E.J. "The Structure and Function of Complex Networks," SIAM Review 45, 167-256, 2003 (arXiv:cond-mat/0303516), along with Wasserman et al. Social Network Analysis: Methods and Applications, Cambridge University, November 1994.
3. A related metric is diameter, which is the maximum distance, as occurring between the two most distant nodes.
4. Newman, ibid., Table II. This is an amazing table when you consider the diversity of domains.
5. For directed graphs, there is both an in-degree for incoming links and out-degree for outgoing links.
6. Page, Larry et al. "The PageRank Citation Ranking: Bringing Order to the Web," 1998, http://citeseer.nj.nec.com/page98pagerank.html. Excellent explanations of PageRank (as believed to be implemented by Google) are given at: http://www.iprcom.com/papers/pagerank/index.html and http://www.webworkshop.net/pagerank.html.
7. Newman, ibid., Section VIII. An excellent discussion that covers applications of percolation theory, epidemics (diseases, rumors, computer viruses), network navigation, phase transitions, self-organizing oscillators, and other fun stuff.
8. Huisman, M. and Snijders, T.A.B. "Statistical Analysis of Longitudinal Network Data," July 9, 2003, http://stat.gamma.rug.nl/snijders/chcomp.pdf. The SIENA program is described at http://stat.gamma.rug.nl/snijders/siena.html.
9. Newman, ibid., p. 2.
10. See http://www.bolder.com/ALA/ for a list of tools for graph analysis and visualization.
11. Newman, ibid., p. 47.