You recall some of the great debates of our times: Kennedy vs. Nixon, Internet Explorer vs. Netscape, Java vs. ActiveX and MOLAP vs. ROLAP. One of these debates had a clear winner (Kennedy). The others still rage on, with no winner declared. The MOLAP vs. ROLAP debate, most familiar to those of us who are enabling our organizations to gain insight through business intelligence tools, has been hot for a while. In 1996 Gartner Group declared that ROLAP would be the winner; however, Arbor (with Essbase) and other MOLAP solution providers have been quietly, and very successfully, making the opposing case.
Now a new debate is commencing--a debate that promises to contend for the time and attention of data warehouse and business intelligence managers. This debate centers on whether to do data mining on the whole database (the whole enchilada) or on a sample set of records. The issue arises as data warehouses approach multi-terabyte size and detail is retained for all source records within an enterprise. Technology has now progressed to the point where there is computing power available to mine larger datasets. Where previously we could not mine it all, SMP and MPP solutions now make it possible. That leads some (including vendors of high-performance decision support database engines) to say that sampling becomes less critical, if not irrelevant.
Others, including Gartner, argue that just because mining the whole enchilada can be done does not necessarily mean that it should be done. Gartner contends that successful data mining efforts should focus more on data quality than on the size of the database. Data miners typically spend 60-80 percent of their time addressing data quality issues before they can get down to the task at hand. Data quality in a large heterogeneous data warehouse populated from tens or hundreds of sources is suspect; and, as a result, data cleansing becomes a critical aspect of data mining. In addition, the potential exists within large heterogeneous databases for semantic misalignment of data. "Balance" has the potential to mean ledger balance, average balance, ending balance, collected balance, cycle-to-date balance, net balance, etc. Mining on balance data without reconciling these semantic differences can yield deceptive results. This data may be clean; it just means different things. A practical use of sampling or subsetting based on knowledge of these semantic differences could produce more believable results.
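The "balance" problem can be illustrated with a small sketch. The records, source names, field names and values below are all hypothetical, invented only to show how pooling semantically different balances produces a figure that describes nothing, while subsetting by meaning gives numbers that can be interpreted.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records from two source systems; every value is clean,
# but "balance" means something different in each source.
records = [
    {"source": "core_banking", "balance_type": "ledger", "balance": 1200.0},
    {"source": "core_banking", "balance_type": "ledger", "balance": 800.0},
    {"source": "card_system", "balance_type": "average", "balance": 300.0},
    {"source": "card_system", "balance_type": "average", "balance": 500.0},
]


def mean_balance_by_type(rows):
    """Group rows by the meaning of 'balance' before averaging."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["balance_type"]].append(row["balance"])
    return {kind: mean(values) for kind, values in groups.items()}


# Pooled mean mixes ledger and average balances into one number.
pooled = mean(row["balance"] for row in records)

# Reconciled view: one mean per balance definition.
by_type = mean_balance_by_type(records)

print("pooled:", pooled)
print("by type:", by_type)
```

In practice the hard part is establishing the `balance_type` label for each source in the first place; the sketch assumes that reconciliation work has already been done.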
What about sampling? One side of the argument says that sampling can bias the results of data mining and that mining more data will yield more interesting patterns and relationships among the data. Yet sampling has been used very successfully for years to reduce the record set to a manageable size; statisticians have found ways to mitigate the risks of creating possibly spurious relationships through the process of sampling. In fact, one question to ask when engaging in an internal organizational debate on the subject is whether there is a need for quick turnaround with an answer. Mining the whole enchilada could take some time; sampling can usually obtain results in much less time.
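As a rough illustration of how statisticians quantify the risk that sampling introduces, the sketch below draws a simple random sample from a synthetic population and reports a 95 percent confidence interval for a proportion using the normal approximation. The population, the 5 percent rate, the sample size and the function name are all assumptions made for the example; real mining work would usually need stratified or otherwise more careful designs.

```python
import math
import random


def estimate_proportion(population, sample_size, seed=0):
    """Estimate the share of 1s from a simple random sample.

    Returns the sample proportion and a 95 percent confidence interval
    based on the normal approximation (no finite-population correction).
    """
    rng = random.Random(seed)
    sample = rng.sample(population, sample_size)
    p_hat = sum(sample) / sample_size
    std_err = math.sqrt(p_hat * (1 - p_hat) / sample_size)
    margin = 1.96 * std_err
    return p_hat, (p_hat - margin, p_hat + margin)


# Synthetic "whole enchilada": 200,000 records, 5 percent flagged with 1.
population = [1] * 10_000 + [0] * 190_000
true_rate = sum(population) / len(population)

# Mine a 5 percent sample instead of every record.
p_hat, (low, high) = estimate_proportion(population, sample_size=10_000)

print(f"true rate:       {true_rate:.4f}")
print(f"sample estimate: {p_hat:.4f}")
print(f"95% interval:    {low:.4f} to {high:.4f}")
```

The interval narrows roughly with the square root of the sample size, which is why a modest sample is often enough for common patterns, while rare patterns may still argue for mining more of the data.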
Procedural or mathematical complexity is another point of dispute. Larger datasets can require more data-specific, and thus more complex, procedures to obtain value from them. At least one vendor--Tandem--is putting specialized data manipulation functions required by data mining algorithms into its NonStop SQL/MX database engine, allowing mining against the whole enchilada. This may mitigate the mathematical complexity issue somewhat; however, mining against smaller subsets of data remains a less complex task.
Which side will win the data mining debate? As usual, there is no easy answer. Indeed, there may not be a clear winner for some time to come. Certainly the largest organizations with significant parallel processing resources available to them will want to pursue data mining against the whole enchilada. They will undoubtedly be able to obtain business value by uncovering trends, patterns and relationships that sampling may obscure. However, the cost and complexity of this solution will encourage many organizations to see what can be gained from intelligent, judicious use of clean, semantically reconciled data subsets. We will assuredly see more announcements from vendors claiming expertise in mining the whole enchilada or in mining a sample/subset. These announcements will be thought-provoking. While we yearn for the proverbial silver bullet, our choice, as always, remains situational.