Nominal variables are categories with no implicit order. Sex, race and alive/dead are examples. Ordinal measures add ordering to the categories. An illustration is a customer satisfaction measure that ranges from very dissatisfied to very satisfied, represented by levels 1 to 5. Another illustration is education, with “Not a HS grad”, “HS grad”, “Some college” and “College grad” as levels. If the distances between each level along the scale can be interpreted as equal, the measure is called interval. Examples of interval variables are temperature, IQ and aptitude test scores. Finally, if an interval variable has a real zero point, it is then identified as ratio. Heart rate, revenue, profit, customer value and churn percentage are examples.
It seems logical that analytics would be best served by measurement that provides the most information. So, other things equal, ratio/interval variables should be preferred to ordinal attributes, which in turn should be seen as superior to nominal. But sometimes the analytics tail wags the dog, so interval variables, both predictor and outcome, might be “cut” into ordinal categories to accommodate categorical data analysis, logistic regression, recursive partitioning and other statistical learning techniques. Cut age into 15-year interval categories; break customer value into low, medium, high; translate the range of profit into profitable and unprofitable. In each of these instances, valuable information is lost by going from numeric to categorical. Alas, the practice of converting interval to ordinal or nominal variables is all too common in BI.
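A minimal sketch of what the 15-year age cut throws away. The helper name `cut_age` and the three ages are hypothetical, chosen only to land on the bin edges. After the cut, applicants 14 years apart become identical, while applicants one year apart are treated as different categories.

```python
def cut_age(age, width=15):
    """Map a numeric age to a label such as '30-44'; bins are [k*width, (k+1)*width)."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"

# Hypothetical ages chosen only to illustrate the bin edges.
ages = [30, 44, 45]
bins = [cut_age(a) for a in ages]

for age, label in zip(ages, bins):
    print(age, "->", label)

# 14 years apart, yet indistinguishable after the cut.
same_bin = cut_age(30) == cut_age(44)
# 1 year apart, yet placed in different categories.
split_apart = cut_age(44) != cut_age(45)
```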
In their book, “Data Mining: Practical Machine Learning Tools and Techniques,” authors Ian Witten and Eibe Frank present techniques for “discretizing” interval/ratio variables to accommodate the needs of specific machine learning models that mandate categorical attributes. They contrast “unsupervised” discretization, which categorizes solely on the values of the variable itself, with “supervised” discretization, which uses the dependent variable in tandem and searches for optimal break points with error- or entropy-based algorithms. Seems a lot like data snooping to me.
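The difference between the two approaches can be sketched in a few lines. This is my own simplified illustration, not Witten and Frank's algorithm: the function names and the toy data are assumptions, and the supervised version searches for just one cut by minimizing weighted class entropy.

```python
from math import log2

def equal_width_cuts(values, n_bins):
    """Unsupervised: cut points depend only on the range of the variable."""
    lo, hi = min(values), max(values)
    step = (hi - lo) / n_bins
    return [lo + step * i for i in range(1, n_bins)]

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    result = 0.0
    for cls in set(labels):
        p = labels.count(cls) / n
        result -= p * log2(p)
    return result

def best_entropy_cut(values, labels):
    """Supervised: pick the single cut that minimizes weighted class entropy."""
    pairs = sorted(zip(values, labels))
    xs = [v for v, _ in pairs]
    ys = [l for _, l in pairs]
    best_cut, best_score = None, float("inf")
    for i in range(1, len(xs)):
        if xs[i] == xs[i - 1]:
            continue  # no boundary between identical values
        left, right = ys[:i], ys[i:]
        score = (len(left) * entropy(left) + len(right) * entropy(right)) / len(ys)
        if score < best_score:
            best_cut, best_score = (xs[i - 1] + xs[i]) / 2, score
    return best_cut, best_score

# Hypothetical toy data: the outcome flips from 0 to 1 between x = 2 and x = 3.
x = [1, 2, 3, 4, 5, 10]
y = [0, 0, 1, 1, 1, 1]

unsupervised = equal_width_cuts(x, 2)        # ignores y entirely
supervised, score = best_entropy_cut(x, y)   # tuned to y
print("equal-width cut:", unsupervised, "entropy cut:", supervised)
```

The supervised cut separates the classes perfectly here precisely because it was chosen by looking at the outcome, which is why its apparent fit should be checked on data that played no part in choosing the cut.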
A good illustration of “dumbing down” interval/ratio variables to ordinal comes from the Association of American Medical Colleges (AAMC) data on medical school admissions that I blogged about a few weeks back. The AAMC presents medical school admissions by college GPA and MCAT (Medical College Admission Test) scores, cutting the interval MCAT score into 10 ordinal categories and the ratio GPA into 11 to create the presentable table of applicants/admits. The cuts were surely made to present the entirety of the data on a single page.
An inspection of the table from bottom left to top right shows, not surprisingly, that the likelihood of admission to medical school rises sharply with both MCAT score and GPA. Geek that I am, I scraped the figures into an R data set and performed a logistic regression, confirming statistically what I'd surmised from the table: both MCAT score and GPA have strong positive relationships with the probability of admission to medical school. I'd sure love to see predictive significance like I found here in some of the BI tests I perform!
Even with the powerful established relationships, though, there's much being left on the prediction “table” by using ordinal category cuts rather than the original interval measurements of MCAT score and GPA. Consider the cell of 3.40-3.59 GPA and 27-29 MCAT, with a 38.6 percent acceptance rate, and the cell of 3.60-3.79 GPA and 30-32 MCAT, with 73.2 percent acceptance. Each cell summarizes admission performance across a range of MCAT scores and GPAs. It's almost certainly the case that the likelihood of acceptance for an applicant with a 3.4 GPA and 27 MCAT is less than .386, while the chances of those with a 3.79 GPA and 32 MCAT are greater than .732. But by discretizing the data, we've thrown away information that could provide more precise predictions.
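To make the within-cell argument concrete, here is a hedged sketch. The logistic coefficients below are invented for illustration and are not estimates from the AAMC table; the only property that matters is that both slopes are positive, as the table suggests. Under any such model, the bottom corner of a cell sits below the cell's middle and the top corner sits above it.

```python
from math import exp

# ASSUMPTION: illustrative coefficients only, not fitted to the AAMC data.
INTERCEPT, B_GPA, B_MCAT = -20.0, 3.0, 0.3

def p_admit(gpa, mcat):
    """Logistic probability of admission under the assumed coefficients."""
    z = INTERCEPT + B_GPA * gpa + B_MCAT * mcat
    return 1.0 / (1.0 + exp(-z))

# Bottom corner, midpoint and top corner of the 3.40-3.59 GPA / 27-29 MCAT cell.
low = p_admit(3.40, 27)
mid = p_admit(3.495, 28)
high = p_admit(3.59, 29)
print(f"low={low:.3f} mid={mid:.3f} high={high:.3f}")
```

A single cell-level rate cannot distinguish these three applicants; a model on the raw scores can.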
Simply having raw MCAT score and GPA with an accept/reject designation would be vastly better for prediction than the cuts given in the report. The categories show us black and white, when reality is gray. In addition, using the ordinal-level MCAT and GPA categories compromises the ability to investigate more sophisticated non-linear relationships between the predictors and outcome. Finally, nominal or ordinal variables are more statistically costly, consuming scarce “degrees of freedom.” You pay more and get less!
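The degrees-of-freedom cost is simple arithmetic. The sketch below assumes the categories are entered as indicator (dummy) variables in a main-effects model with no interactions; the helper name is mine.

```python
def main_effect_df(n_levels):
    """A categorical predictor with k levels needs k - 1 indicator terms."""
    return n_levels - 1

mcat_levels, gpa_levels = 10, 11  # category counts from the AAMC table
categorical_df = main_effect_df(mcat_levels) + main_effect_df(gpa_levels)
linear_df = 2  # one slope each for raw MCAT and raw GPA

print(categorical_df, "vs", linear_df)  # 19 vs 2
```

Nineteen parameters buy a step function; two buy a smooth slope, leaving room to spend a few more on non-linear terms if the data call for them.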
Vanderbilt professor and R elder Frank Harrell isn't timid in expressing his thoughts on the practice of discretizing interval measures – both predictor and outcome – in his field of biostatistics. Frank's keynote for the annual international R user group conference (useR! 2010), Information Allergy, pulls no punches. Information Allergy “is defined as (1) refusing to obtain key information needed to make a sound decision, or (2) ignoring important available information.”
Frank discusses a variety of data and analytics no-no's for the medical sciences, many of which have to do with analysts failing to use all the information they're given. His position on “discretizing” numeric attributes is perhaps best summarized by the heading of two of his presentation slides: “Cutpoints are Disasters.” He notes that the use of cutpoints to estimate the high:low effects of risk factors “results in inaccurate predictions, residual confounding, and are impossible to interpret.” Almost comically, Frank cites a research paper that demonstrates with examples that “Cutpoints may be found that result in both increasing and decreasing relationships with any dataset with zero correlation.” In other words, analysts can find support for both positive and negative cuts of the data – even when the data are generated randomly!
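The cutpoint-hunting problem is easy to reproduce. The sketch below is my own illustration, not the method of the paper Frank cites: it generates an outcome independently of the predictor, tries every possible cutpoint, and reports the most favorable "high minus low" difference in each direction. The seed and sample size are arbitrary.

```python
import random

def cutpoint_effects(x, y):
    """For each possible cutpoint, return (cut, mean of y above - mean of y at or below)."""
    pairs = sorted(zip(x, y))
    effects = []
    for i in range(1, len(pairs)):
        if pairs[i][0] == pairs[i - 1][0]:
            continue  # cannot cut between tied x values
        low = [v for _, v in pairs[:i]]
        high = [v for _, v in pairs[i:]]
        cut = (pairs[i - 1][0] + pairs[i][0]) / 2
        effects.append((cut, sum(high) / len(high) - sum(low) / len(low)))
    return effects

# Pure noise: y is generated independently of x (the seed is arbitrary).
rng = random.Random(0)
x = [rng.random() for _ in range(200)]
y = [rng.gauss(0, 1) for _ in range(200)]

effects = cutpoint_effects(x, y)
most_positive = max(effects, key=lambda e: e[1])
most_negative = min(effects, key=lambda e: e[1])
print("largest 'increasing' effect:", most_positive)
print("largest 'decreasing' effect:", most_negative)
```

With noise like this, the search will typically turn up cutpoints pointing in both directions, and an analyst free to choose the cut after seeing the data can report whichever story is wanted.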
Lesson for BI: Use all the measurement info you can. Be wary of dumbing down both explanatory and outcome variables to suit a specific analytical method.