Most data-mining algorithms struggle when dealing with rarity—yet rarity is a hallmark of fundamental business and research applications of data science.

Data scientists, therefore, must understand the challenges of rare cases and rare classes of data, and be prepared to address those challenges through appropriate data-sampling and other corrective techniques.

Just as important, they must be prepared to communicate the findings of this appropriately adjusted analysis to their audience, at all levels of the organization including the C-suite.

Without the context of why these corrective techniques extend insight into previously underexplored or incompletely understood subsets of your data, your findings might be dismissed as mere outlying anomalies—when in fact they are the very outlying and/or previously hidden opportunities you’re being asked to identify.

Rarity and unbalanced data

As previously noted by Weiss, rarity is often correlated with being of interest. His observation that “[i]n life it is often the rare objects that are most interesting” still serves us well—inviting perhaps the complementary rejoinder that, in life, interesting is often a synonym for complex.

Real-world examples of these rare and interesting objects include fraudulent credit card transactions (which represent a fraction of total transactions), telecommunications systems failures (of which there have been relatively few), patients who are readmitted to the hospital after being discharged with a specific diagnosis (when most patients who receive that diagnosis are not readmitted), and even uncommon words or pronunciations of words in the context of natural-language processing.

The core challenge in all of these is dealing with unbalanced data. For the purposes of this article, unbalanced data can be understood to mean:

• The object of interest (the rare object you’re trying to predict) occurs very infrequently relative to the alternate objects (the common objects that make up the overwhelming majority of your dataset).

• Object of Interest = Class 1; Alternate Objects = Class 0

(The frequency of Class 1 in the data is many, many times less than the frequency of Class 0 in the data.)

Unbalanced data is problematic in the context of predictive modeling because predictive-modeling algorithms typically strive to maximize accuracy. If 2% of your dataset consists of the interesting (rare) object and 98% consists of the uninteresting (common) object, your model could identify every object as uninteresting and be correct 98% of the time. That’s a pretty good accuracy rate—but it’s functionally useless in the context of rarity since this simple model will label everything as uninteresting.
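The accuracy trap is easy to demonstrate. Here is a minimal sketch, using made-up numbers that mirror the 2%/98% split above, of a degenerate "model" that labels everything as the common class:

```python
# Toy dataset mirroring the 2% / 98% split described above:
# Class 1 = the rare object of interest, Class 0 = everything else.
labels = [1] * 20 + [0] * 980          # 1,000 records, 20 "interesting"

# A degenerate "model" that predicts the common class every time.
predictions = [0] * len(labels)

correct = sum(p == y for p, y in zip(predictions, labels))
accuracy = correct / len(labels)
print(f"Accuracy: {accuracy:.0%}")     # 98% -- looks impressive on paper

# ...yet it never identifies a single rare case.
found = sum(1 for p, y in zip(predictions, labels) if p == y == 1)
print(f"Rare cases found: {found} of {sum(labels)}")   # 0 of 20
```

Judged by accuracy alone, this model looks excellent; judged by its ability to find the object of interest, it is worthless.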

Now you begin to see the problem.

Example application: Hospital readmissions

Since I currently work in healthcare, I’ll explore the example of hospital readmissions. First, some context:

The movement in American healthcare is toward a value-based model whose backbone is risk-scoring/risk-stratification. These risk models are focused on managing distinct patient sub-populations—ensuring the sickest patients don’t get sicker, helping patients who are most at risk of getting sick (or getting sicker) take the best possible steps to improve or stabilize their health, and creating strategies that keep healthy patients healthy while heading off risk down the road as they potentially shift to a less healthy state.

The financial underpinning of value-based care is that it ties incentive to cost, as well as to patient outcomes. It links physician reimbursement to patient healthcare utilization—with the goal of encouraging patients, where appropriate, to utilize lower-cost preventative care or primary care facilities vs. an ER or urgent-care setting. Hospitals are now financially penalized for unplanned or unnecessary readmissions, and the reasoning behind this spans the goals of value-based care. Readmissions drive up costs, increase risk of a hospital-acquired infection, and represent a lower quality of care because by definition they are considered an undesirable outcome that could have been avoided.

Further compounding these challenges for healthcare data scientists is that the determination as to whether a readmission is unplanned or unnecessary is based on the diagnoses and procedures documented in the patient’s healthcare data. These may be subject to human data-entry error or omission and/or be fragmented across different healthcare providers or non-interoperable IT systems. The relevant information may exist in non-standardized or unstructured form, making the objects of interest rarer still because they may exist as non-uniform iterations within an already rare class of data.

Here’s how I addressed those challenges, using strategies you can apply to your own work with unbalanced data:

• Adjusted the algorithm parameters to identify (or at least attempt to identify) the rare cases (e.g., lowered the threshold for the frequency of a given relevant data point or set of data points within a given class of data).

• Implemented corrective data sampling to create a more balanced dataset (i.e., created training datasets in which the proportion of rare cases is large enough that they can be identified).

• Instructed the predictive-modeling algorithm to minimize cost (a measure of how undesirable each outcome is) rather than to maximize first-pass accuracy (i.e., built in an understanding of relative value that mirrors the risk of under-identifying the object of interest; one that accurately reflects the true cost of a false negative vs. a false positive). Another term for this strategy is “cost-sensitive learning.”
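To make the second and third strategies concrete, here’s a minimal Python sketch. The dataset, the roughly 1:4 sampling target, and the 10:1 false-negative-to-false-positive cost ratio are all hypothetical, chosen only to illustrate the mechanics:

```python
import random

random.seed(42)

# Hypothetical unbalanced training set: 2% readmissions (Class 1).
data = [(i, 1) for i in range(20)] + [(i, 0) for i in range(980)]
rare = [r for r in data if r[1] == 1]
common = [r for r in data if r[1] == 0]

# --- Corrective sampling ---
# Oversample the rare class until it makes up a workable share of the
# training data (here, roughly 1 in 5 records instead of 1 in 50).
oversampled_rare = rare * (len(common) // (4 * len(rare)))
balanced = oversampled_rare + common
random.shuffle(balanced)
rare_share = sum(1 for _, y in balanced if y == 1) / len(balanced)

# --- Cost-sensitive learning (expressed as a decision threshold) ---
# Hypothetical costs: a missed readmission (false negative) is taken
# to be 10x as costly as a needless outreach call (false positive).
COST_FN, COST_FP = 10.0, 1.0
# Flag a patient whenever p * COST_FN > (1 - p) * COST_FP, which works
# out to a threshold of COST_FP / (COST_FP + COST_FN) instead of 0.5.
threshold = COST_FP / (COST_FP + COST_FN)

def flag_for_outreach(p_readmit: float) -> bool:
    return p_readmit > threshold

print(f"Rare share after sampling: {rare_share:.0%}")   # ~20%
print(f"Cost-sensitive threshold:  {threshold:.2f}")    # 0.09
print(flag_for_outreach(0.20))   # True -- flagged, though well under 0.5
```

A patient scored at 20% readmission risk would be missed by the default 0.5 cutoff but flagged under the cost-sensitive rule, which is exactly the point: the decision boundary should reflect the asymmetric cost of the outcomes, not raw accuracy.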

The absence of strategies to address unbalanced data inevitably leads to over-identification of common-class data and will hamstring your efforts to understand (or even become aware of) meaningful patterns across rare cases or within rare classes of data. Conversely, correcting for and mitigating the impact of unbalanced data will make you more effective, sooner. You’ll find the true niches in rare- and common-class data and uncover the most interesting objects hidden therein, and the validated models you construct will identify those interesting instances within new, previously unseen data.

Returning to the hospital readmission example: by applying strategies that effectively address unbalanced data, you can identify patients likely to readmit before they actually do. Healthcare providers, armed with this knowledge, can then intervene and take steps to head off the readmission, resulting in lower cost to the overall system and improved patient health. And of course, patients are happier with this approach too. After all, who wouldn’t prefer to avoid an unnecessary stay in the hospital?

(About the author: Dr. Paul Bradley is chief data scientist at ZirMed)