I'm not sure whether I should balance my data before building a predictive model such as a logistic regression or a decision tree. Imagine you have 100 cases (customers): 20 answered your first direct mail and 80 did not. For your next direct mail, you want to select only profitable customers (those who may answer your mail) using binary logistic regression or perhaps a decision tree. The dependent variable is "answer" vs. "no answer." Do you use your sample as it is (20 customers vs. 80 customers), or do you first balance it by taking all 20 customers who answered your first direct mail and a random sample of 20 customers who did not?


Clay Rehm’s Answer: This may sound callous, but 80 of your customers may not have answered the direct mail because it was so terrible that they threw it away. Or they were so busy that, by the time they were going to answer, the deadline had passed. So, as to the question of balance, I recommend you first understand the business requirements and how the marketing plan was carried out. You could assume that whoever did not answer the direct mail was too busy. And if you are too busy, a safe assumption is that you are busy because you are successful, and if you are successful, you are profitable. You will also need to find out whether all 100 cases are existing customers or potential customers and what warranted contacting them by direct mail.
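
For readers who want to see what the balancing step in the question amounts to mechanically, here is a minimal Python sketch. It is an illustration only, not part of the answer above and not a recommendation either way. The function name `balance_by_undersampling`, the `answered` field, and the 100 made-up customer records are all assumptions introduced for the example.

```python
import random


def balance_by_undersampling(cases, label_key="answered", seed=0):
    """Return a sample with equally many responders and non-responders.

    The smaller group is kept in full; the larger group is randomly
    sampled down to the same size. All names here are hypothetical.
    """
    responders = [c for c in cases if c[label_key]]
    non_responders = [c for c in cases if not c[label_key]]
    k = min(len(responders), len(non_responders))
    rng = random.Random(seed)  # fixed seed so the draw is repeatable
    return rng.sample(responders, k) + rng.sample(non_responders, k)


# Hypothetical data matching the question: 20 answered, 80 did not.
cases = [{"customer_id": i, "answered": i < 20} for i in range(100)]

unbalanced = cases                            # option 1: use the sample as it is
balanced = balance_by_undersampling(cases)    # option 2: 20 vs. 20

print(len(unbalanced), len(balanced))
```

Either list could then be passed to whatever modeling tool is used for the logistic regression or decision tree. The sketch only shows how the two candidate samples differ in size and composition.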