I had a nice conversation with an old stats colleague the other day. He's still a SAS analyst, while I've mostly migrated to R, though I like the SAS clone WPS a lot as a data science platform. We reminisced about the statistics of the '80s and '90s and pre-data-warehouse analytics, a time when SAS was not only our statistical package of choice but also a primary data integration programming platform. The discussion turned to data exploration, which occupies much of our data science effort today. Fortunately, the tools we have to explore data have advanced considerably over the years.

My friend and I agreed that the first priority with a new data set is determining the distribution of values for each attribute. Initially, we wish to see frequencies for the responses of each variable. Those give us a general sense of the data, its distribution and its quality. For categorical attributes, we prefer to visualize frequencies sorted from most to least in an unadorned graphic; for numeric attributes that assume many different values, we like histograms, and perhaps even the more sophisticated kernel density plots, to detail the shape of the data. In the end, we decided to share the tools in our chest for showcasing data distribution: mine in R, his in SAS.
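To make those two first looks concrete, here is a minimal sketch in plain Python rather than the R or SAS tools we actually use. The function names and the sample values are hypothetical, made up purely for illustration. It tallies a categorical attribute from most to least frequent and bins a numeric attribute into equal-width histogram counts; a kernel density plot would need a plotting or statistics library and is left out.

```python
from collections import Counter


def sorted_frequencies(values):
    """Return (value, count) pairs ordered from most to least frequent."""
    return Counter(values).most_common()


def histogram_counts(values, bins=5):
    """Count how many values fall into each of `bins` equal-width bins."""
    lo, hi = min(values), max(values)
    # Fall back to a width of 1.0 when every value is identical.
    width = (hi - lo) / bins or 1.0
    counts = [0] * bins
    for v in values:
        # The maximum value lands exactly on the upper edge; keep it in the last bin.
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    return counts


# Hypothetical sample data, invented for this illustration.
colors = ["red", "blue", "red", "green", "red", "blue"]
ages = [21, 22, 25, 31, 34, 35, 38, 52, 60, 61]

for value, count in sorted_frequencies(colors):
    print(f"{value:<6} {'#' * count}")

for i, count in enumerate(histogram_counts(ages, bins=4)):
    print(f"bin {i}: {'#' * count}")
```

The text bars stand in for the unadorned graphic: nothing but the ordering and the relative heights, which is usually enough to spot a dominant category, an empty range, or a suspicious value.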
