Cheat Sheets for Data Science

I received an email a few weeks back from a company called DataCamp announcing a comprehensive cheat sheet for the R data.table package. Knowing I'm a data.table proselyte, the co-founder asked if I'd take a look and offer suggestions – and perhaps, I suspect, mention it in my blog.

I'm a big R fan and have been an enthusiastic data.table devotee since I discovered it several years ago. I'm also in frequent contact with package originator Matthew Dowle, and agree with more than a few in the R community that data.table's been a game-changer for elevating R's appeal to data scientists.

The DataCamp cheat sheet's well done and quite handy, covering most functionality in a single page. It certainly helps that the package serves functions – in this case data management, access, grouping and summarization – that are limited in scope. Still, the URL to the page is front and center on my notebook.

Ten years ago, I carried multiple cheat sheets in my backpack. Now, with all the support available in flexibly-queryable online documentation and support sites like Stack Overflow, the need for cheat sheets isn't as acute. I maintain, however, close ties with a few.

The ubiquitous R Reference Card by Tom Short has been my trusty companion for ten years. I'm impressed with how the author managed to capture the guts of “core R” in such a compact space. Every time I pull it out, I refresh my memory on R minutiae I'd either forgotten – or never knew. Of course, the R ecosystem's now so large that you almost need cheat sheets for cheat sheets. R Task Views to the rescue. For graphics, there are cheat sheets for R's ggplot implementation that are also pertinent to the Python library of the same name.

Before I started with first S and then R about 14 years ago, my platform of choice for data management and statistical analysis was SAS. Now, though SAS is a much smaller arrow in my DS quiver, I nonetheless use it and its language clone WPS semi-regularly. The timeless SAS Cheat Sheet by David Franklin makes it much easier for me to re-connect the SAS dots. Once this gets me back in the game, I simply Google my questions for the online doc answers. One change in my SAS arsenal not reflected in Franklin's page is Proc SQL, which I prefer to data step programming.

Just as I go back and forth between R and SAS for statistical analysis, so also do I migrate between Python and R for more comprehensive data science needs. And Python Basics provides a quick refresher on important basic syntax to get me going. One lament is that Python dictionaries aren't covered, but for those I just pull up archived code.
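For what it's worth, the dictionary idioms I most often have to look up fit in a few lines. The snippet below is only a minimal sketch, not something taken from the Python Basics sheet; the `records` data and all variable names are invented for illustration.

```python
# Hypothetical sample data: (group, value) pairs invented for this sketch.
records = [("a", 3), ("b", 5), ("a", 4), ("c", 1), ("b", 6)]

# Accumulate a total per group; dict.get supplies 0 for a key not seen yet.
totals = {}
for group, value in records:
    totals[group] = totals.get(group, 0) + value

# Dict comprehension: build a new dictionary from an existing one.
doubled = {group: total * 2 for group, total in totals.items()}

# Sort the (key, value) pairs by value, largest first.
ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)

print(totals)
print(ranked)
```

The same accumulate-by-key pattern is what grouping and summarization amount to once the data no longer fits comfortably in a plain dictionary.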

As I've noted in previous blogs, Python programming for analytics/data science has changed considerably in recent years with the ascent of community-provided libraries such as numpy, scipy, scikit-learn and pandas. Given the array-orientation promoted by those packages, Python DS code can look as much like R as it does core Python.
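To make the array-orientation point concrete, here is a small sketch that computes the same thing twice, once in core Python and once with numpy. It assumes numpy is installed; the `prices` and `quantities` values are made up for the example.

```python
import numpy as np

# Hypothetical data invented for this sketch.
prices = [10.0, 12.5, 8.0, 20.0]
quantities = [3, 1, 4, 2]

# Core Python: element-by-element work in an explicit comprehension.
revenue_loop = [p * q for p, q in zip(prices, quantities)]

# Array-oriented: whole-vector arithmetic, much as in R.
p = np.array(prices)
q = np.array(quantities)
revenue_vec = p * q

# Boolean-mask filtering, comparable to R's revenue[revenue > 30].
large = revenue_vec[revenue_vec > 30]

print(revenue_loop)
print(revenue_vec)
print(large)
```

The vectorized half reads almost line for line like the equivalent R, which is why code built on these libraries can feel closer to R than to core Python.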

And with these libraries now a cornerstone of analytics work in Python, it's not surprising there are cheat sheets for them as well. A spartan but still quite useful reference for numpy, scipy and pandas is available here. I use the splendid 10 Minutes to pandas IPython notebook at least every other month. Indeed, the pandas documentation is so good that I now hardly ever open originator Wes McKinney's well-written book, Python for Data Analysis.

Finally, I maintain my scikit-learn sanity by calling on Peekaboo's machine learning reference. Can't tell either the R or scikit ML players without a scorecard.

These cheat sheets are just the tip of the DS iceberg, a small, select sampling of what's available for data science. I welcome reader input on others they find useful for their data management and analytics work.
