On Monday, I made my way to San Jose for Strata-Hadoop World. I've been to all six Stratas in Silicon Valley, the first four in Santa Clara. The hardcore sessions/keynotes are Wednesday-Thursday, but over the years I've generally preferred the Tuesday tutorials, which are less crowded as well as better learning vehicles. Last year's all-day presentations on academic machine learning left my head spinning; several years ago it was eight hours of Spark that made me a big data computation proselyte.
This year I decided on an all-day, hands-on tutorial for Tuesday, with the choice a tough call between PyData and R. In the end I chose PyData at Strata, since I'm somewhat more advanced in R and wanted exposure to the Bokeh interactive visualization platform.
Most data scientists are familiar with PyData as "a gathering of users and developers of data analysis tools in Python... We aim to be an accessible, community-driven conference, with tutorials for novices, advanced topical workshops for practitioners, and opportunities for package developers and users to meet in person."
The organization supports multiple worldwide developer gatherings each year, and takes a lead in supporting an open source data science platform revolving around Python. NumFOCUS, "a 501(c)(3) nonprofit that supports and promotes world-class, innovative, open source scientific software," was also instrumental in delivering the tutorial.
PyData at Strata was excellent top to bottom. The 9-5 timeframe was entirely allocated to data science with Python, covering all the DS bases with data management in Pandas, visualization in Bokeh, and machine learning in scikit-learn. The instructors, TJ Alumbaugh for Pandas, Bryan Van de Ven and Sarah Bird for Bokeh, and Jake Vanderplas for scikit, were experts on their topics as well as solid presenters.
What made the day, though, was each instructor's strategic use of Jupyter notebooks as training materials. The self-contained notebooks combine markdown that renders like PowerPoint slides with code and its resulting output for live demos. Add accommodation for exercises that test and extend student understanding, and you have the basis for a terrific, self-contained teaching platform. Students like me can also revisit the notebooks to pick up the finer points of the lessons and use them as reference guides. I'd begun using Jupyter and markdown for technical blogs and training materials with both Python and R before the tutorial, and will certainly double down now.
I consider myself at an intermediate level with Pandas, but left the tutorial with a more nuanced understanding of several areas, including series, hierarchical indexes, and how Pandas works with NumPy and Matplotlib. During session breaks, I even started editing several of my old notebooks. I also learned a few useful Pandas options/defaults and how to better maneuver around Jupyter.
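To make those Pandas features concrete, here's a minimal sketch (with made-up sales data) of a hierarchically indexed Series and the NumPy interoperation mentioned above:

```python
import numpy as np
import pandas as pd

# Hypothetical data: unit sales by region and year, organized with a
# hierarchical (MultiIndex) index.
idx = pd.MultiIndex.from_product(
    [["east", "west"], [2015, 2016]], names=["region", "year"])
sales = pd.Series([10, 12, 9, 14], index=idx, name="units")

# Partial indexing on the outer level returns the inner slice:
east = sales.loc["east"]  # a Series indexed by year

# Pandas objects wrap NumPy arrays, so NumPy ufuncs apply directly,
# preserving the index:
log_sales = np.log(sales)
```

The same MultiIndex machinery extends to DataFrames, where `.loc` slicing and `unstack()` move between long and wide layouts.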
I'd played with browser-based Bokeh a bit, with modest success, but hadn't appreciated its full power. The instructors presented the platform at three progressive levels of complexity: charting, plotting, and modeling. Charting provides basic interactive capabilities such as scatter, histogram, dot plot and line plot in single-statement invocations. Great for basic visual exploration.
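A quick exploratory chart of this kind takes only a couple of statements. This sketch uses the `bokeh.plotting` interface with made-up points; the tool list and sizing are illustrative:

```python
from bokeh.plotting import figure

# A one-glyph interactive scatter: pan, zoom, and reset come along
# for free via the toolbar.
p = figure(title="Quick look", tools="pan,box_zoom,reset")
p.scatter([1, 2, 3, 4], [4, 7, 2, 5], size=10)

# show(p) would render the interactive HTML page in a browser.
```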
Plotting brings much more granular control of appearance and interactivity with detailed function calls. I especially like the ability to link multiple graphs by sharing data sources. For R programmers, I'd contrast Bokeh's charting/plotting with ggplot2's qplot and ggplot function calls.
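The linked-graph idea can be sketched by pointing two plots at one `ColumnDataSource`: a box-select in either plot then highlights the same rows in the other. Data and column names here are illustrative:

```python
from bokeh.layouts import gridplot
from bokeh.models import ColumnDataSource
from bokeh.plotting import figure

# One shared data source drives both plots, so selections are linked.
source = ColumnDataSource(data=dict(x=[1, 2, 3], y=[4, 5, 6], z=[7, 2, 9]))

tools = "box_select,reset"

left = figure(tools=tools)
left.scatter("x", "y", source=source)

right = figure(tools=tools)
right.scatter("x", "z", source=source)

# Lay the linked plots out side by side.
grid = gridplot([[left, right]])
```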
Models and the Bokeh Server are tools for building sophisticated web-based analytics apps. I now see the tandem of Python/Bokeh as a formidable challenger to R/Shiny.
scikit-learn can't compare to R in the volume of machine learning models accessible to developers. Where it does shine, though, is in its consistent, elegant object-oriented programming interface. Working with scikit has the feel to me of working with the R caret (Classification and Regression Training) package that attempts to impose order on wide disparities in R ML package interfaces. An academic astronomer, the instructor was clearly on the machine learning side of the statistics-ML divide – and I considered that a good thing.
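That consistent interface is easy to show: every scikit-learn estimator follows the same construct/fit/predict pattern, so swapping models is a one-line change. A minimal sketch on the bundled iris data (the specific model and parameters are just for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The uniform estimator API: construct, fit, then predict/score.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
acc = model.score(X_test, y_test)
```

Replace `LogisticRegression` with, say, a random forest or an SVM and the surrounding code doesn't change, which is the order that caret tries to impose on R's far larger but less uniform package ecosystem.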
Well, I'm off to the keynotes/sessions Wednesday. I can only hope they measure up to tutorial Tuesday.