I just completed my annual computer “review”. During that period I revisit code written over the years in languages like SAS, R, Python and Ruby to see if and how it’s evolved over time. I may also re-learn something I’ve forgotten.
I adopted Python as a data programming language about twelve years ago on the recommendation of several colleagues, finally pulling the plug on my then-beloved Perl and investing in what I hoped would be a more scalable, agile development platform. It turned out to be a good choice.
As I started learning Python, I often worked with stock portfolio indexes generally available on the Internet as sources of data to help me learn the concepts. I’d read the data into Python structures, wrangle it to suit my needs, do some computations, and then either graph the results using the matplotlib package or send a file to R for further visual scrutiny with lattice. What’s interesting is how that code changed over the years as I both learned more and was exposed to new capabilities from the growing Python ecosystem.
Consider the following data from Russell Investments. Readily assembled is a list of day-ending values of the overall-market-representing Russell 3000 index, starting on 1995-06-01 and ending on 2013-12-16. In base Python, one way to represent this data is with two lists, one of trading dates and one of index values. They would look something like this:
rdates = ["1995-06-01", "1995-06-02", "1995-06-03", ..., "2013-12-12", "2013-12-13", "2013-12-16"]
r3000idx = [1034.42, 1034.56, 1041.21, ..., 5063.09, 5066.93, 5101.00]
The index values are generally of little interest in themselves. Rather, it’s the daily percentage changes that take investors on the psychological roller coaster. The percent change for day t is computed from the index values as 100*(r3000idx[t]/r3000idx[t-1] - 1). There are 4677 such percent changes for the 4678 index values ending 2013-12-16.
My first code for computing the percent changes, circa 2002, used traditional looping and list append as follows:
r3000pct = []
for i in range(1, len(r3000idx)):
    r3000pct.append(100*(r3000idx[i]/r3000idx[i-1]-1))
Simple enough, but when I revisited that program several years later, I’d become conversant with the Python “list comprehension”, which represents a more mathematical, functional programming way to construct a list from other lists. The code is similar to the above but more elegant and compact and, because a comprehension is an expression, its result can be assigned directly:
r3000pct = [100*(r3000idx[i]/r3000idx[i-1]-1) for i in range(1,len(r3000idx))]
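As a quick check that the loop and the comprehension agree, here is a minimal sketch that runs both on only the three leading index values quoted earlier. The names `sample_idx`, `pct_loop` and `pct_comprehension` are illustrative and not taken from the original programs.

```python
# First three index values quoted in the article; the full series is not reproduced here.
sample_idx = [1034.42, 1034.56, 1041.21]


def pct_loop(values):
    """Percent changes via an explicit loop and list append."""
    out = []
    for i in range(1, len(values)):
        out.append(100 * (values[i] / values[i - 1] - 1))
    return out


def pct_comprehension(values):
    """The same computation written as a list comprehension."""
    return [100 * (values[i] / values[i - 1] - 1) for i in range(1, len(values))]


loop_result = pct_loop(sample_idx)
comp_result = pct_comprehension(sample_idx)

# n index values yield n - 1 percent changes, and both forms give identical results.
assert len(loop_result) == len(sample_idx) - 1
assert loop_result == comp_result
print([round(p, 4) for p in comp_result])
```

Both versions evaluate the same arithmetic in the same order, so the results match exactly rather than only approximately.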
A big computational programming boost for me came when I started working with the Python numerical library numpy three years ago. Among numpy’s many productivity benefits is a powerful multi-dimensional array construct with methods that operate on entire structures at once. So rather than “loop” through the individual list items, numpy, like R, supports whole array operations. For our example, once the numpy library is imported, the following array statement suffices:
r3000pct = 100*numpy.diff(r3000idx)/r3000idx[:-1]
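To see why the array statement matches the element-by-element formula, the sketch below applies it to the three leading index values quoted earlier and cross-checks the result against a list comprehension. It assumes numpy is installed; `sample_idx`, `pct_array` and `pct_list` are illustrative names.

```python
import numpy as np

# First three index values quoted in the article, used as a small stand-in series.
sample_idx = np.array([1034.42, 1034.56, 1041.21])

# np.diff gives x[t] - x[t-1]; dividing by the lagged values x[:-1] yields
# (x[t] - x[t-1]) / x[t-1], which equals x[t] / x[t-1] - 1 for every t at once.
pct_array = 100 * np.diff(sample_idx) / sample_idx[:-1]

# Cross-check against the element-by-element list comprehension.
pct_list = [100 * (sample_idx[i] / sample_idx[i - 1] - 1)
            for i in range(1, len(sample_idx))]

assert pct_array.shape == (len(sample_idx) - 1,)
assert np.allclose(pct_array, pct_list)
```

Converting the list to an array up front keeps both operands of the division as numpy arrays, which makes the whole-array intent explicit.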
About a year and a half ago, I started working with pandas, “an open source, BSD-licensed library providing high-performance, easy-to-use data structures and analysis tools for the Python programming language ... Python has long been great for data munging and preparation, but less so for data analysis and modeling. pandas helps fill this gap, enabling you to carry out your entire data analysis workflow in Python”.
Indeed, pandas is now ubiquitous in the Python data science world, its structures fast becoming the standard for computation. Built on top of numpy, pandas brings the many benefits of object orientation to the programming environment. The pandas code for the percent changes once the library’s imported? Just invoke the pct_change method from a Series object:
r3000pct = 100*pandas.Series(r3000idx).pct_change()
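One detail worth knowing is that pct_change keeps the length of the input and puts NaN in the first position, since the first value has no predecessor. The sketch below, which assumes pandas is installed and uses only the three leading dates and values quoted earlier, shows this along with a date index; `sample_dates`, `series` and `changes` are illustrative names.

```python
import pandas as pd

# First three dates and index values quoted in the article (a stand-in sample).
sample_dates = ["1995-06-01", "1995-06-02", "1995-06-03"]
sample_idx = [1034.42, 1034.56, 1041.21]

# Index the values by date so each percent change stays aligned with its day.
series = pd.Series(sample_idx, index=pd.to_datetime(sample_dates))

# pct_change returns fractional changes; multiply by 100 for percentages.
pct = 100 * series.pct_change()

assert len(pct) == len(series)          # same length as the input
assert pd.isna(pct.iloc[0])             # first entry has no predecessor
changes = pct.dropna()                  # the n - 1 actual percent changes
assert len(changes) == len(series) - 1
print(changes)
```

Dropping the leading NaN gives the same number of changes as the loop, comprehension and numpy versions, with each one labeled by its date.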
Not surprisingly, my Python programming motif has changed considerably over the years as I’ve adopted these productive infrastructures built on top of base Python. Much like with R, the community-developed libraries enhance the base language immeasurably and change programming style as well. Arrays, iterators and methods supplant lists, loops and functions.
Why is this important for data science programmers? There are at least two reasons. First, as with R, the proliferation of community-developed packages validates the Python platform. Python’s not going away and is in fact growing rapidly, fast becoming a data science language of choice.
Second, the productivity benefits of foundational libraries like numpy and pandas are significant. Now with Python, as also with R, one of a programmer’s most important tasks is to search for readily available packages germane to her work before starting to develop programs. Better to use proven, productive, reusable code than to program from scratch.
And this is good news for data-driven business computation.