I quite enjoyed the recent O’Reilly Webcast “Python for Data Analysis” by Wes McKinney. Besides learning about IPython, the useful interactive Python shell, I was introduced to pandas, a powerful package that sits on top of NumPy and matplotlib and adds a wealth of easily accessible “munging” and data analysis capabilities for Python programmers. McKinney was an excellent presenter, and I can’t wait to get his book this week at the Strata + Hadoop World conference in New York City.

A collateral benefit of the Webcast was exposure to several “munge-friendly” folders of data files on historical baby names from the U.S. Social Security Administration. For his analyses, McKinney used the National data, a series of annual U.S. birth name frequency files starting in 1880. He first consolidated all the data into a single CSV file with the attributes year, name, sex and frequency, a total of 1,724,892 records. He then manipulated and analyzed that data using the combination of Python and pandas. I was struck by the similarities between this approach and the R Project for Statistical Computing techniques I routinely use to perform similar tasks.
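The consolidation step described above might look something like the following in pandas. This is a minimal sketch rather than McKinney's actual code: the function name `consolidate_names`, the `yobYYYY.txt` file naming, and the headerless name,sex,count layout of each annual file are all assumptions made for illustration.

```python
from pathlib import Path

import pandas as pd


def consolidate_names(folder, pattern="yob*.txt"):
    """Stack annual name files into one frame: year, name, sex, frequency.

    Assumes (hypothetically) that each file is named like yob1880.txt and
    holds headerless rows of the form name,sex,count.
    """
    frames = []
    for path in sorted(Path(folder).glob(pattern)):
        # The year is taken from the last four characters of the file stem.
        year = int(path.stem[-4:])
        frame = pd.read_csv(path, names=["name", "sex", "frequency"])
        frame["year"] = year
        frames.append(frame)

    if not frames:
        raise FileNotFoundError(f"no files matching {pattern!r} in {folder}")

    # ignore_index=True gives the combined frame a fresh 0..n-1 index.
    combined = pd.concat(frames, ignore_index=True)
    return combined[["year", "name", "sex", "frequency"]]
```

With the combined frame in hand, `combined.to_csv("names.csv", index=False)` would write the single CSV file, and `combined.groupby(["year", "sex"])["frequency"].sum()` would give total recorded births by year and sex, the kind of summary an R user might produce with `aggregate`.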
