This fall, prestigious Columbia University in New York City is offering a course entitled “Introduction to Data Science,” taught by a team under the direction of Google Statistician and Columbia Assistant Professor Rachel Schutt. The class is an outgrowth of the recently created Institute for Data Sciences and Engineering, a joint initiative between Columbia and New York City.
According to the class Web page, “This course is an introduction to the interdisciplinary and emerging field of data science, which lies at the intersection of statistics, computer science, data visualization and the social sciences.” At the end of the semester, students should understand what it’s like to be a data scientist and “be able to do some of what a data scientist does.” Prerequisites include some knowledge of linear algebra, basic probability and statistics, and basic programming skills.
Instruction themes revolve around statistics/machine learning, data programming languages and big data tools, all illustrated by case studies. Delivery consists of core content lectures, labs and guest presentations from selected data science experts. The initial class of over 60 draws students from a variety of disciplines and includes undergraduates, graduate students and faculty. The course attempts to address the needs of aspiring and working professionals as well as academic researchers.
Each week highlights a data science (DS) topic, reinforced by a statistical/ML content lecture from Schutt, a related lab exercise, and an illustrative guest lecture by a DS practitioner. Topics include exploratory data analysis, visualization, supervised and unsupervised learning, logistic regression, decision trees, time series, sampling, experimental design, recommendation engines, causal modeling, social network analysis, data journalism and data engineering focused on big data.
The recommended class texts are the usual outstanding data science suspects on machine learning, probability and statistical analysis, programming with R, Python and Hadoop, and visualization. Eighty percent of the course grade is determined by performance on homework assignments and a team-based class project modeled on Kaggle competitions. Assignments and project work are generally completed in R and Python. Assignment 1 introduces R for basic data programming and visualization, and has students develop a data strategy for RealDirect, a website designed “to make selling and buying a home easier.”
“Introduction to Data Science” (IDS) looks to be a great start to a curriculum that would legitimize data science as an area of academic inquiry. Schutt acknowledges that IDS is a version-one product that will evolve. But as a stake in the ground, IDS appears to cover most data science bases.
My bet is that a one-year master’s program in data science will emerge from the Institute for Data Sciences and Engineering in the near future. What would also be nice is an undergraduate certificate program involving, say, three or four core data science courses for quantitatively oriented students. Instruction would be in applied statistics, machine learning, data/statistical programming and visualization. Sign me up!