I'm a big fan of Amazon for Internet shopping. I bet I spend over $500 annually as a Prime customer on computer and statistical books alone. And most of the time, I get exactly what I expect.
There are those rare occasions, though, when what I get is different from what I thought I'd bought. Alas, a digital table of contents is not always a perfect substitute for good old-fashioned in-store browsing.
Such was the case with a recent O'Reilly book I purchased, “Programming Collective Intelligence” by Toby Segaran. What I thought I bought was a tome, partly on "combining of behavior, preference or ideas of a group of people to create novel insights," and partly a description of APIs to social websites like Facebook, Twitter and LinkedIn, along with agile language code snippets and programs to illustrate their use.
What I got instead was a primer on machine learning with Web applications. The author characterizes machine learning as a "subfield of artificial intelligence (AI) concerned with algorithms that allow computers to learn ... an algorithm is given a set of data and infers information about the properties of the data – and that information allows it to make predictions about other data that it might see in the future."
There are two basic types of ML. The first is supervised learning, in which a model is trained on a set of input variables paired with a known output variable so that the output can be predicted for new cases in the future. The second, unsupervised learning, has no variable to ultimately predict; rather, it serves to "find structure within a data set where no one piece of data is the answer."
Truth be told, I wasn't in the market for yet another ML book. I have all I can handle with “Elements of Statistical Learning” by Hastie, Tibshirani and Friedman. So I was preparing to send the book back for a refund before I started browsing, in the process noticing stark differences between this and other ML books I own.
Instead of formal logic/mathematical developments of each method with code examples in R or Matlab, PCI explains the procedures in straightforward English, and then programs simple illustrations in Python. Yes, the examples are toy, but simple is often better for first learning. Rather than the forest you often get with many data mining books, this one focuses on the trees. And I must admit, the approach resonated with me. The more I investigated, the more I liked what I saw. After an hour or so, I dropped the plan to return the book.
The first model covered is collaborative filtering, a method used by companies like Amazon and Netflix to make recommendations of additional products to buy/rent for customers. The author uses Python's dictionary capabilities to store information on people and their preferences. He then shows how to calculate similarity scores across individuals using Euclidean distance and Pearson correlation. Once the scores are in place, they're ranked and used to determine whose tastes are most similar.
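To make the similarity idea concrete, here is a minimal sketch of my own, not the book's code. The ratings dictionary and the function names (`sim_euclidean`, `sim_pearson`, `top_matches`) are assumptions made up for illustration; the Euclidean score is mapped to the range 0 to 1 with 1 / (1 + distance), which is one common convention rather than the only one.

```python
from math import sqrt

# Hypothetical ratings: person -> {item: score}
prefs = {
    "Ann": {"Book A": 4.0, "Book B": 3.0, "Book C": 5.0},
    "Bob": {"Book A": 4.5, "Book B": 2.5, "Book C": 4.5},
    "Cat": {"Book A": 1.0, "Book B": 5.0, "Book C": 2.0},
}

def shared_items(prefs, a, b):
    """Items that both people have rated."""
    return [item for item in prefs[a] if item in prefs[b]]

def sim_euclidean(prefs, a, b):
    """Distance-based similarity: 1.0 for identical ratings, near 0 for distant ones."""
    items = shared_items(prefs, a, b)
    if not items:
        return 0.0
    distance = sqrt(sum((prefs[a][i] - prefs[b][i]) ** 2 for i in items))
    return 1.0 / (1.0 + distance)

def sim_pearson(prefs, a, b):
    """Pearson correlation of the two people's ratings on shared items."""
    items = shared_items(prefs, a, b)
    n = len(items)
    if n == 0:
        return 0.0
    xs = [prefs[a][i] for i in items]
    ys = [prefs[b][i] for i in items]
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # correlation is undefined for constant ratings
    return cov / sqrt(var_x * var_y)

def top_matches(prefs, person, n=5, similarity=sim_pearson):
    """Rank everyone else by similarity to `person`, most similar first."""
    scores = [(similarity(prefs, person, other), other)
              for other in prefs if other != person]
    scores.sort(reverse=True)
    return scores[:n]
```

Calling `top_matches(prefs, "Ann")` ranks Bob ahead of Cat under either measure. Pearson has the advantage of ignoring the fact that one person may simply rate everything higher than another.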
After a pretty straightforward normalization step, you have the basis of a recommendation system. "All you have to do is set up a dictionary of people, items and scores, and you can use this to create recommendations for any person." Even though the example was a trivial one, I found it comprehensible and the Python implementation straightforward.
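As a rough illustration of how the normalization step can work, here is a sketch that scores each unseen item by a similarity-weighted average of other people's ratings. It is my own toy version under stated assumptions: the data, the `similarity` helper (a Euclidean-based score) and the `recommend` function are all hypothetical names, not the book's listing.

```python
from math import sqrt

# Hypothetical ratings: person -> {item: score}
prefs = {
    "Ann": {"Book A": 4.0, "Book B": 3.0},
    "Bob": {"Book A": 4.5, "Book B": 2.5, "Book C": 5.0, "Book D": 2.0},
    "Cat": {"Book A": 1.0, "Book B": 5.0, "Book C": 1.0, "Book D": 4.0},
}

def similarity(prefs, a, b):
    """Euclidean-based similarity in (0, 1]; 0.0 if nothing is shared."""
    shared = [i for i in prefs[a] if i in prefs[b]]
    if not shared:
        return 0.0
    distance = sqrt(sum((prefs[a][i] - prefs[b][i]) ** 2 for i in shared))
    return 1.0 / (1.0 + distance)

def recommend(prefs, person):
    """Predict scores for items `person` has not rated, best first."""
    totals = {}    # item -> sum of similarity * rating
    sim_sums = {}  # item -> sum of similarities of the people who rated it
    for other in prefs:
        if other == person:
            continue
        sim = similarity(prefs, person, other)
        if sim <= 0:
            continue  # ignore people with nothing in common
        for item, score in prefs[other].items():
            if item in prefs[person]:
                continue  # only recommend unseen items
            totals[item] = totals.get(item, 0.0) + sim * score
            sim_sums[item] = sim_sums.get(item, 0.0) + sim
    # Normalization: divide by the total similarity so that an item rated
    # by many people is not favored merely for having more raters.
    ranked = [(totals[item] / sim_sums[item], item) for item in totals]
    ranked.sort(reverse=True)
    return ranked
```

With this data, `recommend(prefs, "Ann")` puts Book C ahead of Book D, because Bob, whose tastes are closer to Ann's, rated it highly.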
The author next takes on the discovery of groups using several important clustering algorithms. First up is hierarchical clustering, which "builds up a hierarchy of groups by continuously merging the two most similar groups." The mostly comprehensible Python example determines the groups and produces a visual, a dendrogram, to display the cluster composition. The author also implements the popular k-means clustering, which takes the number of clusters as an input and determines the size of the clusters based on the structure of the data.
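The merging process is easier to see in code. The following is a minimal sketch of agglomerative clustering on 2-D points, assuming centroid distance as the measure of how similar two groups are; that choice, the sample points and the function names are mine, not necessarily the book's. It needs Python 3.8 or later for `math.dist`.

```python
from math import dist  # Euclidean distance between two points (Python 3.8+)

def centroid(points):
    """Coordinate-wise mean of a list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def hierarchical(points):
    """Repeatedly merge the two closest clusters.

    Returns the merge history as (cluster_a, cluster_b, distance) tuples,
    which is the information a dendrogram would draw.
    """
    clusters = [[p] for p in points]  # start with one cluster per point
    merges = []
    while len(clusters) > 1:
        best = None
        # Find the pair of clusters whose centroids are closest.
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = dist(centroid(clusters[i]), centroid(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges

# Hypothetical sample data: two tight pairs and one point in between.
points = [(0, 0), (0, 1), (10, 10), (10, 11), (5, 5)]
merges = hierarchical(points)
```

Each pass compares every pair of clusters, so this brute-force version is only suitable for small data sets. The two tight pairs merge first, and the final merge joins everything into a single group.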
I found the narrative on searching/ranking and optimization most informative, with material not covered in standard ML texts. The author constructs a simple search engine, with a crawler and index builder. He takes on word frequency and distance, then codes a PageRank algorithm. I must admit that even his simplified code was a bit much for me. I was able to follow the random search, hill-climbing and simulated annealing optimization programs a bit better, having been exposed to similar R packages a few years back.
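For readers who, like me, find the ranking code heavy going, the core of PageRank fits in a few lines. This is a sketch of the classic iterative formulation with a damping factor of 0.85, not the author's listing; the tiny link graph, the fixed iteration count and the `pagerank` function name are assumptions for illustration, and pages with no outbound links are simply not redistributed.

```python
def pagerank(links, damping=0.85, iterations=20):
    """Iteratively score pages from a dict of page -> set of pages it links to."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    rank = {page: 1.0 for page in pages}  # every page starts with the same score
    for _ in range(iterations):
        new_rank = {}
        for page in pages:
            # Each linking page passes on its score, split evenly
            # among all of its outbound links.
            inbound = sum(rank[src] / len(targets)
                          for src, targets in links.items()
                          if page in targets)
            new_rank[page] = (1 - damping) + damping * inbound
        rank = new_rank
    return rank

# Hypothetical three-page site.
links = {
    "home": {"about", "blog"},
    "about": {"home"},
    "blog": {"home", "about"},
}
rank = pagerank(links)
```

In this graph "home" ends up with the highest score because both other pages link to it, and "about" gives it its entire vote.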
The additional chapters on usual suspect ML models such as decision trees, kernel methods and support vector machines are pretty good. The algorithm summary at the end with Python code snippets is also solid. These chapters would serve as a nice foundation for taking on the ESL bible.
Overall, PCI's focus is core business applications of ML rather than an exhaustive survey of the latest methods. Indeed, if you're looking to gain a solid conceptual understanding of recommendation engines, clustering, and classification – the core of the Apache Mahout machine learning library, incidentally – PCI's a great place to start. And if, like me, you learn from well-worked toy examples showing real code, PCI might be an important addition to your analytics library.