Grant Ingersoll is a committer on the Apache Lucene and Apache Solr projects, as well as the current Lucene PMC chair. He is also a founding team member of Lucid Imagination.

Mahout in Action Review

10.17.2011

You know your (technical) baby is (almost) grown up when the book on the project finally comes out. Such is the case for Apache Mahout, thanks to Manning Publications shipping Mahout in Action this week. So, before I start my review, let me first say congratulations to Sean, Robin, Ted, Ellen and Manning for producing such an excellent product. The simplest praise I can give it is to put it on the same level as one of the best introductory technology books I know: Lucene in Action. In other words, it sets the standard by which all other Mahout books will be judged.

As for the actual book, it is broken down into three sections, which I like to call the “three C’s”:

  1. Collaborative Filtering
  2. Clustering
  3. Classification


So, without further ado, let’s take a deeper look at the book in this context of the three C’s.

Collaborative Filtering

Collaborative filtering is by far one of the most popular parts of Mahout, being used in places like Amazon and Foursquare. This section of the book walks you nicely through both the concepts and the practical aspects of collaborative filtering over five chapters. Chapter 2 starts by getting you up and running using the GroupLens dataset for movie recommendations. For those unfamiliar with collaborative filtering, this makes for a nice entrance into the subject with data everyone can relate to easily. Chapter 3 then discusses how best to model your data, while chapter 4 looks at the mechanics of actually generating recommendations from the data. Chapters 5 and 6 then discuss the ins and outs of taking a recommendation engine into production, including details on how to scale it out using Apache Hadoop. I found the explanation of the Hadoop-based co-occurrence process (via RecommenderJob) especially useful, as I recently committed MAHOUT-798, which uses it to build an example recommendation system based on user interaction with email. In fact, I relied heavily on all of the concepts in this part of the book, as I first had to extract and clean the data, then properly model it, before finally running the recommendation task on EC2.

When I first got access to the MEAP for this book (quite some time ago), I did not have a lot of background in collaborative filtering. These chapters really helped fill in the practical details for me and provided a good foundation for the theory behind collaborative filtering. I think they will serve others who are looking to get started with collaborative filtering just as well.
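
For readers who have not met the co-occurrence idea mentioned above, here is a rough, single-machine sketch of it in plain Python. It is not Mahout's RecommenderJob and not code from the book; the toy data and the function names (`cooccurrence_counts`, `recommend`) are made up for illustration. The idea is simply to count how often two items show up together in users' histories, then score unseen items by how often they co-occur with what a user already has.

```python
from collections import defaultdict
from itertools import permutations


def cooccurrence_counts(user_items):
    """For each ordered item pair (a, b), count the users who have both."""
    counts = defaultdict(lambda: defaultdict(int))
    for items in user_items.values():
        for a, b in permutations(set(items), 2):
            counts[a][b] += 1
    return counts


def recommend(user, user_items, counts, top_n=3):
    """Score unseen items by summed co-occurrence with the user's items."""
    seen = set(user_items[user])
    scores = defaultdict(int)
    for item in seen:
        for other, count in counts[item].items():
            if other not in seen:
                scores[other] += count
    # Highest score first; break ties by name so the order is deterministic.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked[:top_n]]


# Hypothetical toy data: user -> items they interacted with.
user_items = {
    "alice": ["matrix", "inception", "memento"],
    "bob": ["matrix", "inception"],
    "carol": ["inception", "memento", "amelie"],
    "dave": ["matrix"],
}
counts = cooccurrence_counts(user_items)
print(recommend("dave", user_items, counts))  # ['inception', 'memento']
```

A Hadoop-based job would presumably distribute the counting and scoring across machines; the sketch only shows the arithmetic on a dataset small enough to check by hand.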

Clustering

Similar to collaborative filtering, the clustering section starts off by introducing the basic concepts and then quickly gets you up and running with an example clustering run. Chapter 8 then gets into how best to do feature selection for clustering. Feature selection is often one of the keys to successful clustering, so make sure you have a good grasp of the contents of that chapter before moving ahead into chapter 9, which gets into some of Mahout’s clustering algorithms. That chapter primarily focuses on K-Means and Dirichlet, but also covers a few others. Note that Mahout actually has a few more clustering algorithms than the ones described, like spectral, canopy, meanshift and minhash. Of course, some of these were added later in the book cycle, so it is hard to complain that they weren’t incorporated. Chapter 10 then covers what is, in my experience, one of the harder aspects of clustering, namely how to evaluate the results. This chapter is a little bit thin, but the overall field seems to be the same, so this is not a put-down of the chapter! There simply aren’t a lot of great tools available for evaluating clustering.

Chapter 11 then adds some meat onto the bones of taking clustering to production, including information on leveraging clustering in a Hadoop cluster. Chapter 12 adds some nice concreteness to the section by looking at clustering of real data sets from Twitter, Last.fm and Stack Overflow. For those looking to kick the tires with some real data, be sure to check out that chapter.
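
To give a flavor of K-Means, the algorithm chapter 9 spends most of its time on, here is a minimal sketch in plain Python. It is not Mahout's implementation or the book's code; the function name `kmeans`, the toy points and the starting centroids are all assumptions made for illustration. The loop alternates between assigning each point to its nearest centroid and moving each centroid to the mean of its points.

```python
import math


def kmeans(points, centroids, iterations=10):
    """Alternate an assignment step and a centroid update step."""
    centroids = [tuple(c) for c in centroids]
    assignments = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignments = [
            min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for i, old in enumerate(centroids):
            members = [p for p, a in zip(points, assignments) if a == i]
            if members:
                new_centroids.append(
                    tuple(sum(dim) / len(members) for dim in zip(*members))
                )
            else:
                new_centroids.append(old)  # leave an empty cluster in place
        if new_centroids == centroids:
            break  # nothing moved, so the clustering has converged
        centroids = new_centroids
    return centroids, assignments


# Hypothetical toy data: two obvious groups of 2-D points.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centroids, assignments = kmeans(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
print(assignments)  # [0, 0, 0, 1, 1, 1]
```

Real data is rarely this tidy, which is where the feature selection and evaluation chapters come in.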

Classification

Classification is very popular these days both in search and beyond, so it is great to see this set of chapters covering the topic so well in practical, accessible terms.  As you would expect, the first chapter (13) gets you up and running as well as introduces the concepts of classification.  This chapter has a great explanation of how classification works and a typical workflow for building a classifier.

 

Chapter 14 then delves into the details of actually training a classifier using Mahout’s Stochastic Gradient Descent algorithm as well as its Bayesian classifier.
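
As a rough picture of what stochastic gradient descent training looks like, here is a minimal logistic regression sketch in plain Python. It is not Mahout's SGD classifier or its API; the function names (`train_sgd`, `predict`), the learning rate, the epoch count and the toy data are all assumptions chosen for illustration. The key point is that the weights are nudged after every single example rather than after a full pass over the data.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_sgd(examples, epochs=200, learning_rate=0.1, seed=42):
    """Fit logistic regression weights, updating after each example."""
    rng = random.Random(seed)
    weights = [0.0] * len(examples[0][0])
    bias = 0.0
    data = list(examples)
    for _ in range(epochs):
        rng.shuffle(data)  # visit the examples in a fresh order each epoch
        for features, label in data:
            z = bias + sum(w * x for w, x in zip(weights, features))
            error = label - sigmoid(z)
            # Step along the gradient of the log-likelihood for this example.
            bias += learning_rate * error
            weights = [w + learning_rate * error * x
                       for w, x in zip(weights, features)]
    return weights, bias


def predict(weights, bias, features):
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1 if sigmoid(z) >= 0.5 else 0


# Hypothetical toy data: (features, label) pairs that are easy to separate.
examples = [
    ((0.0, 0.0), 0), ((0.0, 1.0), 0), ((1.0, 0.0), 0),
    ((3.0, 3.0), 1), ((4.0, 3.0), 1), ((3.0, 4.0), 1),
]
weights, bias = train_sgd(examples)
print(predict(weights, bias, (0.0, 0.0)), predict(weights, bias, (4.0, 4.0)))
```

Because each update touches only one example, this style of training suits data that arrives as a stream or is too large to hold in memory at once.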

The next chapter then takes a look at how best to evaluate a classifier and offers some insight into what happens when a classifier goes bad. Be sure to check this out, as you will no doubt run into many of the issues covered. As an aside, I couldn’t help thinking of the classic “Far Side” cartoon upon reading that section heading.

The penultimate classification chapter digs into the practical aspects of deploying a classifier in production, including details on working through your scale and speed requirements. It finishes off with an example Apache Thrift-based server, which some may find a useful starting point for their applications. Finally, Mahout in Action closes with a case study of how Shop It To Me uses a Mahout classifier to recommend offers to customers. As with any technical book, it is great to have some concrete discussion of how this stuff functions in the wild.

What’s Missing (i.e. When’s the 2nd edition coming out?)

Mahout has a number of other interesting things in various stages of development, like frequent pattern mining, Singular Value Decomposition (feature reduction), evolutionary programming, and integration libraries for input/output, as well as tools for storing data in Cassandra and Mongo. Since Mahout is developing pretty quickly, the absence of these from the book is no fault of the authors; I’m just putting it up here so that people are aware that Mahout has more to offer, even if the three “C’s” are the most popular.

All in all, Mahout in Action is an excellent introduction to the project.  Naturally I’m biased, but, pun intended, I highly recommend the book!

Published at DZone with permission of its author, Grant Ingersoll. (source)
