JBoss Seam developer. Current tools/APIs: JSF, Facelets, Richfaces, JBoss AS, EJB3, JPA, Hibernate, Eclipse.

Apache Hadoop: The Definitive Guide Book Review

11.12.2009
Published by: O'Reilly Media
ISBN: 0596521979

Reviewer Ratings

Relevance: 5
Readability: 4
Overall: 4

One Minute Bottom Line

If you are a Java developer interested in learning about how to effectively deal with massive amounts of data, MapReduce for the Cloud, and the Apache Hadoop project, then this is the book for you.  If you are a developer with Ruby or C++ experience, then you may find the code examples difficult to follow (Java 6 is required to run Hadoop).

The pluses of the book are that it’s written by an expert in the field (Tom White has been an Apache Hadoop committer since 2007), it has plenty of great code examples and diagrams/figures, and it offers very good depth and technical accuracy.  The selection of topics for the chapters is very good as well (installation, administration, subprojects, MapReduce, and case studies are all covered).

The minuses are that you will have a difficult time following the example code if you are not a Java developer.  The writing style is very dry, with no jokes or other light touches.  Also, only a subset of the Hadoop subprojects is covered.  There is no coverage (or at least no separate chapters) for Avro, Hive, or Chukwa.

Review

Hadoop is relatively unknown to most of my fellow software developers; however, it is gaining popularity, and deployments are increasing at companies such as Amazon, Google and Yahoo.  The layout of the book is the typical reader-friendly O’Reilly format.  The accuracy, topic selection, examples and depth of the material are excellent.  This is definitely one of the more interesting computer books I have ever read.  You will learn about the Apache Hadoop project and its subprojects, Google’s MapReduce, case studies and installation directions.  Keep in mind that I have no professional experience with Hadoop.

The book begins with an introduction to Hadoop where you will learn that Doug Cutting, the creator of Apache Lucene, created Hadoop based on Nutch.  In January 2008, Hadoop was made a top-level project at Apache, demonstrating its popularity and success in the industry.  You will also learn that Hadoop is the name of the creator’s son’s toy elephant (which explains why there is an elephant on the cover of the book).

The problem domain that Hadoop addresses is large-scale data storage and analysis (often on clusters of thousands of nodes).  We learn that Hadoop provides a reliable shared storage and analysis system.  Storage is provided by the Hadoop Distributed Filesystem (HDFS), and analysis by MapReduce.

The author provides an easy-to-understand example of analyzing a weather dataset using MapReduce.  This is critical to understand because Hadoop is an open-source implementation of the MapReduce programming model made famous by Google.  The author covers the example in Java, Ruby, Python and C++.  This is great for programmers who may not know Java but have a C++ background, for example.
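To give a feel for the idea, here is a minimal sketch of my own in plain Python.  It is not code from the book and does not use any Hadoop API.  It assumes the analysis is something like finding the highest temperature recorded per year, and the input records, the comma-separated layout, and the function names (map_record, shuffle, reduce_max, run_job) are all made up for illustration.

```python
from collections import defaultdict

# Hypothetical "year,temperature" readings; illustrative only, not data from the book.
RECORDS = [
    "1949,111",
    "1949,78",
    "1950,0",
    "1950,22",
    "1950,-11",
]


def map_record(line):
    """Map step: turn one input line into a (year, temperature) pair."""
    year, temp = line.split(",")
    yield year, int(temp)


def shuffle(pairs):
    """Group values by key, standing in for what the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_max(year, temps):
    """Reduce step: keep the highest temperature seen for a year."""
    return year, max(temps)


def run_job(lines):
    """Run map, shuffle, and reduce in sequence on an in-memory list of lines."""
    mapped = (pair for line in lines for pair in map_record(line))
    grouped = shuffle(mapped)
    return dict(reduce_max(year, temps) for year, temps in sorted(grouped.items()))


if __name__ == "__main__":
    print(run_job(RECORDS))
```

In a real Hadoop job the map and reduce steps would run in parallel across the cluster and the grouping would be done by the framework; this sketch runs everything in one process only to show how the pieces fit together.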

The coverage of the Hadoop Distributed Filesystem was very good but the example was only presented in Java.  This is most likely because the Hadoop libraries are written in Java.

There are numerous excellent screenshots and diagrams throughout the book.  For example, the screenshot of the job page on page 137 provides a couple of tables and two graphs (reduce completion and map completion) which make it very easy to understand the details for a particular Hadoop job ID.  Another good example is Figure 6-1, a step-by-step system flow diagram that shows how Hadoop runs a MapReduce job.

I did not like the fact that I was unable to find a link to download the book’s source code.  That information was apparently missing in the “using code examples” section of the preface.  I also did not like the very dry and pedantic nature of the author’s discourse throughout the book.  I understand that the material is somewhat difficult but it’s nice to have a sense of humor sometimes (check out the Head First series for some fun reads).

There is excellent and thorough coverage of Pig, HBase and ZooKeeper, three of Hadoop’s subprojects.  There is also a very good chapter on setting up a Hadoop cluster which explains why RAID configurations are not better than JBOD (Just a Bunch of Disks) configurations for HDFS.  We also learn that Java 6 is required to run Hadoop.  The chapter on administering Hadoop then provides insightful details on running in safe mode, audit logging, and metrics.

On the downside, only a subset of the Hadoop subprojects is covered.  There is no coverage (or at least no separate chapters) for Avro, Hive, or Chukwa.  There is a reference to Hive in one of the case studies, however.

Overall, I would say that if you are new to Hadoop (or even just new to MapReduce or large-scale data processing in general), you will learn a lot from this book.  The case studies from Hadoop committers provide some insight into real-world use cases with Hadoop.  You will get exposure to other technologies like Hive and Nutch in the case studies.

Published at DZone with permission of its author, Arbi Sookazian.

(Note: Opinions expressed in this article and its replies are the opinions of their respective authors and not those of DZone, Inc.)