Apache Hadoop: The Definitive Guide Book Review
ISBN: 0596521979
One Minute Bottom Line
If you are a Java developer interested in learning how to deal effectively with massive amounts of data, MapReduce for the Cloud, and the Apache Hadoop project, then this is the book for you. If you are a developer with Ruby or C++ experience, then you may find the code examples difficult to follow (Java 6 is required to run Hadoop). The pluses of the book are that it’s written by an expert in the field (Tom White has been an Apache Hadoop committer since 2007), it has plenty of great code examples and diagrams/figures, and it offers very good depth and technical accuracy. The selection of topics for the chapters is very good as well (installation, administration, subprojects, MapReduce, and case studies are all covered). The minuses are that you will have a difficult time following the example code if you are not a Java developer, and that the writing style is very dry, with no jokes. Also, only a subset of the Hadoop subprojects is covered. There is no coverage (or at least no separate chapters) for Avro, Hive, or Chukwa.
Review
Hadoop is a technology that is relatively unknown to most of my fellow
software developers. However, it is gaining popularity, and deployments are
increasing at companies such as Amazon, Google and Yahoo. The layout of the
book is the typical reader-friendly O’Reilly format. The accuracy, topic
selection, examples and depth of the material are excellent. This is
definitely one of the more interesting computer books I have ever read. You
will learn about the Apache Hadoop project and its subprojects, Google’s
MapReduce, case studies and installation directions. Keep in mind that I
have no professional experience with Hadoop.
The book begins with an introduction to Hadoop, where you
will learn that Doug Cutting, the creator of Apache Lucene, created Hadoop
as an outgrowth of Nutch. In January 2008, Hadoop was made a top-level
project at Apache, demonstrating its popularity and success in the industry.
You will also learn that Hadoop is the name of the creator’s son’s toy
elephant (which explains why there is an elephant on the cover of the book).
The problem domain that Hadoop addresses is large-scale
data storage and analysis (often with clusters formed of thousands of
nodes). We learn that Hadoop provides a reliable shared storage and
analysis system. The storage is provided by the Hadoop Distributed
Filesystem (HDFS), and the analysis by MapReduce.
The author provides an easy-to-understand example of
analyzing a weather dataset using MapReduce. This is critical to
understand because Hadoop is an open-source implementation of the MapReduce
programming model made famous by Google. The author covers the example in
Java, Ruby, Python and C++. This is great for programmers who may not know
Java but have a C++ background, for example.
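To give a feel for the map and reduce steps, here is a minimal sketch in plain Python. It does not use Hadoop and it is not the book’s code. It assumes the task is to find the highest temperature per year, and the record format, the sample values and the names map_record, shuffle, reduce_max and run_job are all made up for illustration.

```python
from collections import defaultdict

# Hypothetical, simplified records: "<year>,<temperature in tenths of a degree>".
# Real weather data is formatted differently; this is only for illustration.
RECORDS = [
    "1949,111",
    "1949,78",
    "1950,0",
    "1950,22",
    "1950,-11",
]


def map_record(line):
    """Map step: emit one (year, temperature) pair for an input line."""
    year, temp = line.split(",")
    yield year, int(temp)


def shuffle(pairs):
    """Group values by key, the job a framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_max(year, temps):
    """Reduce step: keep the highest temperature seen for a year."""
    return year, max(temps)


def run_job(lines):
    """Run map, shuffle and reduce in a single process."""
    pairs = (pair for line in lines for pair in map_record(line))
    grouped = shuffle(pairs)
    return dict(reduce_max(year, temps) for year, temps in sorted(grouped.items()))


if __name__ == "__main__":
    print(run_job(RECORDS))
```

In a real cluster the map calls would run in parallel on many nodes and the framework would handle the grouping; the sketch only shows how the two functions divide the work.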
The coverage of the Hadoop Distributed Filesystem was very
good, but the example was presented only in Java. This is most likely
because the Hadoop libraries are written in Java.
There are numerous excellent screenshots and diagrams
throughout the book. For example, the screenshot of the job page on page 137
provides a couple of tables and two graphs (reduce completion and map
completion), which make it very easy to understand the details for a
particular Hadoop job ID. Another good example is Figure 6-1, a step-by-step
system flow diagram that shows how Hadoop runs a MapReduce job.
I did not like the fact that I was unable to find a link to
download the book’s source code. That information was apparently missing
from the “using code examples” section of the preface. I also did not like
the very dry and pedantic tone of the writing throughout the book. I
understand that the material is somewhat difficult, but it’s nice to have a
sense of humor sometimes (check out the Head First series for some fun
reads).
There is excellent and thorough coverage of Pig, HBase and
ZooKeeper, three of Hadoop’s subprojects. There is also a very good chapter
on setting up a Hadoop cluster, which explains why RAID configurations are
no better than JBOD (Just a Bunch of Disks) configurations for HDFS. This is
also where we finally learn that Java 6 is required to run Hadoop. The
chapter on administering Hadoop then provides insightful details on running
in safe mode, audit logging, and metrics.
On the downside, only a subset of the Hadoop subprojects is
covered. There is no coverage (or at least no separate chapters) for Avro,
Hive, or Chukwa. There is a reference to Hive in one of the case studies,
however.
Overall, I would say that if you are new to Hadoop (or even
just new to MapReduce or large-scale data processing in general), you will
learn a lot from this book. The case
studies from Hadoop committers provide some insight into real-world use cases
with Hadoop. You will get exposure to
other technologies like Hive and Nutch in the case studies.
If you are a Java developer interested in learning more
about how to effectively deal with analyzing massive amounts of data, MapReduce
for the Cloud and the Apache Hadoop project with its many subprojects, then
this is the book for you. If you are a
developer with Ruby or C++ experience, then you may find the code examples
difficult to follow (Java 6 is required to run Hadoop).
The pluses of the book are that it’s written by an expert in
the field (Tom White has been an Apache Hadoop committer since 2007), it has
plenty of great code examples and diagrams/figures, and it offers very good
depth and technical accuracy. The selection of topics for the chapters is
very good as well (installation, administration, subprojects, MapReduce, and
case studies are all covered).
The minuses are that you will have a difficult time
following the example code if you are not a Java developer, and that the
writing style is very dry, with no jokes. Also, only a subset of the Hadoop
subprojects is covered. There is no coverage (or at least no separate
chapters) for Avro, Hive, or Chukwa.
(Note: Opinions expressed in this article and its replies are the opinions of their respective authors and not those of DZone, Inc.)





