Book Review: Hadoop - The Definitive Guide (3rd ed.)
One Minute Bottom Line
Overall the writing style is accessible, though a bit verbose at times. The logical progression of content is fine for the most part, though some sections are buried deep inside chapters, making them a bit hard to find (the section on Cascading is a good example). I also think the section on writing a MapReduce application comes a bit too late in the book. Those points aside, I think the book is a good reference that captures new developments in Hadoop and its ecosystem of projects sufficiently across its breadth.
Review
Hadoop has now become the de facto standard for large scale data analytics. With all the rage behind "Big Data" and "NoSQL", Hadoop is well positioned as the framework of choice for analyzing the large volumes of data handled by these systems. Over the years Hadoop has matured considerably, gaining a lot of enhancements and features, and the number of projects in the Hadoop ecosystem is simply impressive. So the need for an authoritative reference on all things Hadoop is greater than ever, a need which "Hadoop: The Definitive Guide", now in its third edition, aims to fulfill.
The first chapter of the book gives a good background on the history of Hadoop and the need for such a project, with real world examples. I found the comparisons with existing systems (Grid, RDBMS etc.) illuminating, as they give the reader an idea of what Hadoop is and is not. A comparison with Dryad, a project with somewhat similar objectives to Hadoop (now defunct?), would have added more perspective as well, I feel.
One of the first things that puts off newcomers to Hadoop is its complex release version history. Finding the way through the myriad release branches is perhaps more daunting than getting started with a Hadoop 'Hello World'. As far as I am concerned, the section on Hadoop releases is a welcome addition over the second edition of the book, though I would have liked a nice graphic of the Hadoop release tree (or is it a graph?) to complement this section.
The MapReduce and streaming examples in the second chapter are pretty accessible, with a good comparison that explains why Hadoop does a better job than a plain old awk script when it comes to handling a large dataset efficiently. However, the highlight here for me is the comparison of the old and new Hadoop APIs, which can be a source of confusion for many.
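To give a flavour of the programming model those examples teach, here is a minimal sketch in plain Python of the map, shuffle and reduce steps that a streaming-style job goes through. It is not taken from the book: the record format, the sample data and the function names (`mapper`, `reducer`, `run_job`) are assumptions made purely for illustration, and the in-memory sort merely stands in for Hadoop's shuffle-and-sort phase.

```python
from itertools import groupby
from operator import itemgetter


def mapper(lines):
    """Turn hypothetical 'city,temperature' records into (key, value) pairs."""
    for line in lines:
        city, temp = line.strip().split(",")
        yield city, int(temp)


def reducer(sorted_pairs):
    """Reduce key-sorted (key, value) pairs to the maximum value per key."""
    for key, group in groupby(sorted_pairs, key=itemgetter(0)):
        yield key, max(value for _, value in group)


def run_job(lines):
    """Run map, then a sort standing in for the shuffle, then reduce."""
    shuffled = sorted(mapper(lines), key=itemgetter(0))
    return dict(reducer(shuffled))


# Made-up sample input, not data from the book.
records = ["oslo,12", "paris,31", "oslo,-4", "paris,27"]

if __name__ == "__main__":
    print(run_job(records))  # {'oslo': 12, 'paris': 31}
```

On one machine this is no better than the awk script; the point of the model is that the map and reduce steps are independent per record and per key, so a framework like Hadoop can spread them across many nodes.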
The section on HDFS introduces the new HDFS features of federation and High Availability, although an in-depth treatment of these topics and their practical aspects has been left out. A light introduction to Flume and Sqoop has been added as well. Sqoop is covered in depth in a later section of the book.
For 'learn it by doing' types, the section on developing a MapReduce application is a real gem, not only because it gives code examples but also because it gives instructions on setting up the development environment, which most tutorials leave unmentioned. However, I felt the section is a bit long and verbose for someone wanting to grasp the basic concepts quickly. The tutorial is developed incrementally and now includes a section on writing unit tests with MRUnit. This section covers the entire lifecycle of a MapReduce application, from development to running it in production, including debugging, profiling and tuning. The section on Oozie has been expanded to cover deploying the application developed in the tutorial as a workflow. The impatient who want to get hands-on quickly can, I feel, go straight from the first couple of chapters to this one and safely leave the sections on HDFS and I/O for later reference.
A new section on Hadoop 2.0, based on YARN, has been added. It describes pretty comprehensively how things work under the hood in the new system, with illustrative diagrams, making it a good addition, especially since there are very few resources accessible online detailing the new Hadoop architecture. I also found the section on failure modes in old Hadoop systems useful, as it let me fully appreciate the reasons behind the major architectural changes made in Hadoop 2.0. However, practical application development examples based on YARN have been left out of this edition.
Sections on MapReduce types (Input/OutputFormat etc.) and features retain useful information from the earlier edition, along with some additions comparing the old and new MapReduce APIs, which are quite useful. The Hadoop administration sections contain new material on administering YARN based systems as well as a newly added section on Apache Whirr. The chapters on Hadoop related projects (Pig, Hive, Sqoop etc.) cover enough detail to get started and have kept up with the developments in newer versions of those projects. However, a notable omission is an introductory section on Apache Mahout, which I think caters to an interesting application area of Hadoop. I would also have liked a section on Apache Cassandra, especially given the traction it is getting in the NoSQL space, with most of the above projects supporting Cassandra integration (Pig, Hive, Cascading 2.0 etc.). As for the use cases, a Hadoop use case in a multi-tenant environment (which I think presents a new set of challenges) would have been a good addition now that cloud based analytics offerings are gaining some traction.
Overall the writing style is accessible, though a bit verbose at times. The logical progression of content is fine for the most part, though some sections are buried deep inside chapters, making them a bit hard to find (the section on Cascading is a good example). I also think the section on writing a MapReduce application comes a bit too late in the book. Those points aside, I think the book is a good reference that captures new developments in Hadoop and its ecosystem of projects sufficiently across its breadth.
(Note: Opinions expressed in this article and its replies are the opinions of their respective authors and not those of DZone, Inc.)