A Review of DevOps Troubleshooting: Linux Server Best Practice
One Minute Bottom Line
A must read for new Linux system administrators and anyone from the development side responsible for keeping a Linux server running.
I may be the perfect audience for this book. I run a small consultancy, and consider myself a developer/analyst first, but I am also the system administrator, DBA and ops team. To quote President Truman, “The buck stops here.” I have run at least one Linux server (RH/Fedora, RHEL, Ubuntu and most recently a pair of Raspberry Pis running Raspbian) since the late 1990s, starting with RH3. During that time I have faced a myriad of software issues and hardware failures. I have solved every one of these challenges, but every hour spent doing this is an hour I am not writing new bill-paying code. So finding better and faster ways to troubleshoot and correct server issues improves the bottom line, and this book provides the knowledge to do just that.
The book can be divided into three sections: general troubleshooting best practices (chapter 1), how to solve a specific server or system issue (chapters 2-9: slow servers, boot failures, disk issues, network issues, DNS failures, email problems, web server issues and database performance problems), and diagnosing hardware failures (chapter 10).
Section 1 – Best Practices
The tips contained in chapter 1 (e.g. the importance of communication, documenting problems and their solutions, etc.) may be common knowledge and common sense to most system administrators, but their value to those of us new to the operations role merits their inclusion. For example, a developer taking on the role of system administrator (especially one who is new to Linux) may not understand that most changes or problems do not require a reboot.
Section 2 – Specific Server Problems
Each chapter in section 2 begins with an explanation of how the server or system works and what the most common failures are. This is followed by the steps needed to diagnose and correct these problems. While none of these chapters is exhaustive, the author covers the most common problems, and I think he finds the right balance between brevity and detail, considering several of the chapters could warrant a book of their own (e.g. MySQL Troubleshooting: What To Do When Queries Don't Work). Brevity is also maintained by making heavy use of previously introduced material (when a web server is determined to be unreachable, the reader is directed to the network and DNS troubleshooting chapters).
There is almost always more than one way and more than one tool to do anything in Linux, and troubleshooting is no different. The author focuses on basic tools and techniques, tools that most distros include by default, and I think this is the right approach given the audience. In an emergency these tools are likely already present and won’t need to be installed. While 3rd party packages and monitoring/alerting tools exist and offer other benefits, the intended audience may not have the resources needed to install and maintain these tools.
The explanation of how to use the load average and top to determine whether a server is CPU, RAM or I/O bound was extremely helpful, and something I was able to apply immediately when tracking down an issue with my Raspberry Pi driven dashboard.
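To give a feel for the kind of reasoning involved, here is a minimal Python sketch of that triage. It is my own illustration, not code from the book: the function names and every threshold (load above the core count, 10% I/O wait, 80% CPU busy) are assumptions chosen for the example, and the percentages are the user, system and I/O wait figures that top reports in its CPU summary line.

```python
def classify_bottleneck(load1, cpus, user_pct, system_pct, iowait_pct, swap_used_kb):
    """Make a rough guess at what a loaded server is bound by.

    All thresholds here are illustrative assumptions, not values from the book.
    """
    # A 1-minute load average at or below the core count usually means
    # the run queue is keeping up, so nothing is obviously saturated.
    if load1 <= cpus:
        return "not overloaded"
    # Lots of time spent waiting on I/O: either the disk is the limit,
    # or the box is out of RAM and paging to swap (a crude check).
    if iowait_pct >= 10.0:
        return "RAM (swapping)" if swap_used_kb > 0 else "I/O"
    # High load with the CPUs busy on user/system work.
    if user_pct + system_pct >= 80.0:
        return "CPU"
    return "unclear"


def read_load1(path="/proc/loadavg"):
    """Return the 1-minute load average (Linux only)."""
    with open(path) as f:
        return float(f.read().split()[0])
```

On a live Linux box, read_load1() and os.cpu_count() could supply the first two arguments, with the rest read off top. Treat the result as a hint for where to look next rather than a diagnosis.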
Section 3 – Hardware Failures
While this section is far from comprehensive, it does cover the major causes of failure (hard drives, RAM, network cards, cooling and power supplies). This is again a chapter that could warrant an entire book of its own. I have experienced almost all of these problems at one time or another, and appreciated the clear, simple diagnostic steps presented here, since hardware failures are often the hardest to diagnose (they manifest as random or intermittent errors and are often blamed on software).
I think one addition could have made this book even better: I would have liked to see a troubleshooting flowchart added to each chapter. This would have made diagnosing a problem much quicker, and avoided the need to jump to other sections of the book (e.g. from the DNS chapter to the network chapter) and read through several pages to find the relevant commands while trying to solve a problem.
In spite of my opinion on the use or misuse of the term “devops”, I would like to see a series of follow-up titles, directed especially at smaller teams that may not have a dedicated ops team, system administrator or DBA, covering:
- Monitoring and alerting – How do I identify potential problems early, before they cause an outage or bring the server to its knees?
- Performance testing and capacity planning – When and how do I scale (up or out), and how do I identify potential issues before the app hits production?
- Change management – Including a range of solutions that fit small companies and teams.
- Configuration management – How to implement multiple environments.
This book is unique. I don’t know of any other book that collects all of this info in one place. While much of the info is available elsewhere on the web, that doesn’t do you much good when the server won’t boot or DNS or the network fails, and at 240 pages the book can easily be read on a cross-country flight. While it may not offer much to full-time ops teams and system administrators, for those of us who run a small number of servers and are not full-time system administrators it provides excellent value. It has easily repaid my investment of time (I was given an electronic version to review) and earned a prominent place on my bookshelf.
(Note: Opinions expressed in this article and its replies are the opinions of their respective authors and not those of DZone, Inc.)