Book Review: Mastering Regular Expressions

reviewed by Danny R. Faught
published in the Dallas/Fort Worth Unix Users Group Newsletter, March, 2002


A review of the book Mastering Regular Expressions, Jeffrey E. F. Friedl, O'Reilly & Associates, Inc., 1997, ISBN 1-56592-257-3, 7th printing

This book has been on my reading list for a while, and I finally got around to tackling it to help with writing a section on regular expressions in a training course that I developed. What are regular expressions? According to the Perl regular expressions tutorial, "A regular expression is simply a string that describes a pattern." In scripting languages, text editors, and a few other types of tools, these patterns can help you match text. You can pull apart pieces of the matched text for further processing, and if you'd like, you can replace parts of the text. That's a very general description of a very general tool that has uses in just about any text processing task.
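To make those three operations (match, pull apart, replace) concrete, here is a minimal sketch. It is written in Python rather than the Perl that the book favors, and the log line and the pattern are invented for illustration; nothing here is taken from the book.

```python
import re

# Hypothetical input line, made up for this example.
line = "2002-03-14 ERROR disk /dev/sda1 is 97% full"

# Match: a year-month-day date, a severity word, then the rest of the line.
# Each pair of parentheses captures one piece of the matched text.
pattern = re.compile(r"(\d{4})-(\d{2})-(\d{2}) (\w+) (.*)")

# Pull apart: the captured pieces come back in order.
match = pattern.match(line)
year, month, day, level, message = match.groups()

# Replace: rewrite the date as month/day/year, keeping the other pieces.
rewritten = pattern.sub(r"\2/\3/\1 \4 \5", line)
```

The same pattern does all three jobs: it decides whether the line matches at all, it hands back the date and severity as separate strings, and it drives the rewrite.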

I've used regular expressions ("regexes" for short) for many different things, such as in automated test code that checks program output against expected results, in CGI scripts that handle forms on my web page, and in my email filter. A single regex can replace dozens of lines of parsing code. Regexes are very compact and powerful, but that compactness also leads many people to complain that they're difficult to understand after you've written them. Regexes are combinations of /, \, ^, $, (, and many other punctuation characters, along with ordinary letters and numbers. The most extreme example, on the last page of the book, is a 6,600-character monstrosity that looks somewhat like a photo mosaic when viewed from a distance. However, the book gives some techniques for making regexes more understandable, such as embedding comments, and storing pieces of a complex regex in variables and then splicing them together.
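The two readability techniques just mentioned can be sketched roughly as follows. This is my own illustration in Python, not an example from the book; the variable names (octet, dotted_quad, hosts_line) and the hosts-file style input are hypothetical.

```python
import re

# Technique 1: store pieces of a complex regex in variables,
# then splice them together into larger pieces.
octet = r"(?:25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)"      # a number from 0 to 255
dotted_quad = rf"{octet}\.{octet}\.{octet}\.{octet}"  # four octets joined by dots

# Technique 2: embed comments. With re.VERBOSE, whitespace in the
# pattern is ignored and "#" starts a comment that runs to end of line.
hosts_line = re.compile(
    rf"""
    ^\s*
    (?P<addr> {dotted_quad} )   # the address, built from the pieces above
    \s+
    (?P<host> [\w.-]+ )         # a host name
    \s*$
    """,
    re.VERBOSE,
)

good = hosts_line.match("192.168.0.1   gateway.example.com")
bad = hosts_line.match("999.1.1.1 nowhere")
```

Spliced together and printed on one line, the final pattern is just as dense as any other regex, but the source that builds it can be read a piece at a time.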

To appreciate the book, you'll need to have some programming experience, or some other background that steels you for the task of reading dense jumbles of arcane symbols. The first three chapters are an introduction to regexes. Beginners may not make it past this section, but that's okay, because the book is worth the purchase just for the first three chapters.

The second section, chapters four and five, discusses details about regexes. Luckily, the author doesn't dive into the computational theories behind regular expressions (if he did, we'd find that our regular expressions aren't so regular after all). Even so, this is pretty deep stuff. The third section, chapters six and seven, gives details about using regexes in particular tools. Chapter seven, on Perl, comprises a full third of the book.

The scripting language that's the undisputed regex champion is my favorite language, Perl. The book focuses on Perl. However, there's also reasonable coverage of other tools such as egrep, awk, Tcl, and GNU Emacs, plus some mentions of vi, sed, lex, Python, and Expect. There are many differences in the regex implementation among these tools, some obvious and others very subtle. This book is the best reference for sorting through the differences. This is especially useful if you're getting confused when using more than one of these tools at a time, or if you learned regexes in one tool and you're getting surprising results using another.

The author tells me that he's working on an update to the book because some aspects of it are dated. "The real meat of the book -- Chapters 4 and 5 -- is as valid and useful as ever, but some of the specifics have changed. Python and TCL, for example, have changed their engines to be more Perl like," said Friedl. This means that much of the Perl-specific information will be shifted to the parts of the book that aren't tool-specific.

Even accomplished script hackers are usually humbled by the masterful and comprehensive treatment of regexes that this book provides. I have been using regexes for years, and I was surprised, not so much by the advanced features that I mostly knew were there and just hadn't had the need to learn yet, but actually by the gotchas that exist even in simple expressions.

Like most O'Reilly books, this book is associated with the type of animal that is shown on the cover. The author says on his web page that this is the "Hip Owls Book," distinguishing his book from the other O'Reilly owl book, Learning the UNIX Operating System.

Tracking down further information from the author was not easy. All three URLs given in the appendix are defunct. Further searching turned up a pile of additional URLs that are also mostly defunct. Finally I found what looks like a home page for the book at http://public.yahoo.com/~jfriedl/regex/. This page includes a comprehensive list of errata, and an updated "About the Author" blurb, which mentions that Friedl went to work for Yahoo. I also found an interview of Friedl at amazon.com, done in 1997.

The bottom line? Beginners who have some appreciation for program code will benefit from reading the first three chapters and using the rest of the book as a reference when needed. Experts who read the book all the way through will come away with their tails between their legs, significantly smarter about all the things that regular expressions can do, and quite a bit more wary about how they can get themselves into trouble.


Copyright 2002, Danny R. Faught