An economy of scale exists for programming tools just like end-user software. Tools with extremely broad application are typically sold over-the-counter. More specialized tools are commonly developed and distributed within a company, and still more specialized tools may be developed by and for a single group of programmers, or perhaps even a single programmer.
The typical software organization uses tools developed at all levels of economy. The most common and well-defined problems are usually handled by commercial or public-domain tools or through large-scale internal development efforts. But shrink-wrapped solutions are usually not sufficient by themselves. Specialized tooling is often required when gaps are found between two such primary tools, or between the primary tools and the humans who must work with them. Typically, filling these gaps involves reorganizing, filtering, or otherwise processing output from some tool or process, or manipulating a tool in some fashion. These specialized tools can be fairly large and implement significant functionality or they can be small and simple, the product of one engineer. This range of complexity -- between the massive, shrink-wrapped product and the three-line shell script -- is the focus of this paper.
Some typical examples of tooling at this level are:
o Report generation and other data munging
Raw data from tools that are part of build, test, benchmark, and other processes often needs to be processed before it is fit for human consumption. Even if some report-generating capacity is built into the tools, it is rare for the authors of a tool to anticipate the needs of their users completely.
o Tool integration
Often the raw data from one tool is used as input to another. Rarely are these fits seamless. Usually, some additional tooling is required to make the square peg fit into the round hole.
o Setup and verification
Tools are often written to check for or create certain conditions that
are necessary before some process can be initiated.
Typically -- either through policy or by custom -- an organization works with a relatively small set of tools. In order to meet the organization's broad tooling needs, these tools must be widely available, widely applicable, and must harmonize well with the organization's other tools and development environment.
Perl is a long-standing member of the toolbox of the organization we work in.[1] Here is a brief assessment of Perl's strengths in light of those characteristics.
o Perl is a Living Language
Perl is a popular, thriving language whose user base continues to grow year after year. This kind of support ensures that Perl is and will continue to be widely available. The Perl source compiles with minimal effort on most Unix(r) systems as well as VMS, QNX, Amiga DOS, OS/2, and Plan 9. Ports are available for MacOS, Windows 3.1, Windows95, and WindowsNT. It is packaged with many Unix(r) OSes, and is commercially supported by several companies.
Perl's wide grassroots support also ensures that there is a large knowledge base to draw from. Many books have been published on Perl and it is prominently featured in many others. Many web sites offer excellent sources of information.
Another effect of Perl's popularity is the availability of Perl libraries and modules. CPAN -- the Comprehensive Perl Archive Network, Perl's distributed software archive -- contains hundreds of actively supported modules. These modules implement data structures and development tools, provide interfaces to various network protocols, allow transparent access to commercial databases, automate web and systems administration tasks, and serve many other useful purposes.
o Perl is an Eclectic Language
Perl draws heavily from C, sed, awk, sh, and other popular utilities. This has several important implications. First, for engineers familiar with those tools, Perl's learning curve is smaller than it would be otherwise. Such cultural congruence is crucial when integrating a new general-purpose tool into an existing development environment.
Another implication is that Perl can often supplant traditional utilities. Of course, it can't completely supplant them all. Each has a niche in which it is very effective. Nevertheless, Perl generally provides a solid replacement for many standard tools, and will therefore have a kind of unifying effect on a tool base that is extremely important from a maintenance perspective.
Perl also supports multiple programming paradigms. Perl contains all of the facilities necessary to support traditional structured programming. Perl 5 added many object-oriented constructs to the language as well as support for certain functional programming constructs such as closures.
o Perl is a Flexible and Intelligent Language
Perl allows programming at various levels of rigor and rigidity. It provides facilities for enforcing policies such as parameter checking in functions, but does not force their use. When such facilities are not used, Perl tries to Do The Right Thing via heuristics and intelligent defaults. Very often it succeeds.
This flexibility makes Perl very good for rapid development, prototyping, and personal tooling on one hand, and for writing highly important applications on the other.
Now that we have reviewed Perl's strengths, let us examine its weaknesses.
o Perl is a Living Language
The fact that Perl is a living language means that it is, to some degree, a moving target. This is not as true as it used to be, though. Due at least in part to the relative ease of extending and transforming the language, Perl 5 will be around in its current form for quite some time.
Another item to consider is the fact that the modules that are not bundled with Perl are also a moving target -- much more so than Perl itself -- and present all of the challenges that come with dynamically loaded versioned libraries. On the other hand, this is the way that more and more software is written, so Perl is not alone in this respect.
o Perl is an Eclectic Language
In general, Perl's eclecticism can require a greater burden of knowledge on the part of the programmer. It is important that the programmer know all of the features at her disposal, so that she can make good design decisions. And often there are a great many to choose from.
In terms of programming paradigms, Perl's eclecticism implies that no secondary paradigm is completely supported. Many object-oriented features that are built into other OO languages must be written explicitly in Perl, for instance.
o Perl is a Flexible and Intelligent Language
The fact that Perl is often able to Do The Right Thing based on incomplete information implies that sometimes it Does The Wrong Thing. And though Perl does provide some tools for rigorous programming, it definitely prefers more relaxed styles, and this tendency does bias it toward certain tasks and away from others.
Our experience demonstrates that Perl's strengths greatly outweigh its
weaknesses in many environments. This is especially true for the crafting
of tools in a software development environment. The following section gives
some suggestions that allow you to maximize its strengths and minimize
its problems.
But one must use object-oriented constructs knowledgeably, or they can reduce both performance and clarity. Here are some tips for using Perl 5 in an object-oriented fashion:
o Define classes at an appropriate level of abstraction
This is true when writing code in any object-oriented language. The point here is that what is "appropriate" in other languages may not be appropriate for Perl programs. When designing class hierarchies, it is important to keep Perl's strengths in mind.
For instance: it may not be appropriate to use an object to model small records collected in a file. While entirely acceptable in C++, such an approach does not suit Perl. Its mechanisms are probably too heavyweight to work well at that level of granularity. For a traditional hash-based object, the overhead involved in hash accesses and method calls is prohibitive if all you are doing is getting and setting fields in the object. Storing the records in this fashion might take five times as much memory as storing the records as lines of text.
If the tasks for which you need the records take a great deal of time or use lots of memory, or if you just happen to have a surplus of either, this may not matter. (For instance, if the files contain names of remote sites that you need to transfer data to, it probably won't matter that you could be reading the records more efficiently; the time it takes to transfer files is much more significant.) Further, certain approaches to designing the class could help reduce memory use or lessen the performance penalty. Even so, the value of such undertakings is dubious.
But it might be worthwhile to model the entire collection as a Perl object. The object's methods define operations that are commonly performed on the set -- searching, perhaps, or sorting. This allows the programmer to make the best use of Perl's facilities, many of which are geared toward manipulating arrays of strings. In addition, the object could provide low-cost methods to access the individual entries or their parts.
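A minimal sketch of this idea follows. The class name RecordSet, the colon-separated record layout, and the method names are all assumptions made for illustration; they do not come from any existing module. The records stay as plain strings in one array, and the object supplies the set-wide operations.

```perl
use strict;
use warnings;

package RecordSet;    # hypothetical class name

# Constructor: keep the records as plain lines of text in a single array.
sub new {
    my ($class, @lines) = @_;
    my $self = { LINES => [@lines] };
    return bless $self, $class;
}

# Search: return the records that match a pattern, using grep directly.
sub search {
    my ($self, $pattern) = @_;
    return grep { /$pattern/ } @{ $self->{LINES} };
}

# Sort: return the records ordered by one colon-separated field (0-based).
sub sorted_by {
    my ($self, $field) = @_;
    return sort { (split /:/, $a)[$field] cmp (split /:/, $b)[$field] }
           @{ $self->{LINES} };
}

# Low-cost access to one part of one record, split only on demand.
sub field {
    my ($self, $index, $field) = @_;
    return (split /:/, $self->{LINES}[$index])[$field];
}

package main;

my $set = RecordSet->new( "fuzz:hp.com:17", "angst:logic.com:2" );
print join( "\n", $set->search('logic') ), "\n";    # angst:logic.com:2
print $set->field( 0, 2 ), "\n";                    # 17
```

Only one hash and one blessed reference exist no matter how many records are stored, so the per-record cost stays close to that of the raw text.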
o Classes must be very well documented
Perl is no language for despots. Perl's approach to object-oriented programming is consistent with this. There is no automatic parameter type-checking for methods (though there is for free subroutines). There is no built-in facility for creating private variables in an object. There is no reasonable way to prevent a user from calling a particular method.
The issue goes beyond preventing errors or lessening the effects of carelessness on the part of Perl programmers. Putting constraints on a user's access to functions or data is a way of defining the interface to the class ("these are the functions you will need in order to use this object"). It makes explicit a certain part of the contract between the user of the class and its author ("thou shalt not access C::foo").
Perl relies on documentation to fill these roles. Documentation is the only sure way for users to know which methods they should call and which they shouldn't, what parameters those methods take, and what object data should be left alone.
o Classes should be conventional
This is not to say that your classes shouldn't do radical things, merely that they should conduct their radical business with some degree of decorum. Convention is another tool the Perl community uses to help lubricate relations between author and user. Hence, your classes should follow these conventions unless there is a very good reason not to. Many conventions are simple and are present purely for the user's comfort:
- An object's constructor should be named "new".
- Use leading underscores to indicate internal methods and "private"
data.
Others are more critical because they affect how well the code works in an object-oriented milieu:
- Don't access class data directly from an object method.
- Don't attempt to explicitly verify the type of an object reference.
Ignoring such conventions prevents your classes from being properly
inheritable or from functioning in other contexts. The Blue Camel book[2]
lists many such conventions, as does Tom Christiansen's Perl tutorial,
which can be accessed in later Perl distributions via "man perltoot". The
Perl Modules File, which can be found at the Perl Language Home Page listed
in the References section, has information on conventions used when writing
Perl modules.
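The conventions above might be followed as in this small sketch. The Counter and Counter::Loud classes are hypothetical and exist only to show the shape: a constructor named "new", leading underscores on internal names, class data reached through a method, and no explicit check of the object's type.

```perl
use strict;
use warnings;

package Counter;    # hypothetical example class

my $Instances = 0;  # class data, reached only through the accessor below

# Class-data accessor. Object methods call this instead of touching
# $Instances directly, so a subclass can override it.
sub _instances {
    my $class = shift;
    $Instances = shift if @_;
    return $Instances;
}

# Conventionally named constructor. The two-argument bless keeps it
# inheritable; nothing here tests for a particular class name.
sub new {
    my ($class, %args) = @_;
    my $self = bless { _count => $args{start} || 0 }, ref($class) || $class;
    $self->_instances( $self->_instances() + 1 );
    return $self;
}

# Public method; "_count" is marked private by its leading underscore.
sub increment {
    my $self = shift;
    return ++$self->{_count};
}

package Counter::Loud;    # subclass that inherits new() unchanged
our @ISA = ('Counter');

sub increment {
    my $self = shift;
    my $n    = $self->SUPER::increment();
    return "count is now $n";
}

package main;

my $c = Counter::Loud->new( start => 5 );
print $c->increment, "\n";    # count is now 6
```

Because new() blesses into whatever class it was invoked on, Counter::Loud needs no constructor of its own.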
In many cases, solving efficiency problems is a matter of knowing what facilities Perl offers. I once saw a discussion on Usenet that involved a bit of code like this:
    for ( 0 .. $number ) { $array[$_] = int(rand 500) + 1; }
    @array = sort bynum @array;
    sub bynum { $a <=> $b; }
In straightforward situations like this, there isn't much room for optimization in the normal sense of the word, but efficiency improvements are possible. Three or four were suggested in follow-up posts, but the most effective improvement (on most platforms) is simply to tell Perl that it can "use integer" -- that is, represent the numbers inside a given block as integers rather than floating point numbers.
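As a sketch of the "use integer" suggestion, the pragma can be confined to the block that does the arithmetic. The value of $number here is an assumption chosen to keep the example small; no timing figures are implied.

```perl
use strict;
use warnings;

my $number = 9;    # assumed upper index, for illustration only
my @array;

{
    use integer;    # integer arithmetic inside this block only
    for ( 0 .. $number ) { $array[$_] = int( rand 500 ) + 1; }
    @array = sort { $a <=> $b } @array;
}

print "@array\n";
```

The pragma is lexically scoped, so code outside the braces keeps Perl's default floating point arithmetic.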
Perl's motto ("There's more than one way to do it") is true for even simple problems. It follows, then, that there are exponentially many ways to perform complex tasks. The more you know about your choices at each step on the way, the greater your chances of finding a good solution.
o Learn a bit about regular expressions
Writing regular expressions is sometimes a tricky business. Perl's regular expression engine is fairly smart and performs many useful optimizations automatically, but nothing beats knowing a bit about how regular expression engines work. If you find that your code relies heavily on regular expressions, a little study will definitely pay off. Mastering Regular Expressions by Jeffrey Friedl is an excellent source. See the Resources section for bibliographic information.
o Use the right tool in the tool for the tool...
There are certain tasks for which Perl may not be the best tool. Sometimes a lower-level programming language may be best for the task. At other times, you may need to access libraries written in C or C++.
This does not imply that you should rule out the possibility of using Perl, particularly if Perl has features relevant to your needs. You might consider (a) extending Perl by creating a module that integrates Perl with compiled C/C++ code, or (b) embedding Perl into your C/C++ application by making calls directly to the Perl API. Tools are available to help you perform both of these tasks. See the Resources section for details.
o Learn a little about Perl internals
It is not necessary to become intimate with the Perl source base in
order to write efficient Perl, but sometimes it helps to know what is going
on underneath the hood. This is especially true if you are trying to embed
or extend Perl. Several good introductions to Perl internals exist, including
the last few chapters in Advanced Perl Programming[3].
- Complex regular expressions
- References and heterogeneous data types
But reading Perl[4] does not have to be a painful experience. Perl offers tools to assist in writing readable source. For instance:
- Consider constructing large regular expressions from variables containing smaller ones. Regular expressions can be largely self-documenting if these variables have mnemonic names. You can use /o to ensure that the expressions are compiled only once, so that there are no significant performance penalties.
- If that is not good enough, consider using /x in your regular expressions. This allows you to insert whitespace and comments into the expression.
- Perl has some flexibility with regard to punctuation. Use braces and parentheses liberally when it is important to be unambiguous. Drop them when they are unnecessary. This policy is especially recommended when working with data structures that use references. There's no reason to say anything like this
    if ( ${$foo->{"bar"}}[0] ) { $i = 0; }
or even
    if ( $foo->{"bar"}->[0] ) { $i = 0; }
when
    $i = 0 if $foo->{bar}[0];
will do.
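The first two suggestions can be combined, as in the sketch below. The variable names and the "number list" format are assumptions for illustration: small patterns are kept in mnemonically named strings, then assembled with /x for layout and comments and /o so the assembled pattern is compiled only once.

```perl
use strict;
use warnings;

# Small, named pieces (the names are illustrative).
my $re_int   = '-?\d+';
my $re_range = "$re_int-$re_int";
my $re_item  = "(?:$re_range|$re_int)";

# True if $text is a comma-separated list of numbers and ranges.
sub is_number_list {
    my ($text) = @_;
    return $text =~ m/
        \A \s*
        $re_item                     # first number or range
        (?: \s* , \s* $re_item )*    # further comma-separated items
        \s* \z
    /xo;
}

print is_number_list("2, 17-20") ? "ok\n" : "not a number list\n";
```

Note that /o is only safe here because the three pattern variables never change after they are set.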
Perl has many facilities that make parsing files very easy. Consider a simple case in which we need to read a setup file for a tool we are developing. Suppose all we need is the ability to attach a value to a variable. This is a small personal utility, so to make things easy on ourselves, we decide that the setup file can have one declaration per line, that the variables must be separated from values by an equals sign, and that we will ignore any lines that do not fit those criteria. We want to parse files that look like this:
    foo = 2
    bar = none
This kind of task is trivial in Perl. We might write this portion of the program as follows:
    open F, $fn or die "Cannot open setup file $fn: $!";
    while (<F>) {
        $opts{$1} = $2 if /^\s*(\w+)\s*=\s*(\S+)\s*$/;
    }
Now let's say that our script has found a niche and is being used by more and more people in our group. We are now mildly embarrassed by our hackery, and wish to make our script more robust. First, we want it to report syntax errors. We would like it to only accept a fixed list of keywords, and we want it to warn us if an assignment is made to the same keyword more than once in the setup file:
    my (%dict,%opts,@kws);
    @kws = qw (
        autoindent mesg      slowopen
        autoprint  modelines tabstop
        autowrite  number    taglength );
    @dict{@kws} = ();
    open F, $fn or die "Cannot open setup file $fn";
    while (<F>) {
        if ( /^\s*(\w+)\s*=\s*(\S+)\s*$/o ) {
            print STDERR "Warning! $1 redefined!\n" if exists $opts{$1};
            if ( exists $dict{$1} ) {
                $opts{$1} = $2;
            } else {
                print STDERR "Error: unknown var \"$1\" on line $..\n";
            }
        }
    }
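The requirements above also mention reporting syntax errors. One way this might be folded in is sketched below; the function name read_setup, the message wording, and the choice to collect messages in an array rather than print them are assumptions made so the example is self-contained. It reads from an in-memory filehandle instead of a real setup file.

```perl
use strict;
use warnings;

# Parse setup text from a filehandle. Returns (\%opts, \@messages).
sub read_setup {
    my ($fh, @keywords) = @_;
    my (%dict, %opts, @msgs);
    @dict{@keywords} = ();
    while ( my $line = <$fh> ) {
        next if $line =~ /^\s*$/;    # blank lines are not errors
        if ( $line =~ /^\s*(\w+)\s*=\s*(\S+)\s*$/ ) {
            my ($var, $val) = ($1, $2);
            if ( !exists $dict{$var} ) {
                push @msgs, "Error: unknown var \"$var\" on line $.";
                next;
            }
            push @msgs, "Warning: $var redefined on line $."
                if exists $opts{$var};
            $opts{$var} = $val;
        } else {
            push @msgs, "Error: syntax error on line $.";
        }
    }
    return (\%opts, \@msgs);
}

my $text = "tabstop = 8\nnumber\nfoo = 2\ntabstop = 4\n";
open my $fh, '<', \$text or die "Cannot open in-memory file: $!";
my ($opts, $msgs) = read_setup( $fh, qw(autoindent number tabstop) );
print "$_\n" for @$msgs;
print "tabstop is $opts->{tabstop}\n";
```

With the sample text, line 2 is reported as a syntax error, line 3 as an unknown variable, and line 4 as a redefinition.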
Most people are well aware of Perl's aptitude at tasks like this.
But what about the next level of complexity? Presume we need to read a
file format in which whitespace is generally not significant and the syntactical
structures are more complex than in our previous examples. This new file
format is designed to associate attributes with filenames, as follows:
    /src/util/foo.c: installed == 1; entries < 150; err_lines == 2, 17-20 ; signature != /foobar/; machines == /\.rsn\.hp\.com$/, "fuzz-e.logic.com", angst;
The grammar for such a format might be expressed as follows:
    <file>      := <filename> : <decl-list>
    <decl-list> := <decl> | <decl> <decl-list>
    <decl>      := <ident> <op> <predicate> ;
    <predicate> := <item> , <predicate> | <item>
    <item>      := <regexp> | <quoted> | <number> | <ident>
    <number>    := <integer> | <integer> - <integer>
Is Perl the right tool to choose? It depends. At this level of complexity, a parser written using lex and yacc or hand-rolled in C would probably be significantly faster. But if for our purposes programmer efficiency and maintainability are more important, and if the rest of the program -- what is actually done with the information from the parsed file -- is best written in Perl, then the balance might tip toward Perl if reasonable solutions are possible in Perl.
In fact, several programmer-efficient and highly maintainable solutions are possible in Perl. Here is one approach:
- Define all of the tokens as regular expression strings global to the parser's package. The tokens should have parentheses surrounding the portions with semantic importance:
    my $re_ident = '([A-Za-z][A-Za-z0-9_-]*)';
    my $re_op    = '([!=<>]=|<|>)';
- Declare a global structure that contains a reference to the string being parsed and the index of our current location in the string, as well as an array to hold data extracted from the token regular expressions.
my $p_rec = {
SRC => undef,
PREV => 0,
V => [],
};
- Write a small helper function "&match" that:
    - Accepts a regular expression as a parameter
    - Checks to see if the regexp given is next in the file, using Perl's \G regexp assertion
    - Uses Perl's pos function to keep track of where we are in the file
    - Stores data captured by the parentheses
Here's a simple example:
    sub match {
        my ($p_rec, $re) = @_;
        pos ${$p_rec->{SRC}} = $p_rec->{PREV};
        if ( ${$p_rec->{SRC}} =~ m/\G$re/g ) {
            $p_rec->{PREV} = pos ${$p_rec->{SRC}};
            $p_rec->{V} = [ $1,$2,$3,$4,$5 ];
            return 1;
        } else {
            return 0;
        }
    }
- Write a second small helper function "&parse_err" which:
    - Takes a message as a parameter, indicating what token was expected
    - Calculates the line number given the current position in the string
    - Prints an error message, along with a small context string, e.g.:
        Error on line 2 near "var1 === foo": expected "==" or "!=".
    - Terminates the program.
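The text does not show the body of the error helper (called &parse_err in the parsing code), so the following is only one way it might be written. The helper name err_message, the 20-character context window, and the exact message layout are assumptions; the record layout (SRC, PREV) follows the structure declared earlier.

```perl
use strict;
use warnings;

# Build the message separately so it can be inspected without exiting.
sub err_message {
    my ($p_rec, $expected) = @_;
    my $src = $p_rec->{SRC};
    my $pos = $p_rec->{PREV};

    # Line number = 1 + number of newlines before the current position.
    my $before = substr( $$src, 0, $pos );
    my $line   = 1 + ( $before =~ tr/\n// );

    # Context: up to 20 characters from the failure point, cut at the
    # end of the line, with leading whitespace dropped.
    my ($context) = substr( $$src, $pos, 20 ) =~ /^\s*([^\n]*)/;

    return "Error on line $line near \"$context\": expected $expected.\n";
}

# Report the error and terminate the program.
sub parse_err {
    die err_message(@_);
}

my $src = "a == 1;\nvar1 === foo;\n";
my $rec = { SRC => \$src, PREV => 8, V => [] };
print err_message( $rec, '"==" or "!="' );
```

Counting newlines with tr is cheap enough here because it only runs once, on the error path.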
Armed with this, writing the parsing code is straightforward. The code itself is highly maintainable. Here's a sample:
    sub _parse_predicate {
        my $state = 0;
        my $decl = { TYPE => "DECL", VAL => [], };
        for (;;) {
            if ( $state == 0 ) {
                my $type;
                if    ( &match($p_rec,$re_re) )     { $type = "RE"; }
                elsif ( &match($p_rec,$re_quoted) ) { $type = "QUOTED"; }
                elsif ( &match($p_rec,$re_number) ) { $type = "NUMBER"; }
                elsif ( &match($p_rec,$re_ident) )  { $type = "IDENT"; }
                else  { &parse_err($p_rec,"literal or regexp"); }
                push @{$decl->{VAL}}, {TYPE => $type, ITEM => $p_rec->{V}[0],};
                $state++;
            } else {
                if ( &match($p_rec,$re_comma) ) { $state = 0; }
                else { return $decl; }
            }
        }
    }
Perl's regular expression facilities are powerful enough that the details
of tokenization are hidden from us with relatively little effort.
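To see the pieces working together, here is a small self-contained run over one comma-separated value list. The token patterns are assumptions (the text does not give $re_number, $re_quoted, or $re_comma), each one absorbs its own leading whitespace, and parse_values is a simplified stand-in for the predicate parser rather than the original code.

```perl
use strict;
use warnings;

# Assumed token patterns; each skips leading whitespace itself.
my $re_number = '\s*(-?\d+)';
my $re_ident  = '\s*([A-Za-z][A-Za-z0-9_-]*)';
my $re_quoted = '\s*"([^"]*)"';
my $re_comma  = '\s*,';

# Try one token at the current position; advance only on success.
sub match {
    my ($p_rec, $re) = @_;
    pos( ${ $p_rec->{SRC} } ) = $p_rec->{PREV};
    if ( ${ $p_rec->{SRC} } =~ m/\G$re/g ) {
        $p_rec->{PREV} = pos ${ $p_rec->{SRC} };
        $p_rec->{V}    = [ $1, $2, $3, $4, $5 ];
        return 1;
    }
    return 0;
}

# Collect a comma-separated list of typed items.
sub parse_values {
    my ($p_rec) = @_;
    my @items;
    for (;;) {
        my $type = match( $p_rec, $re_quoted ) ? "QUOTED"
                 : match( $p_rec, $re_number ) ? "NUMBER"
                 : match( $p_rec, $re_ident )  ? "IDENT"
                 : die "expected a literal at offset $p_rec->{PREV}\n";
        push @items, { TYPE => $type, ITEM => $p_rec->{V}[0] };
        last unless match( $p_rec, $re_comma );
    }
    return \@items;
}

my $text  = '2, "fuzz-e.logic.com", angst';
my $p_rec = { SRC => \$text, PREV => 0, V => [] };
my $items = parse_values($p_rec);
print "$_->{TYPE}: $_->{ITEM}\n" for @$items;
```

Because a failed match leaves PREV untouched, the alternatives can simply be tried in order without any explicit backtracking code.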
It is, however, slower than it might be. There are several options we might consider to speed the program up. We could reduce the amount of alternation built into the grammar, for instance. But the real bottleneck is the function "&match". Each time it is called, the regular expression code must interpolate the string passed in as a parameter, compile the regular expression, and then attempt the match.
Ideally, we would like to compile the regular expressions once, at the beginning of the program, and pass those compiled expressions to &match. But in Perl, regular expressions are not (yet!) "first-class" objects which can be manipulated in this fashion. Perl does provide the "/o" modifier, which allows the user to tell Perl that the variables in a particular expression will not change, so that Perl compiles the expression only once. But in our case, the variable $re in &match does change, so "/o" doesn't help us here.
One alternative might be to attempt a match explicitly at each spot where &match is called in our current code. Unfortunately, this involves building in the code to manipulate the variable that keeps track of the matching position, which is especially annoying in the case of alternations, e.g.:
    elsif ( pos $$source = $last and $$source =~ /\G(-?(?:\d*\.)?\d+)/g ) {
        $last = pos $$source;
        $type = "NUMBER";
    }
While much faster, this is not nearly as clean as the original code, even
if we were to use our $re_number variable and "/o" in place of the text
in the regular expression.
But as is so often the case, Perl's resources are vast enough that we can find an acceptable compromise. Instead of calling a generic match function, we can use Perl's eval() to create specific match functions for each token (or one generic function with a case statement which has a case for each token). Instead of declaring string variables to represent the tokens, we could use these functions instead:
    my $match_ident = &gen_mfunc('([A-Za-z][A-Za-z0-9_-]*)');
    my $match_op    = &gen_mfunc('([!=<>]=|<|>)');
We would invoke these functions as follows:
if ( &$match_re($p_rec) ) { $type = "RE"; }
elsif ( &$match_quoted($p_rec) ) { $type = "QUOTED"; }
...
This solution is just as clean as our original idea, but offers performance
more like that using explicit matches with fixed strings, and all without
substantial changes to the code.
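The body of &gen_mfunc is not given in the text; the following is one way it might be written, assuming the same record layout (SRC, PREV, V) as &match. Each call evals a fresh copy of the matcher, so the "/o" inside it caches the compiled pattern for that one token only.

```perl
use strict;
use warnings;

# Hypothetical generator: returns a matcher specialized for one token.
sub gen_mfunc {
    my ($re) = @_;

    # A string eval compiles a new sub on every call, so each generated
    # matcher has its own match operator and its own /o cache.
    my $func = eval q{
        sub {
            my ($p_rec) = @_;
            pos( ${ $p_rec->{SRC} } ) = $p_rec->{PREV};
            if ( ${ $p_rec->{SRC} } =~ m/\G$re/go ) {
                $p_rec->{PREV} = pos ${ $p_rec->{SRC} };
                $p_rec->{V}    = [ $1, $2, $3, $4, $5 ];
                return 1;
            }
            return 0;
        }
    };
    die "gen_mfunc: $@" if $@;
    return $func;
}

my $match_ident  = gen_mfunc('([A-Za-z][A-Za-z0-9_-]*)');
my $match_number = gen_mfunc('\s*(\d+)');

my $text  = "abc 42";
my $p_rec = { SRC => \$text, PREV => 0, V => [] };
print "ident: $p_rec->{V}[0]\n"  if $match_ident->($p_rec);
print "number: $p_rec->{V}[0]\n" if $match_number->($p_rec);
```

A plain closure would not do here: closures cloned from one piece of source share a single match operator, so "/o" would freeze whichever pattern happened to run first.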
Have we reached the limits of Perl's ability yet? No. Although for substantially
more complex grammars we might continue to entertain the possibility of
using some other tool to implement the program, Perl has many resources
that continue to make it a contender. For instance, modules such as Parse::Lex
and Parse::RecDescent are available that make specifying complex parsers
in Perl easier. On the horizon is a yacc back-end, which would generate
parsers in C, compile them, and link the resulting object files into a
Perl module, thereby giving Perl users the best of both worlds.
If you do not have Perl 5 on your system, setting it up is not terribly difficult. Be careful: there are many paths that could have been used to install Perl, so if you don't find it in one place, look elsewhere, such as /usr/bin, /usr/local/bin, /usr/contrib/bin, and /opt/perl5/bin. The standard installation path for Perl is /opt/perl5.
You'll find the latest Perl 5 sources at:
http://www.perl.com/CPAN/src/5.0/latest.tar.gz.
After you download the sources, follow the steps below. It took all of 12 minutes on an HP V-2200 server.
Sometimes you will see a test failure or two. For example, the comp.cpp test fails in Perl 5.004_04 on HP-UX 11.0 - Perl has not yet caught up with some compiler changes. In this case, you can get a bogus error in the rare case that you use Perl's -P flag. Check with the comp.lang.perl.misc newsgroup if you get a test failure that you cannot resolve.
At this point, you might want to install some extra modules from CPAN, the Comprehensive Perl Archive Network. The main URL for CPAN is http://www.perl.com/CPAN/. Always check here before developing a new module. You'll find a vast array of modules that is constantly growing. Beware, though, that many of the modules are alpha quality or worse, and they may not be ported to HP-UX. Read the fine print in the Perl 5 Module List at http://www.perl.com/CPAN/modules/00modlist.long.html. If you do install extra modules, be careful not to delete them when you install a newer version of Perl.
So why is there no discussion of Perl 4, or Perl 6 for that matter?
Perl 5 seems to be a creature of its own; a derivative of Perl 4 as much
as C++ is derived from C. Yes, Perl 4 is still available, and Perl 4 scripts
that aren't quite compatible with Perl 5 probably still exist. But use
Perl 5 if at all possible. Unless, of course, you're reading this in the
distant future, and a Perl 6 or later is available. But Perl 6 is going
to be a long time coming. The core parts of Perl have been simplified and
much has moved into the modules, so many changes, extensions, and modifications
can be made without requiring new releases of the Perl core.
Particular items of interest include:
- http://genome-www.stanford.edu/perlOOP/
contains some information on writing Object-Oriented Perl
- http://www.perl.org is The Perl
Institute's home page
Other valuable O'Reilly books include:
Advanced Perl Programming, Sriram Srinivasan: This book contains informative discussions related to object-oriented programming in Perl and to extending and embedding Perl, and it provides an excellent introduction to Perl internals.
Mastering Regular Expressions, Jeffrey Friedl: An excellent source of information on regular expressions in general. Contains valuable information on writing regular expressions in Perl specifically.
Other Information
For a more detailed but somewhat dated discussion of some of the topics in this paper, see "Tester's Toolbox: Using Perl Scripts" by Danny Faught, Software QA Magazine, Vol. 2, No. 3, 1995.
----------------------------------------------------------------------
Footnotes:
1) Integration, Test, and Delivery at Hewlett-Packard in Richardson, TX.
2) A common name for Programming Perl, 2nd edition. See the References section for more details.
4) That is to say, reading other people's Perl programs. One's own Perl is always entirely comprehensible.