Open Testware Reviews
Technology Bulletin: Web Link Checkers
Copyright 2004 by Tejas Software Consulting - All rights reserved.
Reviewed: 2004-November-12
Testingfaqs.org category: Static Analysis Tools
If you are involved with maintaining or testing a web site, you've
encountered broken hyperlinks before. Web sites are susceptible to
"link rot," especially if they link to external sites. To help us find
these broken links, we can use link checkers that systematically crawl
our web sites and report any problems they find.
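To make the idea concrete, here is a minimal sketch in Python of the first step any link checker has to perform: pulling the link targets out of a page. It is not taken from any of the tools reviewed here. The names LinkExtractor and extract_links are my own, and I assume that only a href and img src attributes matter.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkExtractor(HTMLParser):
    """Collect the targets of <a href> and <img src> attributes."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Assumption: only these two tag/attribute pairs are of interest.
        wanted = {"a": "href", "img": "src"}.get(tag)
        if wanted is None:
            return
        for name, value in attrs:
            if name == wanted and value:
                # Resolve relative links and drop any #fragment.
                absolute, _fragment = urldefrag(urljoin(self.base_url, value))
                self.links.append(absolute)


def extract_links(html, base_url):
    """Return the absolute URLs linked from an HTML string."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links
```

A real checker would then request each URL and record the ones that fail, which is where the tools below differ most.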
Most commercial tools that specialize in web testing include a link
checker feature. And fortunately, there are also many open source and
freeware choices for link checkers. I'll highlight a few of them here.
I experimented with quite a few free link checkers, many of which
were difficult to install. I narrowed my list down to those that are
easiest to install and use. Some tools check only a single page at a
time; none of those made the short list, because most people want to
check an entire site recursively.
Xenu's Link Sleuth is a closed source freeware program for Windows that
has several useful features. I looked at version 1.2f. Xenu was easy to
install and run. It has a GUI but no command line interface, which
limits your ability to run the tool automatically.
Some of the features include:
- Can prompt the user for a password to get into parts of the site
that require authentication.
- Multi-threaded, checking as many as 100 URLs concurrently.
- Can be configured to a maximum depth away from the start page.
- Checks ftp and gopher sites.
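The maximum-depth setting is easy to picture as a breadth-first crawl that stops following links once it is a given number of clicks away from the start page. The sketch below is only an illustration of that idea, not Xenu's code, which is closed source. The function crawl and its get_links parameter are hypothetical names of my own.

```python
from collections import deque


def crawl(start_url, get_links, max_depth):
    """Breadth-first crawl that stops max_depth links away from start_url.

    get_links is a hypothetical callable: it takes a URL and returns the
    URLs found on that page (for example by fetching and parsing it).
    Returns a dict mapping each discovered URL to its depth.
    """
    depth_of = {start_url: 0}
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        if depth_of[url] >= max_depth:
            continue  # This page is recorded, but its links are not followed.
        for link in get_links(url):
            if link not in depth_of:
                depth_of[link] = depth_of[url] + 1
                queue.append(link)
    return depth_of
```

Keeping the depth of every page in one dictionary also prevents the crawl from revisiting a page that is linked from several places.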
Be aware that the report page it generates shows banner ads. There is
no help text, but there is a good deal of slightly disorganized
documentation on the web site, plus links to two unofficial manuals.
Another user I contacted tells me that Xenu misbehaved when he ran it
against a very large web site. You may need to segment large sites into
smaller chunks, reduce the number of threads, and refer to the FAQ
entries that discuss other ways to reduce resource usage.
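For readers who want to see what "reduce the number of threads" means in practice, here is a small sketch of concurrent checking with a cap on the number of workers. It uses Python's standard thread pool and makes no claim about how Xenu is built. check_all, check_one, and max_threads are names I made up for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def check_all(urls, check_one, max_threads=10):
    """Run check_one(url) for every URL using at most max_threads workers.

    check_one is a hypothetical callable that returns True when the link
    works. Lowering max_threads trades speed for lower resource usage.
    """
    urls = list(urls)
    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        # pool.map returns results in the same order as the input URLs.
        results = list(pool.map(check_one, urls))
    return dict(zip(urls, results))
```

Segmenting a large site would amount to calling a function like this once per chunk of URLs rather than once for the whole site.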
The World Wide Web Consortium (W3C), the organization that defines
standards for the World Wide Web, also provides a variety of tools for
checking web pages for problems, including a link checker. This tool is
available as a web-hosted service, and the Perl source code is also
available to download so you can run it locally from the command line.
As you might expect for a tool sponsored by a standards body, the W3C
Link Checker is very thorough. Unlike many other hosted link checking
tools, it can search your site recursively up to a specified recursion
limit. It will prompt for any passwords needed to get into
protected parts of the site, though the dialog confusingly indicates
that the request is for the W3C web site rather than the site
requesting authentication. All pages that are password-protected get a
head-scratching message, "The link is not public. You'd better specify
it."
I don't understand many of the things that the tool complains about.
For example, it sometimes complains that an anchor has been duplicated
dozens of times, but when I look in the html source for my page I don't
see the problem. Perhaps I should take the tool's advice to use the W3C
HTML validator first, but it's enough of a hassle just keeping links up
to date. It's nice to know that it's checking the anchors along with
the links.
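I can only guess at what the tool counts as a duplicated anchor. One plausible reading is that the same name is defined more than once on a page, whether through an id attribute or through the older a name form. The sketch below looks for that and nothing more. It is my own assumption about the check, not the W3C tool's logic, and the names AnchorCollector and duplicate_anchors are hypothetical.

```python
from collections import Counter
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Count anchor names defined by id attributes and by <a name=...>."""

    def __init__(self):
        super().__init__()
        self.names = Counter()

    def handle_starttag(self, tag, attrs):
        # A tag such as <a id="x" name="x"> defines "x" only once,
        # so collect the distinct names per tag before counting.
        defined = {
            value
            for name, value in attrs
            if value and (name == "id" or (tag == "a" and name == "name"))
        }
        for anchor in defined:
            self.names[anchor] += 1


def duplicate_anchors(html):
    """Return {anchor_name: count} for names defined more than once."""
    collector = AnchorCollector()
    collector.feed(html)
    return {name: count for name, count in collector.names.items() if count > 1}
```

Running something like this against a page is a quick way to see whether a duplicate-anchor complaint has an obvious cause in the markup.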
The W3C tool checks the robots.txt file on sites before verifying any
links on the site. Sites set up this file if they don't want
automated web spiders like search engines to crawl them, and
apparently its rules apply to link checkers, too. So you may get some
messages that say "The link was not checked due to robots exclusion
rules. Check the link manually." You know that at least part of the
target site is working, since it was able to deliver the robots.txt
file. I suppose it's nice that the tool is a good citizen, though the
scripts that ignore the robots.txt file can be more thorough. The other
tools mentioned here don't check robots.txt. You could probably
configure your robots.txt file to allow the W3C Link Checker but not
other tools.
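Here is a sketch of what such a robots.txt might look like, tested with Python's standard robots.txt parser rather than with the W3C tool itself. I am assuming that the checker identifies itself with the user-agent token W3C-checklink; confirm that against the tool's documentation before relying on it. The example.com URL and the may_fetch helper are placeholders of my own.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: admit one named checker, exclude everything else.
# An empty Disallow line means "nothing is disallowed" for that agent.
ROBOTS_TXT = """\
User-agent: W3C-checklink
Disallow:

User-agent: *
Disallow: /
"""


def may_fetch(robots_txt, user_agent, url):
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "http://example.com/page.html"
    print(may_fetch(ROBOTS_TXT, "W3C-checklink", page))
    print(may_fetch(ROBOTS_TXT, "SomeOtherBot", page))
```

Keep in mind that robots.txt is only a request. As noted above, tools that ignore the file will crawl the site regardless of what it says.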
This link checker is not very fast. Not only is it single-threaded, but
it also imposes a 1-second delay between each link it checks. I imagine
that this reduces the load on their web server, but I would probably
try to remove this feature if I downloaded the tool myself and decided
to use it long-term.
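The cost of that delay is simple arithmetic: a single-threaded checker that pauses one second between links needs at least 599 seconds, about ten minutes, of pure waiting for 600 links, before counting any network time. The sketch below shows the general pattern with the delay as a parameter. It is not the W3C tool's Perl code, and check_politely, check_one, and delay_seconds are names invented for the example.

```python
import time


def check_politely(urls, check_one, delay_seconds=1.0, sleep=time.sleep):
    """Check URLs one at a time, pausing between consecutive requests.

    check_one is a hypothetical callable that returns True when the link
    works. The sleep parameter exists so the pause can be replaced in tests.
    """
    results = {}
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay_seconds)  # Pause before every request except the first.
        results[url] = check_one(url)
    return results
```

Setting delay_seconds to 0 gives the behavior I would want for my own sites, while a nonzero value is kinder to servers that belong to someone else.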
Checkbot is a fairly straightforward link checker, implemented in Perl.
You will likely need to install a few additional Perl modules before
you can install it. I successfully installed it on Windows XP and
Debian Linux 3.0. It has a command-line interface and writes its output
into files in html format. It cannot authenticate its way past pages
that ask for a password.
It can be a bit frustrating waiting for the script to finish if you
don't have any visible blinky lights to show network traffic. You can
use the --verbose option to watch the status. The results file is
sorted into sections, so it's difficult to track progress by watching
the file grow.
Checkbot seems to be single-threaded, but it doesn't force any delays
between URLs. You cannot limit the recursion depth; it will always
spider the entire site.
Checky Plug is a plugin for Netscape, Mozilla, and Firefox that feeds
the page you're currently viewing into any of a long list of hosted web
page analyzers, including three different link checkers. It can launch
the W3C link checker, the WDG Link Valet, and the WebThing Link Valet,
which looks just like the WDG Link Valet.
All it really does is enter the URL of the page you're currently
viewing into the appropriate form for the online service you choose.
But it can be useful to explore the broad range of analyzers that it
makes available, which includes html validators, a meta tag analyzer,
accessibility analyzers, and more. If you don't have enough complaints
about your web page now, these tools give you more than you ever wanted
to hear.
The bottom line
I like Xenu for its impressive speed, and I'll use it to shake out a
long list of rotted links on my sites. I'm concerned that it might
break down on a really large site, though. Checkbot may be a good place
to start if you want a command line interface and don't need to access
password-protected pages. Step up to the W3C Link Checker if you want
to be very thorough and if you don't mind messing with robots.txt
rules.
Use Checky Plug for instant gratification to check pages you're
currently viewing, and to explore lots of other types of checking you
can do against your web pages.
None of these tools can parse JavaScript code. As far as I can tell,
you'll need a commercial tool to check links generated by JavaScript.
If someone were motivated to do it, it should be feasible to integrate
an open source JavaScript interpreter into one of these link checkers.
All of these tools are vulnerable to false hits on sites that are
temporarily down. I don't really want to hear about those. Ideally, I'd
like a tool that monitors my sites and only reports broken links that
have been broken for a specified number of days in a row.
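The filtering I have in mind could be layered on top of any checker that runs daily and logs its failures. Below is a minimal sketch of the idea, assuming a history structure that maps each URL to the set of dates on which its check failed. That structure and the function name persistently_broken are hypothetical; none of the tools reviewed here provide them.

```python
from datetime import date, timedelta


def persistently_broken(history, today, min_days):
    """Return URLs that failed on each of the last min_days days.

    history maps a URL to the set of dates on which its check failed.
    A single good day within the window is enough to drop a URL.
    """
    recent = [today - timedelta(days=offset) for offset in range(min_days)]
    return sorted(
        url
        for url, failed_on in history.items()
        if all(day in failed_on for day in recent)
    )


if __name__ == "__main__":
    # Made-up data purely for illustration.
    log = {
        "http://a.example/": {date(2004, 11, 10), date(2004, 11, 11), date(2004, 11, 12)},
        "http://b.example/": {date(2004, 11, 12)},
    }
    print(persistently_broken(log, date(2004, 11, 12), 3))
```

A site that was down only today would be left out of the report until it had failed for the full number of days.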