Open Testware Reviews

Technology Bulletin: Web Link Checkers

Copyright 2004 by Tejas Software Consulting - All rights reserved.

Reviewed: 2004-November-12
Testingfaqs.org category: Static Analysis Tools

If you are involved with maintaining or testing a web site, you've encountered broken hyperlinks before. Web sites are susceptible to "link rot," especially if they link to external sites. To help us find these broken links, we can use link checkers that systematically crawl our web sites and report any problems they find.

Most commercial tools that specialize in web testing include a link checker feature. And fortunately, there are also many open source and freeware choices for link checkers. I'll highlight a few of them here.

I experimented with quite a few free link checkers, many of which were difficult to install. I narrowed my list down to those that are easiest to install and use. Tools that check only a single page at a time did not make the short list, because most people will want to check an entire site recursively.
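To make the idea of recursive checking concrete, here is a minimal Python sketch of the step every recursive checker has to perform: pulling the links out of one page and deciding which of them belong to the same site. It does not describe how any of the reviewed tools work internally. The names LinkExtractor, extract_links, and is_internal are made up for this illustration, as are the example.com addresses.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkExtractor(HTMLParser):
    """Collect the href values of the <a> tags on one page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_links(page_url, html):
    """Return absolute http(s) links from the page, fragments removed, no repeats."""
    parser = LinkExtractor()
    parser.feed(html)
    links = []
    for href in parser.hrefs:
        # Resolve relative links against the page, then drop any #fragment.
        absolute, _fragment = urldefrag(urljoin(page_url, href))
        if urlparse(absolute).scheme in ("http", "https") and absolute not in links:
            links.append(absolute)
    return links


def is_internal(page_url, link):
    """True when the link points at the same host as the page it came from."""
    return urlparse(page_url).netloc == urlparse(link).netloc
```

A recursive checker would follow only the internal links any further, while still testing each external link once.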

Xenu's Link Sleuth

Xenu's Link Sleuth is a closed source freeware program for Windows that has several useful features. I looked at version 1.2f. Xenu was easy to install and run. It has a graphical interface but no command line interface, which limits your ability to run the tool automatically. Some of the features include:
Be aware that the report page it generates shows banner ads. There is no help text, but there is a good deal of slightly disorganized documentation on the web site, plus links to two unofficial manuals.

Another user I contacted tells me that Xenu misbehaved when he ran it against a very large web site. You may need to segment large sites into smaller chunks, reduce the number of threads, and refer to the FAQ entries that discuss other ways to reduce resource usage.

W3C Link Checker

The organization that defines standards for the World Wide Web also provides a variety of tools for checking web pages for problems, including a link checker. This tool is available as a web-hosted service, and the Perl source code is also available to download so you can run it locally from the command line.

As you might expect for a tool sponsored by a standards body, the W3C Link Checker is very thorough. Unlike many other hosted link checking tools, it can search your site recursively up to a specified recursion limit. It will prompt for any passwords needed to get into protected parts of the site, though the dialog confusingly indicates that the request is for the W3C web site rather than the site requesting authentication. All pages that are password-protected get a head-scratching message, "The link is not public. You'd better specify it."
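A recursion limit of this kind can be sketched as a breadth-first walk that records how far each page is from the starting page and stops following links once the limit is reached. This is only an illustration under assumptions, not the W3C implementation. The function name crawl and its get_links parameter (any function that returns the links found on a page) are hypothetical.

```python
from collections import deque


def crawl(start_url, get_links, max_depth):
    """Breadth-first walk; return {url: depth} for every page within max_depth."""
    depths = {start_url: 0}
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        if depths[url] >= max_depth:
            # At the limit: the page itself is recorded, but its links are not followed.
            continue
        for link in get_links(url):
            if link not in depths:  # skip pages already seen, which also prevents loops
                depths[link] = depths[url] + 1
                queue.append(link)
    return depths
```

With a limit of 0 only the starting page is visited. Each extra level widens the search by one more "click" away from it.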

I don't understand many of the things that the tool complains about. For example, it sometimes complains that an anchor has been duplicated dozens of times, but when I look in the HTML source for my page I don't see the problem. Perhaps I should take the tool's advice to use the W3C HTML validator first, but it's enough of a hassle just keeping links up to date. It's nice to know that it's checking the anchors along with the links.
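One way to look for duplicated anchors independently of the tool is to count how often each possible fragment target appears in a page. The Python sketch below assumes that a target is either an id attribute on any element or a name attribute on an a element. Whether the W3C Link Checker applies exactly this rule is not something this review establishes, and the names AnchorCollector and duplicate_anchors are invented for the example.

```python
from collections import Counter
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Count how often each fragment target (id, or name on <a>) occurs."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        targets = set()  # a tag with id="x" and name="x" counts once
        if attributes.get("id"):
            targets.add(attributes["id"])
        if tag == "a" and attributes.get("name"):
            targets.add(attributes["name"])
        for target in targets:
            self.counts[target] += 1


def duplicate_anchors(html):
    """Return {anchor: occurrences} for anchors defined more than once."""
    collector = AnchorCollector()
    collector.feed(html)
    return {name: n for name, n in collector.counts.items() if n > 1}
```

Running duplicate_anchors on the saved source of a page lists the anchor names worth searching for by hand.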

The W3C tool checks the robots.txt file on a site before verifying any links on that site. Sites set up this file to tell automated web spiders, such as search engine crawlers, which parts of the site they should not crawl, and apparently its rules apply to link checkers, too. So you may get some messages that say "The link was not checked due to robots exclusion rules. Check the link manually." You know that at least part of the target site is working, since it was able to deliver the robots.txt file. I suppose it's nice that the tool is a good citizen, though the scripts that ignore the robots.txt file can be more thorough. The other tools mentioned here don't check robots.txt. You could probably configure your robots.txt file to allow the W3C Link Checker but not other tools.
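The idea of allowing one checker while excluding other robots can be tried out offline with Python's standard urllib.robotparser module. The sketch below is built on assumptions: it takes the user agent token of the W3C tool to be "W3C-checklink", which should be verified against the tool's own documentation, and the version numbers, the second user agent name, and the example.com address are purely illustrative.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: an empty Disallow permits everything for the
# named agent, while "Disallow: /" shuts out every other robot.
ROBOTS_TXT = """\
User-agent: W3C-checklink
Disallow:

User-agent: *
Disallow: /
"""


def build_parser(robots_txt):
    """Parse robots.txt text directly, without fetching anything."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)
page = "http://www.example.com/docs/page.html"
print(parser.can_fetch("W3C-checklink/4.0", page))     # assumed W3C agent name
print(parser.can_fetch("SomeOtherChecker/1.0", page))  # made-up agent name
```

Under these rules the first check should be permitted and the second refused. Keep in mind that robots.txt only restrains tools that choose to honor it.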

This link checker is not very fast. Not only is it single-threaded, but it also imposes a 1-second delay between the links it checks. I imagine that this reduces the load on their web server, but I would probably try to remove this feature if I downloaded the tool myself and decided to use it long-term.
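A delay like this can be pictured as a small throttle that is consulted before every request. The Python sketch below is not taken from the W3C tool, which is written in Perl. The class name Throttle and its clock and sleep parameters are assumptions, added so that the behavior can be tested without actually waiting.

```python
import time


class Throttle:
    """Make sure at least `delay` seconds pass between calls to wait()."""

    def __init__(self, delay, clock=time.monotonic, sleep=time.sleep):
        self.delay = delay
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.delay - (now - self._last)
            if remaining > 0:
                # Too soon after the previous request: pause for the difference.
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

A checker would call wait() just before fetching each link. Setting the delay to 0 turns the pause off, which is the kind of change one might make when checking only one's own server.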

checkbot

Checkbot is a fairly straightforward link checker, implemented in Perl. You will likely need to install a few additional Perl modules before you can install it. I successfully installed it on Windows XP and Debian Linux 3.0. It has a command-line interface and writes its output into files in HTML format. It cannot authenticate its way past pages that ask for a password.

It can be a bit frustrating waiting for the script to finish if you don't have any visible blinky lights to show network traffic. You can use the --verbose option to watch the status. The results file is sorted into sections, so it's difficult to track the progress by watching the file grow from within.

Checkbot seems to be single-threaded, but it doesn't force any delays between URLs. You cannot limit the recursion depth; it will always spider the entire site.

Checky Plug

Checky Plug is a plugin for Netscape, Mozilla, and Firefox that feeds the page you're currently viewing into any of a long list of hosted web page analyzers, including three different link checkers. It can launch the W3C Link Checker, the WDG Link Valet, and the WebThing Link Valet, which looks just like the WDG Link Valet.

All it really does is enter the URL of the page you're currently viewing into the appropriate form for the online service you choose. But it can be useful to explore the broad range of analyzers that it makes available, which includes HTML validators, a meta tag analyzer, accessibility analyzers, and more. If you don't have enough complaints about your web page now, these tools give you more than you ever wanted to hear.

The bottom line

I like Xenu for its impressive speed, and I'll use it to shake out a long list of rotted links on my sites. I'm concerned that it might break down on a really large site, though. Checkbot may be a good place to start if you want a command line interface and don't need to access password-protected pages. Step up to the W3C Link Checker if you want to be very thorough and if you don't mind messing with robots.txt rules. Use Checky Plug for instant gratification to check pages you're currently viewing, and to explore lots of other types of checking you can do against your web pages.

None of these tools can parse JavaScript code. As far as I can tell, you'll need a commercial tool to check links generated by JavaScript. If someone were motivated to do it, it should be feasible to integrate an open source JavaScript interpreter into one of these link checkers.

All of these tools are vulnerable to false hits on sites that are temporarily down. I don't really want to hear about those. Ideally, I'd like a tool that monitors my sites and only reports broken links that have been broken for a specified number of days in a row.
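The filtering described here could be layered on top of any of these tools by keeping a daily record of results and reporting only links that failed on every one of the last several days. The Python sketch below assumes such a record already exists. The function name persistent_failures, the shape of the history data, and the example URLs are all hypothetical.

```python
from datetime import date, timedelta


def persistent_failures(history, today, min_days):
    """Return URLs whose checks failed on each of the last `min_days` days.

    `history` maps url -> {date: True if the check succeeded, False if it failed}.
    """
    recent_days = [today - timedelta(days=n) for n in range(min_days)]
    broken = []
    for url, results in history.items():
        # A day with no recorded check is not counted as a failure,
        # so a gap in the record resets the streak.
        if all(results.get(day) is False for day in recent_days):
            broken.append(url)
    return sorted(broken)
```

A link on a site that was down for only one afternoon would drop out of the report as soon as a later check succeeded.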