NAME

linkchecker - check links on HTML pages for validity


SYNOPSIS

linkchecker [ --follow|--check|--ignore filter-regex ] ... [ --credentials file ] [ --cookies ] [ --html-summary file ] [ --no-redirect ] starturl


DESCRIPTION

The linkchecker recursively retrieves documents starting at starturl and generates a (lengthy) report about the links it finds.

Any number of filters can be specified with the options --follow, --check, and --ignore. The filters are regular expressions, and each URL is matched against each of them in order. The first match determines what should be done with the URL:

follow

The resource is retrieved. If it is of type text/html, it is parsed and all URLs found are added to the list of URLs to try.

check

The resource is retrieved.

ignore

The resource is not retrieved.

If no filter matches, the URL is checked. If no filter is specified on the command line, a default follow filter /^\Q$starturl\E.*/ (i.e., everything beginning with starturl) is used.

Examples:

linkchecker http://www.example.org/

will parse all HTML pages on www.example.org and check all links it finds, even those pointing to other domains. It will not follow any links found on other sites.

linkchecker -c ^http://www.example.org/archive/.*/msg.*html -f ^http://www.example.org/ -c ^http:// -i '.*' http://www.example.org/sitemap.html

will start at http://www.example.org/sitemap.html. It will follow links on all pages of www.example.org/, except those on /archive/.*/msg.*html (presumably because broken links are to be expected in an archive of old messages). Of the remaining URLs, it will check all http URLs and ignore the rest.

The option --credentials can be used to specify a credentials file. Each line in the file contains four values separated by white space: a net location (hostname and port separated by a colon; note that the protocol is not specified and the port must be given even if it is the default), a realm, a username, and a password.
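A line in a credentials file might look like this (host, realm, user, and password are placeholder values; see BUGS for the restriction that the realm must not contain white space):

www.example.org:80 intranet alice s3cret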

If the option --cookies is given, cookies sent by the web server are honored. This is especially useful if the server falls back to session IDs in URLs when the client does not support cookies, because otherwise you may end up with a lot of duplicate URLs.
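For example, to check a password-protected site that also uses cookie-based sessions, the two options can be combined (the credentials file name is arbitrary):

linkchecker --cookies --credentials ~/.linkchecker-credentials http://www.example.org/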

The output is very verbose and intended to be postprocessed by other tools before being presented to the user. For example,

grep '^http:' linkchecker.out | grep -v ' 200 '

produces a list of all broken http links together with the pages on which they occur.
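Assuming one output line per checked URL, the same filter can be used to count the broken links:

grep '^http:' linkchecker.out | grep -v ' 200 ' | wc -l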

Alternatively, the option --html-summary may be used to write the summary in HTML format to a separate file. The detailed information about each page is still printed to stdout.
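For example (the file names are arbitrary):

linkchecker --html-summary summary.html http://www.example.org/ > details.txt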

The option --no-redirect turns off automatic processing of redirects. When a page returns a 301 or 302 status code, this status code is logged and the target of the redirect is added as a new URL to be checked (with the original URL as a fake referrer).
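With --no-redirect, redirects can then be extracted from the output in the same way as broken links, assuming the output format used in the grep example above:

grep '^http:' linkchecker.out | grep ' 30[12] '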


AUTHOR

Peter J. Holzer <hjp@hjp.at>


BUGS

The realm in the credentials file cannot contain white space.

The --no-redirect option should probably be the default.