Re: Bobby Limitation - Workaround Sought

> Sometimes, the link finding options that are built
> into Bobby are not enough to automatically generate a
> precise list of files for accessibility analysis. In

This should almost certainly be considered an accessibility
failure in its own right.  The site is also a potential 
commercial failure (although many sites would fail on this
criterion), as search engines may not be able to find those
parts of the site either!

Good practice is for sites to include a site map listing all the
static pages.  Any site with a properly maintained site map should
be easy for such tools to navigate, even if a human user would have
to go well out of their way to navigate by the same means.
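
For illustration only, a site map can be as simple as a plain page of
ordinary href links, which any link-extracting tool (and any browser)
can follow; the file and page names here are invented:

    <!-- sitemap.html: one plain link per static page -->
    <h1>Site map</h1>
    <ul>
      <li><a href="index.html">Home</a></li>
      <li><a href="products.html">Products</a></li>
      <li><a href="support/index.html">Support</a></li>
      <li><a href="contact.html">Contact us</a></li>
    </ul>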

> Can anyone recommend a tool that will allow me to
> produce a complete list of all the pages in a web
> site?

I believe this is theoretically impossible once you allow scripting,
Java, etc. (it amounts to the halting problem).  More specifically, I
doubt that there are any tools that understand Microsoft HTML Help's
ActiveX/Java tree control parameter formats, or even the common idioms
for JavaScript popup pages.
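
For example, here is a rough sketch of the sort of popup idiom I mean
(the page and function names are made up); a tool that only extracts
href attributes will never see glossary.html:

    <!-- the real destination only exists inside script -->
    <a href="#" onclick="popup('glossary.html'); return false;">Glossary</a>

    <script type="text/javascript">
    function popup(url) {
        window.open(url, 'popup', 'width=400,height=300');
    }
    </script>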

I doubt that any tool can follow links that are implemented by
selecting from a pull-down list, even when this is done completely
server side (this affectation is normally done client side, with
scripting).  Any such links implemented with POST-method forms would
be dangerous to follow automatically.  Nor is it possible to search
the whole parameter space of a more general form in order to discover
error pages and other generated content.
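
Sketches of the two variants, with invented page names and an invented
CGI script path:

    <!-- Client-side: the navigation happens in the onchange handler,
         so there is no href for a tool to extract -->
    <form action="">
      <select onchange="if (this.value) window.location = this.value;">
        <option value="">Go to...</option>
        <option value="about.html">About</option>
        <option value="contact.html">Contact</option>
      </select>
    </form>

    <!-- Server-side: the destination is chosen by the CGI script from
         the POSTed value; a tool would have to submit the form to find
         it, which is not safe to do automatically -->
    <form action="/cgi-bin/goto" method="post">
      <select name="page">
        <option value="about">About</option>
        <option value="contact">Contact</option>
      </select>
      <input type="submit" value="Go">
    </form>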

Both Lynx and wget are capable of building more or less complete lists
of links from pure, valid HTML.  I think wget uses a simplified parser,
so it might get confused by unusual attribute syntax.  Neither
understands scripting or attempts to submit forms with various
parameters.
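
For example (exact option names may vary between versions of these
tools):

    # Lynx: list the links found on a single page
    lynx -dump -listonly http://www.example.org/ > links.txt

    # Lynx: traverse the http links reachable from the starting page,
    # writing its results to traverse*.dat files in the current directory
    lynx -crawl -traversal http://www.example.org/

    # wget: fetch the site recursively, deleting each page once it has
    # been parsed for links; the log then records every URL it found
    wget --recursive --no-verbose --delete-after -o wget.log http://www.example.org/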

Lynx should only be used this way on your own site, or with explicit
permission, as it does not obey the robots exclusion protocol
(robots.txt) that allows a site to restrict the activity of such
automated tools, nor does it pause between requests to avoid
overloading a site.  wget honours the robots protocol by default; it
should not be disabled without the permission of the site owner, nor
should the user agent string of either tool be changed to simulate
another browser without permission.
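
As a sketch, a politer wget invocation might look like this; the two
options noted at the end override the site owner's wishes and should
only be used with permission:

    # robots.txt is honoured by default; --wait pauses between requests
    wget --recursive --no-verbose --delete-after --wait=2 \
         -o wget.log http://www.example.org/

    # Only with the site owner's permission:
    #   -e robots=off        ignore robots.txt
    #   --user-agent="..."   masquerade as another browser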

Received on Tuesday, 10 April 2001 05:13:31 UTC