Re: Bobby Limitation - Workaround Sought

From: Charles McCathieNevile <charles@w3.org>
Date: Tue, 10 Apr 2001 06:56:02 -0400 (EDT)
To: David Woolley <david@djwhome.demon.co.uk>
cc: <w3c-wai-ig@w3.org>
Message-ID: <Pine.LNX.4.30.0104100652130.26402-100000@tux.w3.org>
There is also the original robot built by Henrik Frystyk Nielsen for the
libwww code library. You can teach it to understand JavaScript links,
although that takes a bit of basic programming skill. And what Dave said
about having permission goes double for the robot - it is extremely
efficient, which means it is easily capable of running amok very fast.

http://www.w3.org/Robot/

Cheers

Charles

On Tue, 10 Apr 2001, David Woolley wrote:

  > Sometimes, the link finding options that are built
  > into Bobby are not enough to automatically generate a
  > precise list of files for accessibility analysis. In

  This should almost certainly be considered an accessibility
  failure in its own right.  The site is also a potential
  commercial failure (although many sites would fail on this
  criterion), as search engines may be unable to find those
  parts of the site!

  Good practice for sites is to include a site map listing all the
  static pages.  Any site with a properly maintained site map should
  be easy to navigate by such tools, even if a user would have to
  go a long way out of their way to navigate using the same means.

  > Can anyone recommend a tool that will allow me to
  > produce a complete list of all the pages in a web
  > site?

  I believe this is theoretically impossible once you allow scripting
  and Java etc. ("the halting problem").  More specifically, I doubt that
  there are any tools that understand Microsoft HTML Help's ActiveX/Java
  tree control parameter formats, or even the common idioms for
  JavaScript popup pages.

  I doubt that any tool can follow links that are implemented by
  selecting from a pull down list, even when done completely server side
  (this affectation is normally done client side, with scripting).  Any
  such links implemented with POST method forms would be dangerous to
  follow.  It's not possible to search the whole parameter space of
  a more general form in order to trigger error pages, etc.

  Both Lynx and wget are capable of building more or less complete lists
  of links from pure, valid, HTML.  I think wget uses a simplified parser,
  so might get confused by unusual parameter syntax.  Neither understands
  scripting nor attempts to submit forms with various parameters.
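  [A minimal sketch of what such a pure-HTML link extractor does, in
  modern Python - purely illustrative, not the actual parser used by
  Lynx or wget.  It collects href attributes from anchor tags, and the
  JavaScript "link" in the sample markup is invisible to it, which is
  exactly the limitation discussed above:]

```python
from html.parser import HTMLParser

class LinkLister(HTMLParser):
    """Collect href targets from <a> tags, in the spirit of
    what lynx or wget can extract from pure, valid HTML."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)

page = """
<html><body>
<a href="/sitemap.html">Site map</a>
<a href="contact.html">Contact</a>
<!-- a JavaScript popup "link": invisible to a static parser -->
<span onclick="window.open('hidden.html')">More</span>
</body></html>
"""

lister = LinkLister()
lister.feed(page)
print(lister.links)  # hidden.html never appears in the list
```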

  Lynx should only be used on your own site, or with explicit permission, as
  it does not obey the protocols that allow a site to restrict the activity
  of such automated tools, nor does it pause between requests to avoid
  overloading a site.  The robots protocol should not be disabled in wget
  without the permission of the site owner, nor should the user agent string
  of either tool be modified to simulate another browser.
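  [The robots exclusion protocol mentioned above is simple enough that
  Python's standard library ships a parser for it.  The sketch below,
  with made-up robots.txt rules rather than any real site's, shows how
  a well-behaved crawler checks permission before each fetch:]

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules a site might publish to keep
# automated link-listers out of its dynamic areas.
rules = """
User-agent: *
Disallow: /cgi-bin/
Disallow: /search
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

# A polite tool consults the rules before every request and
# identifies itself honestly in its User-Agent string.
print(rp.can_fetch("MyLinkLister/0.1", "http://example.org/sitemap.html"))  # True
print(rp.can_fetch("MyLinkLister/0.1", "http://example.org/cgi-bin/form"))  # False
```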


-- 
Charles McCathieNevile    http://www.w3.org/People/Charles  phone: +61 409 134 136
W3C Web Accessibility Initiative     http://www.w3.org/WAI    fax: +1 617 258 5999
Location: 21 Mitchell street FOOTSCRAY Vic 3011, Australia
(or W3C INRIA, Route des Lucioles, BP 93, 06902 Sophia Antipolis Cedex, France)
Received on Tuesday, 10 April 2001 06:56:23 GMT

This archive was generated by hypermail 2.2.0+W3C-0.50 : Tuesday, 19 July 2011 18:13:54 GMT