Re: Crawling Through W3C Validator

On Tue, 9 Apr 2002, Olle Olsson wrote:
>
> Oscar Eg Gensmann wrote:

[ snip ]

What both of you are describing is exactly the kind of functionality
implemented in the Site Valet spider.  Indeed, if you browse the archives
of this list from about six months ago, you will see that I had it
spidering the www.w3.org site itself and reported some statistics
similar to those you are contemplating.

> > At the moment my idea for the crawling and validation path is
> > something like this:

That is indeed similar to the Site Valet spider, though the
implementation differs.

> >  1. Request the main page for the domain.

Indeed.

> > 2. If the main page is smaller than xxx characters, mark it in the DB
> >    as "too small".

We could do that.
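
By way of illustration, here is a minimal sketch of steps 1-2 in
Python (the threshold, the charset fallback and the size test are my
own placeholders, not anything Valet actually does):

    import urllib.request

    MIN_SIZE = 512   # the "xxx characters" threshold -- pick your own

    def fetch_main_page(domain):
        """Request the main page for a domain; return its body or None."""
        try:
            with urllib.request.urlopen("http://%s/" % domain,
                                        timeout=30) as resp:
                charset = resp.headers.get_content_charset() or "iso-8859-1"
                return resp.read().decode(charset, errors="replace")
        except OSError:
            return None

    def size_ok(body):
        """Step 2: is the page big enough to be worth analysing?
        If not, mark the domain "too small" in your DB."""
        return body is not None and len(body) >= MIN_SIZE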

> > 3. If there's an HTML redirect in the page, catch the redirect URL
> >    and follow that one as the main page. Mark the domain as an HTML
> >    redirect in the DB.

That is simply a special case of extracting a link from a page.
Valet extracts all links and enters them in its database.

> > 4. If the main page is a frameset, request all the src files for the
> >    frameset and use the one which is largest. Mark the domain in the
> >    DB as a frameset domain.

That is another instance of extracting links.
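
Both the meta refresh of point 3 and the frame src of point 4 fall
out of one pass over the markup. A rough sketch in Python, using only
the standard library (my own illustration, not how Valet itself is
implemented):

    from html.parser import HTMLParser

    class LinkExtractor(HTMLParser):
        """Collect every URL a page points at: <a href>, <frame src>
        and the target of a <meta http-equiv=refresh>."""
        def __init__(self):
            super().__init__()
            self.links = []
        def handle_starttag(self, tag, attrs):
            attrs = dict(attrs)
            if tag == "a" and attrs.get("href"):
                self.links.append(("a", attrs["href"]))
            elif tag in ("frame", "iframe") and attrs.get("src"):
                self.links.append(("frame", attrs["src"]))
            elif (tag == "meta"
                  and (attrs.get("http-equiv") or "").lower() == "refresh"
                  and "url=" in (attrs.get("content") or "").lower()):
                url = attrs["content"].split("=", 1)[1].strip()
                self.links.append(("meta-refresh", url))

    p = LinkExtractor()
    p.feed('<meta http-equiv="refresh" content="0; url=/home.html">')
    print(p.links)   # [('meta-refresh', '/home.html')]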

> > 5. Find * number of links pointing to the same domain (* might be set
> >    between 2 and 4 in the first round).
> >
> > 6. Check the * links through the same procedure as the main page. If
> >    a link does not qualify, then grab a new one from the main page
> >    pool.

Site Valet differs here.  But the information you are interested in
is immediately available in a database query.

> > 7. Start a loop running through the * links + the main page, doing
> >    the following:

Take care with that!  A robot that hits its target with rapid-fire
HTTP requests is badly behaved, and likely to be unwelcome.  Site
Valet's strategy is to spider all domains it runs on in parallel,
but ensure no single domain gets more than one GET request per minute.
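
A single-threaded way to get that interleaving (my sketch of the
policy just described, not Valet's actual code) is a priority queue
keyed on each domain's next-allowed time:

    import heapq, time

    POLITE_INTERVAL = 60   # at most one GET per domain per minute

    def crawl(queues, fetch):
        """queues: {domain: [url, ...]}.  Visits every queued URL,
        interleaving domains so none is hit more than once a minute."""
        heap = [(0.0, d) for d in queues if queues[d]]
        heapq.heapify(heap)
        while heap:
            when, domain = heapq.heappop(heap)
            delay = when - time.time()
            if delay > 0:
                time.sleep(delay)    # every domain is still cooling off
            fetch(queues[domain].pop(0))   # your HTTP GET + processing
            if queues[domain]:
                heapq.heappush(heap,
                               (time.time() + POLITE_INTERVAL, domain))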

> >     1. Send the page through a validator (this is where my concerns
> >        are right now, because an internal validator would be the
> >        best; however, I don't seem to be able to find one).

http://valet.webthing.com/

> >     2. Get the various information from the page and store it in the
> >        DB (DOCTYPE, characters, generator, isvalid, maybe number of
> >        errors, date of validation, etc.)

Ditto.  If it doesn't do exactly what you want, it can be customised.
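
If you do roll your own, the storage side is small. A sketch with
SQLite (the table layout is just my reading of the fields you list):

    import sqlite3, datetime

    db = sqlite3.connect("survey.db")
    db.execute("""CREATE TABLE IF NOT EXISTS pages (
                    url        TEXT PRIMARY KEY,
                    doctype    TEXT,     -- DOCTYPE found, if any
                    nchars     INTEGER,  -- size in characters
                    generator  TEXT,     -- <meta name=generator> content
                    isvalid    INTEGER,  -- 1 = valid, 0 = not
                    nerrors    INTEGER,  -- number of validation errors
                    validated  TEXT      -- date of validation
                  )""")

    def record(url, doctype, nchars, generator, isvalid, nerrors):
        db.execute("INSERT OR REPLACE INTO pages VALUES (?,?,?,?,?,?,?)",
                   (url, doctype, nchars, generator, int(isvalid),
                    nerrors, datetime.date.today().isoformat()))
        db.commit()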

> > 8. Mark the domain as finished and continue to the next domain.

Just enter each domain you are interested in into the database, and
it will be spidered.

> One specific fact complicates the automatic process: many (most?) pages
> do not specify the doctype. Eliminating all those that are quiet about
> doctype might result in nearly all pages being eliminated from analysis.

Such documents are, by definition, not valid.  But we _can_ treat them
as having an implicit doctype, and mark them as "provisionally
valid" based on that.

> So you might end up with the information that 95% of all pages lack
> doctype specification, and with no other information at all about these
> pages. This raises the interesting question about what web browsers
> actually do with these documents.

Each browser will do its own thing.  It is IMHO futile to generalise
about browser behaviour in the face of invalid markup.

> From the large-scale validation point of view, one should be able to
> handle these cases in some way. What I did was, when the doctype was
> missing, to hypothesize that it was of some specific type, and then
> revalidate with an explicitly given doctype.

Yep, that's a widely-used strategy.
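
In code terms the trick is just to prepend a candidate doctype and
validate again, recording that the verdict rests on an assumption.
A sketch (the candidate list and the 'validate' callback are
placeholders for whatever validator interface you end up with):

    CANDIDATES = [
        '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"'
        ' "http://www.w3.org/TR/html4/loose.dtd">',
        '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">',
    ]

    def provisional_validate(body, validate):
        """validate(doc) -> bool.  Returns (verdict, doctype_used)."""
        if body.lstrip().lower().startswith("<!doctype"):
            return ("valid" if validate(body) else "invalid", None)
        for dt in CANDIDATES:
            if validate(dt + "\n" + body):
                return ("provisionally valid", dt)
        return ("invalid under all assumed doctypes", None)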

> This is what people working on improving the W3C validator stated:
> the structure of the output might change at any time. And I guess that
> it is a valid warning.

I would be happy to discuss either
  (1) Offering you this as a service on a commercial basis.
or
  (2) Supplying you with software that will enable you to run the
      entire exercise on your own server.

> You have earlier mentioned Tidy as a tool related to your task. Of
> course, its different aim makes it not very suitable to what you (and
> I) are trying to achieve.
>
> There is at least one commercial HTML validator available (sorry, I do
> not have access to any link here), and probably some non-commercial
> ones also.

I am offering you one solution.  If you are expecting to do it all
yourself as described and only want the validator, Liam's stuff
at www.htmlhelp.org or at arealvalidator.com might be of interest.

> QUESTION: What experiences do people have with validators that could
> be regarded as alternatives to the W3C one? This concerns questions of
> ease of use and portability, but also coverage (how well do they cover
> the set of relevant standards, as well as individual standards).

Either mine or Liam's will rigorously follow the specs, and indeed
use the same underlying parser as the W3C service.  But more generally
you are right to be wary: there are some products that make totally
bogus claims to "validate" HTML.

> > As you mention, a special web-service validator which returns XML
> > information about the validation would of course be the best
> > solution in the long run.

Valet offers you XML.  Not to mention EARL (the W3C/WAI Evaluation and
Report Language, an RDF schema).
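
Which means the per-page results can be consumed by a script rather
than scraped. Purely to give the flavour -- the tag names below are
placeholders, not Valet's actual schema, so check the service for the
real element names:

    import urllib.request
    import xml.etree.ElementTree as ET

    def read_report(report_url):
        """Fetch an XML validation report; return (verdict, error count)."""
        with urllib.request.urlopen(report_url) as resp:
            tree = ET.parse(resp)
        verdict = tree.findtext(".//verdict")   # placeholder tag
        errors = tree.findall(".//error")       # placeholder tag
        return verdict, len(errors)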

> Just as I did. I wanted to get some statistics about conformance, and
> quickly, and not have to either embark on the task of writing a
> complete validator or wait for validators producing output in XML.

It has been offering XML for nearly a year.  No need to wait!

> > Regarding the results, if I should manage to pull this off, I will
> > do a Danish website with statistics and stuff from the DB, and if
> > there is interest it should be possible to translate this site into
> > English. I will of course post a message to this list when
> > finished. :-)

I'll be happy to supply the tools you need, and to collaborate with
you in your project if that would help.

-- 
Nick Kew

Site Valet - the mark of Quality on the Web.
<URL:http://valet.webthing.com/>

Received on Tuesday, 9 April 2002 15:36:32 UTC