Re: Crawling Through w3c Validator

Oscar Eg Gensmann wrote:

>
>  Dear Olle.
>
>  Right after I posted my message to the list I saw the old ones
>  discussing a similar subject (Validator output and standards
>  conformance), which I assume are the ones you're referring to?

Yes, those are the ones.

>  It's interesting to hear about your experience with the subject,
>  especially the idea of qualifying pages before validating them. Just
>  yesterday I thought of a somewhat similar idea to minimise the load
>  on the validation :).
>
>  At the moment my idea for the crawling and validation path is
>  something like this:
>
>  -----------------------
>
>  1. Request the main page for the domain.
>
>  2. If the main page is smaller than xxx characters, mark it in the
>  DB as "too small".
>
>  3. If there's an HTML redirect in the page, catch the redirect URL
>  and follow that one as the main page. Mark the domain as an HTML
>  redirect in the DB.
>
>  4. If the main page is a frameset, request all the src files for the
>  frameset and use the one which is largest. Mark the domain in the DB
>  as a frameset domain.
>
>  5. Find * number of links pointing to the same domain (* might be
>  set between 2 and 4 in the first round).
>
>  6. Check the * links through the same procedure as the main page. If
>  a link does not qualify, then grab a new one from the main-page
>  pool.
>
>  7. Start a loop running through the * links + the main page, doing
>  the following:
>
>      1. Send the page through a validator (this is where my concerns
>      are right now, because an internal validator would be best;
>      however, I don't seem to be able to find one).
>
>      2. Get the various pieces of information from the page and store
>      them in the DB (DOCTYPE, characters, generator, isvalid, maybe
>      number of errors, date of validation, etc.).
>
>  8. Mark the domain as finished and continue to the next domain.
>
>  ------------------------
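
For concreteness, here is roughly how I picture the qualification path
you outline, as a minimal Python sketch. All the specifics are my own
assumptions: the size threshold, the link count, the regular
expressions, and the validate_and_store() stub, which stands in for
whatever validator backend is finally chosen.

    import re
    import urllib.request
    from urllib.parse import urljoin

    MIN_SIZE = 1000        # stands in for the "xxx characters" of step 2
    LINKS_PER_DOMAIN = 3   # the "*" parameter of step 5 (2 to 4 in round one)

    def fetch(url):
        with urllib.request.urlopen(url) as resp:
            return resp.read().decode("iso-8859-1", "replace")

    def qualify(url, marks):
        """Return usable page text for url, or None; record marks (steps 2-4)."""
        page = fetch(url)
        if len(page) < MIN_SIZE:
            marks.append("too small")          # step 2
            return None
        m = re.search(r'http-equiv\s*=\s*["\']?refresh[^>]*url=([^"\'>]+)',
                      page, re.I)
        if m:                                  # step 3: HTML redirect
            marks.append("html redirect")
            return qualify(urljoin(url, m.group(1)), marks)
        frames = re.findall(r'<frame[^>]+src\s*=\s*["\']?([^"\'>\s]+)',
                            page, re.I)
        if frames:                             # step 4: frameset, largest src
            marks.append("frameset")
            page = max((fetch(urljoin(url, f)) for f in frames), key=len)
        return page

    def crawl_domain(domain):
        marks = []
        base = "http://" + domain + "/"
        main = qualify(base, marks)
        if main is not None:
            hrefs = re.findall(r'href\s*=\s*["\']?([^"\'>\s]+)', main, re.I)
            pool = [urljoin(base, h) for h in hrefs           # step 5: links
                    if (domain in h or not h.startswith("http"))
                    and not h.startswith("mailto:")]
            pages = [main]
            while pool and len(pages) < LINKS_PER_DOMAIN + 1:
                page = qualify(pool.pop(0), marks)  # step 6: links that do not
                if page is not None:                # qualify are replaced from
                    pages.append(page)              # the main-page pool
            for page in pages:                      # step 7: validate each page
                validate_and_store(domain, page)    # hypothetical backend stub
        marks.append("finished")                    # step 8
        return marks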

One specific fact complicates the automatic process: many (most?) pages
do not specify the doctype. Eliminating all those that are quiet about
doctype might result in nearly all pages being eliminated from analysis.
This is effectively what the W3C validator does most of the time (and
for good reasons!). That is, the validator's response typically is:

    Fatal Error: no document type declaration; will parse without
    validation
    I could not parse this document, because it uses a public
    identifier that is not in my catalog.

So you might end up with the information that 95% of all pages lack a
doctype specification, and with no other information at all about these
pages. This raises the interesting question of what web browsers
actually do with these documents. Do they scan the document to determine
what type it is? Do they assume that it is of a specific type and, if
this turns out to be false, make another assumption and restart
processing the page? Or do they try to make sense out of a page lacking
a doctype specification, and if so, how can we believe that we actually
know how they are presenting the page? This is really a black hole!
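
For the crawling scenario this at least suggests a cheap local
pre-check before spending a validator round trip on a guaranteed fatal
error. A sketch, assuming a simple regular expression over the head of
the document is good enough for counting purposes:

    import re

    # Local pre-check: pages without a doctype declaration can be
    # recorded directly as "no doctype" instead of triggering the
    # fatal error shown above.
    DOCTYPE_RE = re.compile(r'<!DOCTYPE\s+HTML\b', re.I)

    def has_doctype(page):
        # The declaration must precede the root element, so looking at
        # the first couple of kilobytes of the page is enough.
        return bool(DOCTYPE_RE.search(page[:2048]))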

From the large-scale validation point of view, one should be able to
handle these cases in some way. What I did was, when the doctype was
missing, to hypothesize that the page was of some specific type, and
then revalidate it with that doctype explicitly given. This is a kind of
guessing that can go wrong, of course. But what else can one do? And
think about the extra load it might put on a shared validator resource:
re-scanning the same document many times, just to see which hypothesized
doctype it validates best against!
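
In code terms, the guessing amounts to something like the following
sketch. The validate() function is hypothetical (it should return the
number of errors the validator reports for a page), and the list of
candidate doctypes is of course debatable:

    # Sketch of the doctype-guessing revalidation: prepend each
    # candidate doctype and keep the one producing the fewest errors.
    # validate(page) is a hypothetical hook into whatever validator
    # backend is used.

    CANDIDATE_DOCTYPES = [
        '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"'
        ' "http://www.w3.org/TR/html4/loose.dtd">',
        '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"'
        ' "http://www.w3.org/TR/html4/strict.dtd">',
        '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">',
    ]

    def best_doctype_guess(page):
        results = []
        for doctype in CANDIDATE_DOCTYPES:
            errors = validate(doctype + "\n" + page)  # one extra run each
            results.append((errors, doctype))
        return min(results)  # the (error_count, doctype) with fewest errors

Note that this multiplies the number of validator runs per doctype-less
page by the number of candidates, which is exactly the extra load on a
shared validator resource mentioned above.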


>  This is, for the moment, a brief description of the path I'm
>  planning to use.
>
>  I do realise that if I go through the online w3c validator there may
>  be some problems.
>
>  First of all, I'm not sure whether or not the people responsible for
>  the validator service would be happy about me "bombing" the service
>  (this is a basic program I'm doing which should be able to batch
>  validate/crawl a number of domains, whether it be all .dk domains or
>  just a small pool of domains), because I'm planning to build the
>  system so that I can distribute the crawling application to a number
>  of computers to increase the speed of the crawling, and since it is
>  also multithreaded it might have a serious performance impact on the
>  online validator, depending on the online validator's strength
>  (there are currently about 100,000 .dk domains as far as I remember,
>  which will require in total about 300k-500k page validations through
>  the validator (just an estimate) in the first batch run).
>
>  Second, I know that the validator might change its output, which
>  will corrupt the result if it happens halfway through the crawl.

This is what the people working on improving the W3C validator have
stated: the structure of the output might change at any time. And I
guess that it is a valid warning.
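
One way to limit the damage is to depend on as little of the output as
possible. A sketch, with the caveat that the verdict phrases below are
my assumptions about the current output, and precisely the kind of
detail that may change:

    # Depend only on a minimal verdict marker in the validator's
    # output, so cosmetic changes elsewhere do not silently corrupt
    # the results. The phrases are assumptions about the current
    # output format.

    def parse_verdict(validator_output):
        if "No errors found" in validator_output:
            return "valid"
        if "does not validate" in validator_output:
            return "invalid"
        return "unknown"  # format changed or validation failed; review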

>  So to avoid these problems I was thinking of doing a local install
>  of the w3c validator on a server and running every page through
>  that, which should solve some of the problems. Unfortunately I'm, as
>  I wrote earlier, not a very skilled Perl programmer or Linux wiz, so
>  I'm having some problems with this.
>
>  At the moment I'm looking to convince some of my Linux friends to
>  help me out with that part, but if any of you here on the list have
>  some information about how I/we could do this easily, please let me
>  know. If some of you should be interested in helping me get a local
>  version up, maybe on a server you can spare or on my own win2k
>  server, please contact me. You're welcome to mail me or send me an
>  ICQ or MSN Messenger message at any time. Also if you just want to
>  hear about my little project. You'll find my contact information on
>  my webpage www.gensmann.com, which is sadly enough one of those
>  pages I never get to finish; however, I think it should be w3c valid
>  as far as I remember :-)
>
>  Should you know of a Windows/ASP alternative or something similar,
>  information about these would also be gladly received. I have looked
>  into TidyCOM and the CSE validator; unfortunately they do not work
>  quite as well as the w3c one regarding the output and doctype check.

You have earlier mentioned Tidy as a tool related to your task. Of
course, its different aim makes it not very suitable for what you (and
I) are trying to achieve.

There is at least one commercial HTML validator available (sorry, I do
not have access to any link here), and probably some non-commercial ones
also.

QUESTION: What experiences do people have with validators that could be
regarded as alternatives to the W3C one? This concerns questions of
ease of use and portability, but also coverage (how well they cover the
set of relevant standards, as well as how well they cover individual
standards).

>  As you mention, a special webservice validator which returns XML
>  information about the validation would of course be the best
>  solution in the long run. However, working with online solutions
>  myself, I know that these types of things might take some time for
>  the validator team to pull off, and I could delay my project until
>  something like this is available. However, when you've got the idea
>  and the drive, sometimes you just have to use what's possible,
>  because otherwise it will end up like one of those 1000s of other
>  ideas you never finished :-)

Just as I did. I wanted to get some statistics about conformance, and
quickly, without either embarking on the task of writing a complete
validator or having to wait for validators producing output in XML. So
for me it was better to see what could be done with what was available,
and this might lead to insights about what I would really like from a
validator tool.

/olle

>  Regarding the results: if I should manage to pull this off, I will
>  do a Danish website with statistics and other material from the DB,
>  and if there is interest it should be possible to translate this
>  site into English. I will of course post a message to this list when
>  finished. :-)
>
>  Yours faithfully
>  Oscar Eg Gensmann
>
>  ----- Original Message -----
>
>       From: Olle Olsson
>       To: Oscar Eg Gensmann
>       Cc: www-validator@w3.org ; O. Olsson
>       Sent: Tuesday, April 09, 2002 8:51 AM
>       Subject: Re: Crawling Through w3c Validator
>
>       Sounds interesting.



--
------------------------------------------------------------------
Olle Olsson   olleo@sics.se   Tel: +46 8 633 15 19  Fax: +46 8 751 72 30

 [Svenska W3C-kontoret: olleo@w3.org]
SICS [Swedish Institute of Computer Science]
Box 1263
SE - 164 29 Kista
Sweden
------------------------------------------------------------------

Received on Tuesday, 9 April 2002 06:15:14 UTC