
Re: Crawling Through w3c Validator

From: Olle Olsson <olleo@sics.se>
Date: Tue, 09 Apr 2002 08:51:05 +0200
Message-ID: <3CB28F58.30394970@sics.se>
To: Oscar Eg Gensmann <oscar@gensmann.dk>
CC: www-validator@w3.org, "O. Olsson" <olleo@w3.org>
Sounds interesting.

I have done a small test with a similar goal. But instead of doing a
huge scan, I just selected a sample set of pages and validated them.

My experience is that it is non-trivial to perform this kind of task
using the W3C validator, mainly for the reason discussed recently on
this list: the W3C validator (as it stands right now) is not built to
provide easily software-accessible error diagnostics. When a new
version is available, providing diagnostic reports in XML, the task
will be much simpler.
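If such XML reports do appear, machine processing becomes a few lines
of code. A minimal sketch in Python; note that the element and
attribute names below ("error", "line", "col", "message") are my own
invention for illustration, not any published report format:

```python
import xml.etree.ElementTree as ET

# Hypothetical XML diagnostics from a future validator release.
# The structure is assumed, not specified anywhere.
report = """<result uri="http://example.org/">
  <error line="12" col="8" message="element FOO undefined"/>
  <error line="30" col="1" message="unclosed element P"/>
</result>"""

root = ET.fromstring(report)
errors = root.findall("error")
for err in errors:
    # Each error carries its position and message as attributes.
    print(err.get("line"), err.get("message"))
```

With a format like this, counting or classifying errors across
thousands of pages is trivial, which is exactly what a large-scale
conformance survey needs.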

Another cause of problems is that some sites do not deliver sensible
information to the validator. Perhaps they sniff the user-agent type
and try to be clever about what to deliver; when the validator is the
user agent, they do not provide sensible pages.

Furthermore, some pages are redirections, in practice consisting only
of a META element in the HEAD, such as
    <meta http-equiv="refresh" content="10;URL=some.url">
which means that the actual page to inspect is the one at "some.url",
not the page containing the redirection.
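One way a crawler could cope with this is to detect the META refresh
itself and chase the target URL. A rough sketch in Python; the regex
is an assumption on my part: it only handles the attribute order and
quoting style shown above, not every way a refresh can be written:

```python
import re

# Loose pattern for a META refresh redirection; deliberately tolerant
# of whitespace and optional quotes, but it assumes http-equiv comes
# before content, as in the example above.
META_REFRESH = re.compile(
    r'<meta\s+http-equiv=["\']?refresh["\']?\s+'
    r'content=["\']?\s*\d+\s*;\s*URL=([^"\'>]+)',
    re.IGNORECASE)

def refresh_target(html):
    """Return the redirection target URL, or None if not a redirect."""
    m = META_REFRESH.search(html)
    return m.group(1).strip() if m else None

page = ('<html><head>'
        '<meta http-equiv="refresh" content="10;URL=some.url">'
        '</head></html>')
```

Here refresh_target(page) yields "some.url", which the crawler would
then fetch and hand to the validator instead of the stub page.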

Then we have pages that contain only a FRAMESET element, which means
that the real page content comes from other pages. Such FRAMESET pages
are often small and, due to their limited structure, may very well be
acceptable from a conformance point of view, even though the full set
of HTML that an ordinary web browser encounters when hitting the
FRAMESET page may be huge and error-ridden.
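So a crawler that wants the real content behind a FRAMESET has to
collect the SRC attributes of the FRAME elements and fetch those pages
as well. A sketch using Python's standard html.parser (a simplifying
assumption: frames named here are all relative to the same base URL):

```python
from html.parser import HTMLParser

class FrameCollector(HTMLParser):
    """Gather the SRC attribute of every FRAME element in a page."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "frame":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)

doc = ('<frameset cols="20%,80%">'
       '<frame src="menu.html"><frame src="main.html">'
       '</frameset>')
collector = FrameCollector()
collector.feed(doc)
```

After feeding the FRAMESET page, collector.sources holds the child
pages ("menu.html", "main.html") that must be validated to judge what
a browser user actually sees.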

What I did in the analysis of a set of pages was to disregard those
pages that were remarkably small (there were pages less than 100
characters long, and how much sensible information can you cram into a
document of that size?) or really huge (I saw one that, as a single
file, was 129K characters, which should be regarded as an
unmaintainable size). When computing statistics over pages there is a
real risk of the results being skewed by a small number of pages with
abnormal characteristics.
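The outlier filtering I used amounts to a simple size test before any
statistics are computed. A sketch, with the cutoffs (100 and 129K
characters) taken from the figures above rather than from any
principled rule:

```python
# Size thresholds for keeping a page in the sample. These particular
# numbers are just the extremes observed in my own small test, not a
# recommendation.
MIN_SIZE = 100
MAX_SIZE = 129_000

def keep(page_text):
    """True if the page is neither a trivial stub nor abnormally huge."""
    return MIN_SIZE <= len(page_text) <= MAX_SIZE

pages = ["x" * 50, "x" * 5_000, "x" * 200_000]
sample = [p for p in pages if keep(p)]
```

Only the middle page survives; the stub and the giant are dropped so
they cannot distort averages computed over the sample.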

Still, what you are trying to do is really interesting and important.
There seems to be a lack of statistical information about web standards
conformance. And without such statistics, how do we know where we, as a
technical sector, stand? What quality criteria should be fulfilled, and
to what extent? And what tools and methods are in practice needed to
get the quality up to an acceptable level?

I would very much like to see some data from your investigations. Please
make an announcement when you believe you have seen some patterns.

/olle


Oscar Eg Gensmann wrote:

> Dear Validator List,
>
> I posted this message once when I hadn't joined the list. It seems
> like it didn't get through my mail server, so just in case I'll post
> it right to the list now that I have joined. Please forgive a
> possible double posting.
>
> I am currently working on a project which intends to crawl a huge
> number of domains (all .dk domains) and check some random pages
> within each domain to see if they are using valid HTML. The result
> should be a searchable database indicating which domains are using
> valid code, plus some other info.
>
> The database is going to be the foundation for a Danish website I'm
> constructing about using valid HTML code and the advantages of it.
> It will contain links, information and articles about valid HTML
> coding and so on. The database will provide statistical information
> about the current state of Danish websites. My hope is that it will
> be possible to do more than one crawl of the sites over time, but at
> the moment I'm only trying to get the first crawl done.
>
> I realise that sending this amount of pages through the online W3C
> validator using the crawler I have built may have an influence on
> the online service. I have tried to install the validator locally on
> a Win2k server, but not being a Perl guy it gives me a rather large
> amount of trouble. I normally do "light" ASP.NET programming, and a
> little .NET Windows programming for the crawler.
>
> My question is now: will it be possible (legal?) for me to go
> through the online validator? If not, does anyone have a suggestion
> for how I can do a local install on a Win2k server without being a
> great Perl wiz, or maybe you know some way to actually integrate the
> validator into my Windows app using VB.NET or something similar?
>
> I have tried TidyCOM, which works marvellously well in my app, but
> the program seems designed more for changing the input code than for
> plain and simple validation of it, so I discarded it after a couple
> of tries because it didn't seem to warn about certain errors, but
> just fixed them. Does there exist something like a W3CValidatorCOM
> or similar?
>
> I hope some of you are able to help me, in the interest of educating
> people about W3C standards, and maybe can point me in a direction
> for solving my problem.
>
> Yours faithfully,
> Oscar Eg Gensmann
> -- Oscar@Gensmann.com

--
------------------------------------------------------------------
Olle Olsson   olleo@sics.se   Tel: +46 8 633 15 19  Fax: +46 8 751 72 30

 [Svenska W3C-kontoret: olleo@w3.org]
SICS [Swedish Institute of Computer Science]
Box 1263
SE - 164 29 Kista
Sweden
------------------------------------------------------------------
Received on Tuesday, 9 April 2002 02:50:48 GMT

This archive was generated by hypermail 2.2.0+W3C-0.50 : Wednesday, 25 April 2012 12:14:03 GMT