Fw: Crawling Through w3c Validator from Oscar Eg Gensmann on 2002-04-09 (www-validator@w3.org from April 2002)

From: Oscar Eg Gensmann <oscar@gensmann.dk>
Date: Tue, 9 Apr 2002 13:58:12 +0200
To: <www-validator@w3.org>
Message-ID: <008c01c1dfbd$d2cae5d0$5928968c@OEG>
Oops, smart one: I forgot to send this to the list. Here is a forward of the
mail sent to Olle. Sorry, Olle, for receiving this mail twice :-)

----- Original Message -----
From: "Oscar Eg Gensmann" <oscar@gensmann.dk>
To: "Olle Olsson" <olleo@sics.se>
Sent: Tuesday, April 09, 2002 1:56 PM
Subject: Re: Crawling Through w3c Validator


> Olle wrote:
>
> > One specific fact complicates the automatic process: many (most?) pages
> > do not specify the doctype. Eliminating all those that are quiet about
> > doctype might result in nearly all pages being eliminated from analysis.
> > This is what the W3C validator does most of the time (and for good
> > reasons!). That is, the validator's response typically is:
> >
> >     Fatal Error: no document type declaration; will parse without
> >     validation
> >     I could not parse this document, because it uses a public
> >     identifier that is not in my catalog.
>
> I have identified this problem and was actually thinking of handling it by
> marking the page in the DB as having no doctype and then, if possible,
> running it through the validator as HTML 4.01 (forced doctype).
>
> My experience with most Danish pages/authors is that if they don't have a
> doctype they won't validate (if they are just a little bit complex),
> because it's a sign of the author not knowing what valid HTML code is. I
> don't know if this is a worldwide tendency, but I somehow believe it to be.
>
> I realise that this may cause the DB to consist of 50+% invalid
> documents; however, for my current project this isn't going to be a big
> issue, because what I'm trying to do is show the public some of the HTML
> problems with Danish websites.
>
> From my point of view, one of the biggest problems is the missing
> doctype declaration in the documents. I do realise that, from a
> scientific perspective, this may be wrong, because you might want to
> validate a document to check the code anyway.
>
> Being able to show the public (and hopefully some different IT newspapers
> and so on) that 50+% of the sites on the Danish web aren't even using a
> doctype is a goal in itself, because my project is meant to make people
> aware of W3C standards and will be backed up by a series of articles about
> W3C standards and what they can do for a site.
>
> After some thought, though, I actually think I will do the first batch
> crawl by eliminating all the pages that don't have a doctype. Mainly for
> ease, but also because it will minimize the number of pages I actually
> have to send through the validator the first time. They will then be
> marked in the DB as "No doctype".
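
The first-batch split described above can be sketched roughly as follows.
This is only a minimal sketch, not the actual crawler: the names
classify_doctype and first_batch are hypothetical, pages are assumed to be
already fetched as strings, and the doctype is only looked for near the
start of the document.

```python
import re
from typing import Dict, List, Tuple

# Matches a doctype declaration and, when present, its public identifier.
DOCTYPE_RE = re.compile(
    r'<!DOCTYPE\s+[^\s>]+(?:\s+PUBLIC\s+"([^"]*)")?[^>]*>',
    re.IGNORECASE,
)

NO_DOCTYPE = "No doctype"  # the DB marker named in the mail


def classify_doctype(html: str) -> str:
    """Return the public identifier, or NO_DOCTYPE if none is declared."""
    # Assumption: only the first 2048 characters are inspected.
    match = DOCTYPE_RE.search(html[:2048])
    if match is None:
        return NO_DOCTYPE
    # A doctype without a public identifier is returned verbatim.
    return match.group(1) or match.group(0)


def first_batch(pages: Dict[str, str]) -> Tuple[Dict[str, str], List[str]]:
    """Split pages into (url -> doctype to validate, urls marked NO_DOCTYPE)."""
    to_validate: Dict[str, str] = {}
    skipped: List[str] = []
    for url, html in pages.items():
        doctype = classify_doctype(html)
        if doctype == NO_DOCTYPE:
            skipped.append(url)
        else:
            to_validate[url] = doctype
    return to_validate, skipped
```

Only the URLs in the first return value would go to the validator in the
first run; the second list is what gets stored as "No doctype".
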
>
> Depending on the results and speed, I might do some more runs where I, for
> instance, just run the sites with missing doctypes through with a forced
> doctype like HTML 4.01 and count the errors on the pages. By doing the
> runs in this "step" method I will be able to get some results a little
> faster and get a feeling for how long the runs take, and so on. Should I
> get carried away, I might check the ones which validate as HTML 4.01 to
> check for XHTML and so on, but the main priority will be to get started :D
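
A later run with a forced doctype could look something like the sketch
below. It is written under loud assumptions: "HTML 4.01" is taken to mean
the Transitional DTD, the names force_doctype and second_pass are
hypothetical, and count_errors stands in for whatever validator call is
eventually used (it is not the W3C validator's interface).

```python
from typing import Callable, Dict

# Assumption: the forced doctype is HTML 4.01 Transitional.
FORCED_DOCTYPE = (
    '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" '
    '"http://www.w3.org/TR/html4/loose.dtd">'
)


def force_doctype(html: str) -> str:
    """Prepend the forced doctype to a page that declares none."""
    return FORCED_DOCTYPE + "\n" + html.lstrip()


def second_pass(
    pages: Dict[str, str],
    count_errors: Callable[[str], int],
) -> Dict[str, int]:
    """Map each doctype-less page URL to its error count under the forced doctype.

    count_errors is a hypothetical callable wrapping the chosen validator.
    """
    return {url: count_errors(force_doctype(html)) for url, html in pages.items()}
```

Passing the validator in as a callable keeps each run separate, so a later
XHTML run would only need a different doctype constant.
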
>
> > There is at least one commercial HTML validator available (sorry, I do
> > not have access to any link here), and probably some non-commercial ones
> > also.
>
> It might be the CSE validator you're thinking of:
> http://www.htmlvalidator.com/
>
> However, when I checked, it seemed like it was only using a forced doctype,
> though I might be wrong. In any case, it would be more reliable to use the
> W3C validator (you know, people somehow respect the W3C a little more than
> some unknown validator; even though they don't make valid web pages, most
> know or have heard about the W3C :-)
>
>
> > QUESTION: What experiences do people have with validators that could be
> > regarded as alternatives to the W3C one? This concerns question of ease
> > of use, portability, but also coverage (how well do they cover the set
> > of relevant standards  as  well as cover individual standards).
>
> My experience is that most alternative validators available for public use
> mostly focus either on correcting errors when they process the document
> (like Tidy) or on validating against a specific doctype defined from the
> start (like CSE). During my search over the last couple of days I found
> quite a lot of Perl stuff; however, none of it would be something the
> regular John Doe web guy would install. Besides the Perl, I only saw a few
> online versions (like the W3C one) and then the Tidy and CSE validators.
> And then a bunch of HTML editors incorporating those validators.
>
> If you look for validators on search engines you will most likely end up
> at http://validator.w3.org one way or another :-)
>
> I'm a Windows guy most of the time (not an evangelist, just using it), so
> I don't know about alternatives in Java and so on; however, I seem to
> recall seeing some during my search.
>
> But anyway, let this be a big "yoho" to the W3C validator team to get
> started on a validator which spits out XML for all us weird people wanting
> to do valid crawling checks and others :D
>
> Best regards
> Oscar Eg Gensmann
>
>
>
> > >  ----- Original Message -----
> > >
> > >       From: Olle Olsson
> > >       To: Oscar Eg Gensmann
> > >       Cc: www-validator@w3.org ; O. Olsson
> > >       Sent: Tuesday, April 09, 2002 8:51 AM
> > >       Subject: Re: Crawling Through w3c Validator
> > >
> > >       Sounds interesting.
> >
> >
> >
> > --
> > ------------------------------------------------------------------
> > Olle Olsson   olleo@sics.se   Tel: +46 8 633 15 19  Fax: +46 8 751 72 30
> >
> >  [Svenska W3C-kontoret: olleo@w3.org]
> > SICS [Swedish Institute of Computer Science]
> > Box 1263
> > SE - 164 29 Kista
> > Sweden
> > ------------------------------------------------------------------
> >
> >
> >
> >
>
Received on Tuesday, 9 April 2002 07:58:28 UTC