- From: Oscar Eg Gensmann <oscar@gensmann.dk>
- Date: Tue, 9 Apr 2002 13:58:12 +0200
- To: <www-validator@w3.org>
Oops, smart one: I forgot to send this to the list. Here is a forward of
the mail I sent to Olle. Sorry, Olle, that you are receiving this mail
twice :-)

----- Original Message -----
From: "Oscar Eg Gensmann" <oscar@gensmann.dk>
To: "Olle Olsson" <olleo@sics.se>
Sent: Tuesday, April 09, 2002 1:56 PM
Subject: Re: Crawling Through w3c Validator

> Olle Wrote:
>
> > One specific fact complicates the automatic process: many (most?) pages
> > do not specify the doctype. Eliminating all those that are quiet about
> > doctype might result in nearly all pages being eliminated from analysis.
> > This is what the W3C validator does most of the time (and for good
> > reasons!) That is, the validator's response typically is:
> >
> > Fatal Error: no document type declaration; will parse without
> > validation
> > I could not parse this document, because it uses a public identifier
> > that is not in my catalog.
>
> I have identified this problem and was actually thinking of handling it
> by marking the page in the DB as having no doctype and then, if possible,
> running it through the validator as HTML 4.01 (forced doctype).
>
> My experience with most Danish pages and authors is that if they don't
> have a doctype, they won't validate (if they are even a little complex),
> because a missing doctype is a sign that the author doesn't know what
> valid HTML is. I don't know whether this is a worldwide tendency, but I
> suspect it is.
>
> I realise that this may cause the DB to consist of 50+% invalid
> documents; however, for my current project this isn't a big issue,
> because what I'm trying to do is show the public some of the HTML
> problems with Danish websites.
>
> From my point of view, one of the biggest problems is the missing
> doctype declaration in the documents. I do realise that this may be
> wrong from a scientific perspective, because you might want to validate
> a document to check the code anyway.
>
> Being able to show the public (and hopefully some different IT
> newspapers, and so on) that 50%+ of the sites on the Danish web aren't
> even using a doctype is a goal in itself, because my project is meant to
> make people aware of W3C standards, and it will be backed up by a series
> of articles about W3C standards and what they can do for a site.
>
> After some thought, though, I think I will do the first batch crawl by
> eliminating all the pages that don't have a doctype. Mainly for ease,
> but also because it will minimize the number of pages I actually have to
> send through the validator the first time. They will then be marked in
> the DB as "No doctype".
>
> Depending on the results and speed, I might do some more runs where I,
> for instance, run just the sites with missing doctypes through with a
> forced doctype like HTML 4.01 and count the errors on the pages. By
> doing the runs with this "step" method I will be able to get some
> results a little faster and get a feeling for how long the runs take,
> and so on. Should I get carried away, I might check the ones that
> validate as HTML 4.01 for XHTML and so on, but the main priority is to
> get started :D
>
> > There is at least one commercial HTML validator available (sorry, I do
> > not have access to any link here), and probably some non-commercial
> > ones also.
>
> It might be the CSE validator you're thinking of:
> http://www.htmlvalidator.com/
>
> However, when I checked, it seemed to use only a forced doctype, though
> I might be wrong. In any case, it would be more reliable to use the W3C
> validator (you know, people somehow respect the W3C a little more than
> some unknown validator; even though they don't make valid web pages,
> most know or have heard about the W3C :-)
>
> > QUESTION: What experiences do people have with validators that could be
> > regarded as alternatives to the W3C one?
> > This concerns questions of ease of use, portability, but also coverage
> > (how well do they cover the set of relevant standards as well as cover
> > individual standards).
>
> My experience is that most alternative validators available for public
> use mostly focus either on correcting errors when they parse the
> document (like Tidy) or on validating against a specific doctype defined
> from the start (like CSE). During my search over the last couple of days
> I found quite a lot of Perl stuff; however, none of it is something the
> regular John Doe web guy would install. Besides the Perl I only saw a
> few online versions (like W3C's), and then the Tidy and CSE validators.
> And then a bunch of HTML editors incorporating those validators.
>
> If you look for validators on search engines you will most likely end up
> at http://validator.w3c.org in some way or other :-)
>
> I'm a Windows guy most of the time (not an evangelist, just using it),
> so I don't know about alternatives in Java and so on; however, I seem to
> recall some from my search.
>
> But anyway, let this be a big "yoho" to the W3C validator team to get
> started on a validator which spits out XML for all us weird people
> wanting to do valid crawling checks, and others :D
>
> Best regards
> Oscar Eg Gensmann
>
> > > ----- Original Message -----
> > > From: Olle Olsson
> > > To: Oscar Eg Gensmann
> > > Cc: www-validator@w3.org ; O. Olsson
> > > Sent: Tuesday, April 09, 2002 8:51 AM
> > > Subject: Re: Crawling Through w3c Validator
> > >
> > > Sounds interesting.
> >
> > --
> > ------------------------------------------------------------------
> > Olle Olsson olleo@sics.se Tel: +46 8 633 15 19 Fax: +46 8 751 72 30
> >
> > [Svenska W3C-kontoret: olleo@w3.org]
> > SICS [Swedish Institute of Computer Science]
> > Box 1263
> > SE - 164 29 Kista
> > Sweden
> > ------------------------------------------------------------------
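
The "step" method described in the message (first mark every page without a doctype, send only the rest to the validator, then optionally re-run the marked pages with a forced doctype) could be sketched roughly as below. This is only an illustration under assumptions, not the crawler used in the project: the function names, the "To validate" status, and the example.dk URLs are all hypothetical, the doctype check is a simple regular expression rather than a real SGML parse, and no validator is actually called.

```python
import re

# Rough check for a <!DOCTYPE ...> declaration at the start of a page.
# Allows a BOM, whitespace, an XML declaration and comments before it.
# Assumption: a regex is "good enough" here; it is not a full SGML parser.
_DOCTYPE_RE = re.compile(
    r"^\ufeff?\s*(?:<\?xml[^>]*\?>\s*)?(?:<!--.*?-->\s*)*<!DOCTYPE\s+[^>]+>",
    re.IGNORECASE | re.DOTALL,
)

NO_DOCTYPE = "No doctype"    # status label taken from the message
TO_VALIDATE = "To validate"  # hypothetical label for pages to be validated


def has_doctype(html: str) -> bool:
    """Return True if the page starts with a doctype declaration."""
    return _DOCTYPE_RE.match(html) is not None


def first_pass(pages: dict) -> dict:
    """Step 1: map each URL to a status without calling any validator."""
    return {
        url: (TO_VALIDATE if has_doctype(html) else NO_DOCTYPE)
        for url, html in pages.items()
    }


def second_pass_candidates(statuses: dict) -> list:
    """Step 2 (optional): URLs to re-run later with a forced doctype."""
    return sorted(url for url, s in statuses.items() if s == NO_DOCTYPE)


if __name__ == "__main__":
    # Hypothetical crawl result: URL -> fetched HTML source.
    pages = {
        "http://example.dk/a": (
            '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN">\n'
            "<html><head><title>a</title></head><body></body></html>"
        ),
        "http://example.dk/b": "<html><body>no doctype here</body></html>",
    }
    statuses = first_pass(pages)
    for url, status in sorted(statuses.items()):
        print(url, "->", status)
    print("forced-doctype run:", second_pass_candidates(statuses))
```

Keeping the first pass free of validator calls is what makes it cheap: only pages marked "To validate" need to be submitted in the first run, and the "No doctype" list can be fed to a forced-doctype run later.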
Received on Tuesday, 9 April 2002 07:58:28 UTC