- From: Olle Olsson <olleo@sics.se>
- Date: Tue, 09 Apr 2002 12:15:39 +0200
- To: Oscar Eg Gensmann <oscar@gensmann.dk>
- CC: www-validator@w3.org
Oscar Eg Gensmann wrote:

> Dear Olle.
>
> Right after I posted my message to the list I saw the old ones
> discussing a similar subject (Validator output and standards
> conformance), which I assume is the ones you're referring to?

Yes, those are the ones.

> It's interesting to hear about your experience with the subject,
> especially the idea of qualifying pages before validating them. Just
> as of yesterday I thought of a somewhat similar idea to minimise the
> load on the validator :).
>
> At the moment my idea for the crawling and validation path is
> something like this:
>
> -----------------------
>
> 1. Request the main page for the domain.
>
> 2. If the main page is smaller than xxx characters, mark it in the DB
>    as "too small".
>
> 3. If there is an HTML redirect in the page, catch the redirect URL
>    and follow that one as the main page. Mark the domain as an HTML
>    redirect in the DB.
>
> 4. If the main page is a frameset, request all the src files for the
>    frameset and use the one which is largest. Mark the domain in the
>    DB as a frameset domain.
>
> 5. Find * number of links pointing to the same domain (* might be set
>    between 2 and 4 in the first round).
>
> 6. Check the * links through the same procedure as the main page. If
>    a link does not qualify, grab a new one from the main-page pool.
>
> 7. Start a loop running through the * links + the main page, doing
>    the following:
>
>    1. Send the page through a validator (this is where my concerns
>       are right now, because an internal validator would be the best;
>       however, I don't seem to be able to find one).
>
>    2. Get the various information from the page and store it in the
>       DB (DOCTYPE, characters, generator, isvalid, maybe number of
>       errors, date of validation, etc.).
>
> 8. Mark the domain as finished and continue to the next domain.
>
> ------------------------

One specific fact complicates the automatic process: many (most?) pages
do not specify the doctype. Eliminating all those that are quiet about
doctype might result in nearly all pages being eliminated from
analysis. This is what the W3C validator does most of the time (and for
good reasons!). That is, the validator's response typically is:

   Fatal Error: no document type declaration; will parse without
   validation

   I could not parse this document, because it uses a public identifier
   that is not in my catalog.

So you might end up with the information that 95% of all pages lack a
doctype specification, and with no other information at all about these
pages.

This raises the interesting question of what web browsers actually do
with these documents. Do they scan the document to determine what type
it is? Do they assume that it is of a specific type and, if this turns
out to be false, make another assumption and restart processing the
page? Or do they try to make sense out of a page lacking a doctype
specification -- and if so, how can we believe that we actually know
how they are presenting the page? This is really a black hole!

From the large-scale validation point of view, one should be able to
handle these cases in some way. What I did was, when the doctype was
missing, to hypothesize that it was of some specific type, and then
revalidate the page with that doctype explicitly given. This is a kind
of guessing that can go wrong, of course. But what else can one do? And
think about the extra load it might put on a shared validator resource:
re-scanning the same document many times, just to see which "doctype=X"
validation it passes best! (I sketch both your qualification steps and
this doctype guessing in code below.)

> This is for the moment a brief description of the path I'm planning
> to use.
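Your qualification steps (1-5) are concrete enough to sketch in code.
Here is a rough sketch, in Python only because that keeps it short --
translate it into whatever your crawler ends up being written in. The
size threshold, the link count, and all the names are my own
illustrative choices, not anything standard:

-----------------------

# A minimal sketch of qualification steps 2-5 above. MIN_SIZE,
# LINKS_WANTED and qualify() are invented for illustration.
import re
import urllib.request
from urllib.parse import urljoin, urlparse

MIN_SIZE = 512      # step 2: "too small" threshold (assumed value)
LINKS_WANTED = 3    # step 5: "*" set between 2 and 4

def fetch(url):
    with urllib.request.urlopen(url, timeout=30) as resp:
        return resp.read().decode("iso-8859-1", errors="replace")

def qualify(url):
    html = fetch(url)
    if len(html) < MIN_SIZE:
        return {"status": "too small"}            # step 2
    # Step 3: follow an HTML meta-refresh "redirect" if present.
    m = re.search(r'http-equiv=["\']?refresh["\']?[^>]*url=([^"\'>\s]+)',
                  html, re.I)
    if m:
        url = urljoin(url, m.group(1))
        html = fetch(url)
    # Step 4: for a frameset, use the largest frame source.
    frames = re.findall(r'<frame[^>]+src=["\']?([^"\'>\s]+)', html, re.I)
    if frames:
        pages = [(fetch(urljoin(url, f)), urljoin(url, f)) for f in frames]
        html, url = max(pages, key=lambda p: len(p[0]))
    # Step 5: collect a few links that stay on the same domain.
    domain = urlparse(url).netloc
    links = []
    for href in re.findall(r'<a[^>]+href=["\']?([^"\'>\s#]+)', html, re.I):
        target = urljoin(url, href)
        if urlparse(target).netloc == domain and target not in links:
            links.append(target)
        if len(links) == LINKS_WANTED:
            break
    return {"status": "ok", "url": url, "links": links}

-----------------------

A real crawler would want a proper HTML parser instead of these regular
expressions, but for merely qualifying pages they tend to be good
enough.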
> I do realise that if I go through the online W3C validator there may
> be some problems.
>
> First of all, I'm not sure whether or not the people responsible for
> the validator service would be glad of me "bombing" the service (this
> is a basic program I'm doing, which should be able to batch
> validate/crawl a number of domains, whether it be all .dk domains or
> just a small pool of domains), because I'm planning to build the
> system so that I can distribute the crawling application to a number
> of computers to increase the speed of the crawling, and, being
> multithreaded as well, it might have a serious performance impact on
> the online validator, depending on the online validator's strength
> (there are currently about 100,000 .dk domains as far as I remember,
> which will require in total about 300k-500k page validations through
> the validator (just an estimate) in the first batch run).
>
> Second, I know that the validator might change its output, which will
> corrupt the result if that happens halfway through crawling.

This is what the people working on improving the W3C validator stated:
the structure of the output might change at any time. And I guess that
it is a valid warning.

> So to avoid these problems I was thinking of doing a local install of
> the W3C validator on a server and running every page through that,
> which should solve some of the problems. Unfortunately, as I wrote
> earlier, I'm not a very skilled Perl programmer or Linux wiz, so I'm
> having some problems with this.
>
> At the moment I'm trying to convince some of my Linux friends to help
> me out with that part, but if any of you here on the list have some
> information about how I/we could do this easily, please let me know.
> If some of you should be interested in helping me get a local version
> up, maybe on a server you can spare or on my own Win2k server, please
> contact me. You're welcome to mail me, or send me an ICQ or MSN
> Messenger message, at any time -- also if you just want to hear about
> my little project. You'll find my contact information on my webpage
> www.gensmann.com , which is sadly enough one of those pages I never
> get to finish; however, I think it should be W3C valid as far as I
> remember :-)
>
> Should you know of a Windows/ASP alternative or something similar,
> information about these would also be gladly received. I have looked
> into TidyCOM and the CSE validator; unfortunately they do not work
> quite as well as the W3C one regarding the output and the doctype
> check.

You have earlier mentioned Tidy as a tool related to your task. Of
course, its different aim makes it not very suitable to what you (and
I) are trying to achieve. There is at least one commercial HTML
validator available (sorry, I do not have access to any link here), and
probably some non-commercial ones also.

QUESTION: What experiences do people have with validators that could be
regarded as alternatives to the W3C one? This concerns questions of
ease of use and portability, but also coverage (how well they cover the
set of relevant standards, as well as how well they cover the
individual standards).
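And should you get a local install running, the querying side is the
easy part. Here is a sketch of your step 7.1 combined with the doctype
guessing I described above -- again in Python, and note how much of it
is assumption on my part: the local URL, the "fragment" parameter for
direct input (if your install only accepts a "uri", you would have to
serve the modified page locally instead), and the strings scraped out
of the HTML report, which, as warned above, may change at any time.

-----------------------

# Sketch of sending a page through a locally installed validator and,
# when no doctype is declared, re-validating with hypothesized
# doctypes. URL, parameter name and scraped strings are assumptions.
import time
import urllib.parse
import urllib.request

VALIDATOR = "http://localhost/w3c-validator/check"   # assumed local URL
CANDIDATE_DOCTYPES = [
    '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">',
    '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">',
]

def validate(source):
    data = urllib.parse.urlencode({"fragment": source}).encode("ascii")
    with urllib.request.urlopen(VALIDATOR, data) as resp:
        return resp.read().decode("utf-8", errors="replace")

def validate_with_guessing(source):
    report = validate(source)
    # The fatal-error message quoted earlier in this mail:
    if "no document type declaration" not in report:
        return report, None
    # Re-validate with each hypothesized doctype prepended, keeping
    # the first one the document passes under.
    for doctype in CANDIDATE_DOCTYPES:
        time.sleep(1)   # be gentle, even to your own server
        report = validate(doctype + "\n" + source)
        if "No errors found" in report:   # guessed success marker
            return report, doctype
    return report, None

-----------------------

The "no document type declaration" test reuses the fatal-error message
I quoted earlier; the "No errors found" test is my guess at the success
marker, and it is exactly the kind of fragile scraping that an XML
output format would make unnecessary.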
> As you mention, a special web-service validator which returns XML
> information about the validation would of course be the best solution
> in the long run. However, working with online solutions myself, I
> know that these types of things might take some time for the
> validator team to pull off. I could delay my project until something
> like this is available; however, when you've got the idea and the
> drive, sometimes you just have to use what's possible, because
> otherwise it will end up like one of those 1000s of other ideas you
> never finished :-)

Just as I did. I wanted to get some statistics about conformance, and
quickly, and not have to either embark on the task of writing a
complete validator or wait for validators producing output in XML. So
for me it was better to see what could be done with what was available
-- and this might lead to insights about what I would really like from
a validator tool.

/olle

> Regarding the results: if I should manage to pull this off, I will do
> a Danish website with statistics and stuff from the DB, and if there
> is interest it should be possible to translate this site into
> English. I will of course post a message to this list when finished.
> :-)
>
> Yours faithfully,
> Oscar Eg Gensmann
>
> ----- Original Message -----
> From: Olle Olsson
> To: Oscar Eg Gensmann
> Cc: www-validator@w3.org ; O. Olsson
> Sent: Tuesday, April 09, 2002 8:51 AM
> Subject: Re: Crawling Through w3c Validator
>
> Sounds interesting.

--
------------------------------------------------------------------
Olle Olsson                 olleo@sics.se
Tel: +46 8 633 15 19        Fax: +46 8 751 72 30
[Svenska W3C-kontoret: olleo@w3.org]
SICS [Swedish Institute of Computer Science]
Box 1263
SE - 164 29 Kista
Sweden
------------------------------------------------------------------
Received on Tuesday, 9 April 2002 06:15:14 UTC