Re: Crawling Through w3c Validator from Oscar Eg Gensmann on 2002-04-09 (www-validator@w3.org from April 2002)

From: Oscar Eg Gensmann <oscar@gensmann.dk>
Date: Tue, 9 Apr 2002 10:37:05 +0200
To: <www-validator@w3.org>
Message-ID: <002b01c1dfa1$badf4e50$5928968c@OEG>

Dear Olle.

Right after I posted my message to the list I saw the older messages discussing a similar subject (Validator output and standards conformance), which I assume are the ones you're referring to?

It's interesting to hear about your experience with the subject, especially the idea of qualifying pages before validating them. Just yesterday I thought of a somewhat similar idea to minimise the load on the validator :).

At the moment my idea for the crawling and validation path is something like this:

-----------------------

1. Request the main page for the domain.

2. If the main page is smaller than xxx characters, mark it in the DB as "too small".

3. If there's an HTML redirect in the page, catch the redirect URL and follow that one as the main page. Mark the domain as an HTML redirect in the DB.

4. If the main page is a frameset, request all the src files for the frameset and use the largest one. Mark the domain in the DB as a frameset domain.

5. Find * links pointing to the same domain (* might be set to between 2 and 4 in the first round).

6. Check the * links through the same procedure as the main page. If a link does not qualify, grab a new one from the main page pool.

7. Start a loop running through the * links + the main page, doing the following:

   a. Send the page through a validator (this is where my concerns are right now, because an internal validator would be best, but I don't seem to be able to find one).

   b. Get the various information from the page and store it in the DB (DOCTYPE, characters, generator, isvalid, maybe number of errors, date of validation, etc.).

8. Mark the domain as finished and continue to the next domain.

------------------------

This is, for the moment, a brief description of the path I'm planning to use.

I do realise that if I go through the online W3C validator there may be some problems.
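
The qualification part of that path (size check, HTML redirect, frameset, same-domain links) could be sketched roughly as below. This is only an illustration in Python using the standard library: the names `PageScan` and `classify` are hypothetical, `MIN_CHARS` is an assumed stand-in for the "xxx characters" threshold, and fetching pages and picking the largest frame are left out.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
import re

MIN_CHARS = 500        # assumed value for the "xxx characters" threshold
LINKS_PER_DOMAIN = 3   # the "*" in step 5; 2 to 4 is suggested for the first round


class PageScan(HTMLParser):
    """Collects the meta-refresh target, frame sources and links of one page."""

    def __init__(self):
        super().__init__()
        self.redirect = None
        self.frames = []
        self.links = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "meta" and (a.get("http-equiv") or "").lower() == "refresh":
            # content looks like "0; url=http://..."
            m = re.search(r"url\s*=\s*['\"]?([^'\";]+)", a.get("content") or "", re.I)
            if m:
                self.redirect = m.group(1).strip()
        elif tag == "frame" and a.get("src"):
            self.frames.append(a["src"])
        elif tag == "a" and a.get("href"):
            self.links.append(a["href"])


def classify(html, base_url):
    """Return (status, urls) for one already-fetched page."""
    scan = PageScan()
    scan.feed(html)
    if scan.redirect:
        return "html_redirect", [urljoin(base_url, scan.redirect)]
    if scan.frames:
        return "frameset", [urljoin(base_url, src) for src in scan.frames]
    if len(html) < MIN_CHARS:
        return "too_small", []
    host = urlparse(base_url).netloc
    same = []
    for href in scan.links:
        url = urljoin(base_url, href)
        if urlparse(url).netloc == host and url != base_url and url not in same:
            same.append(url)
    return "ok", same[:LINKS_PER_DOMAIN]
```

The sketch tests for redirects and framesets before the size check, because such pages tend to be very short and would otherwise be marked "too small"; that ordering is an assumption, not part of the list above. For a frameset, the caller would still have to fetch each returned URL and keep the largest one.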

First of all, I'm not sure whether the people responsible for the validator service will be glad of me "bombing" it. This is a basic program which should be able to batch validate/crawl a number of domains, whether that is all .dk domains or just a small pool of domains. I'm planning to build the system so that I can distribute the crawling application to a number of computers to increase the crawling speed, and since it will also be multithreaded it might have a serious performance impact on the online validator, depending on the validator's strength. There are currently about 100,000 .dk domains as far as I remember, which will require about 300k to 500k page validations in total through the validator (just an estimate) in the first batch run.
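
The estimate follows from one main page plus 2 to 4 links per domain (step 5). A small sketch of the arithmetic is below; the function names are hypothetical, and the rate of one validation per second is purely an assumed figure to give a feel for the duration, not a measured one.

```python
DOMAINS = 100_000            # approximate number of .dk domains quoted above
LINKS_MIN, LINKS_MAX = 2, 4  # range suggested for "*" in step 5


def total_pages(domains, links_per_domain):
    """Main page plus the sampled links, for every domain."""
    return domains * (1 + links_per_domain)


def days_needed(pages, pages_per_second):
    """Wall-clock days at a steady validation rate (assumed, not measured)."""
    return pages / pages_per_second / 86_400


low = total_pages(DOMAINS, LINKS_MIN)    # 300000
high = total_pages(DOMAINS, LINKS_MAX)   # 500000
print(low, high, round(days_needed(high, 1.0), 1))  # 300000 500000 5.8
```

At that assumed rate a single crawler would keep the validator busy for several days, which is why distributing the crawl raises the load question in the first place.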

Second, I know that the validator's output might change, which would corrupt the results if it happened halfway through the crawl.
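
One way to at least notice such a change is to store the date of validation with every result, as the list of stored fields already suggests, and compare results per day. Below is a minimal sketch using Python's sqlite3 module; the table name `page_result`, its column names and the helper functions are all hypothetical, chosen only to mirror the fields mentioned in the validation loop.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # a file path would be used for a real crawl
conn.execute("""
    CREATE TABLE page_result (
        url          TEXT PRIMARY KEY,
        domain       TEXT NOT NULL,
        doctype      TEXT,
        characters   INTEGER,
        generator    TEXT,
        is_valid     INTEGER,   -- 1 = valid, 0 = invalid
        error_count  INTEGER,
        validated_on TEXT       -- ISO date of validation
    )
""")


def store(row):
    """Insert or overwrite the result for one page (8 values, in column order)."""
    conn.execute(
        "INSERT OR REPLACE INTO page_result VALUES (?, ?, ?, ?, ?, ?, ?, ?)", row)


def results_by_day():
    """Pages validated and pages found valid, grouped by validation date."""
    return conn.execute(
        "SELECT validated_on, COUNT(*), SUM(is_valid) FROM page_result "
        "GROUP BY validated_on ORDER BY validated_on").fetchall()
```

A sudden jump in the share of valid pages between two dates would be a hint that the validator, rather than the crawled pages, has changed.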

So to avoid these problems I was thinking of doing a local install of the W3C validator on a server and running every page through that, which should solve some of the problems. Unfortunately, as I wrote earlier, I'm not a very skilled Perl programmer or Linux wiz, so I'm having some problems with this.

At the moment I'm trying to convince some of my Linux friends to help me out with that part, but if any of you here on the list have information about how I/we could do this easily, please let me know. If some of you are interested in helping me get a local version up, maybe on a server you can spare or on my own Win2k server, please contact me. You're welcome to mail me or send me an ICQ or MSN Messenger message at any time, also if you just want to hear about my little project. You'll find my contact information on my webpage www.gensmann.com, which is sadly one of those pages I never get to finish; however, I think it should be W3C valid as far as I remember :-)

Should you know of a Windows/ASP alternative or something similar, information about it would also be gladly received. I have looked into TidyCOM and the CSE validator; unfortunately they do not work quite as well as the W3C validator regarding the output and DOCTYPE check.

As you mention, a special web service validator which returns XML information about the validation would of course be the best solution in the long run. However, working with online solutions myself, I know that this type of thing might take some time for the validator team to pull off. I could delay my project until something like this is available, but when you've got the idea and the drive, sometimes you just have to use what's possible, because otherwise it will end up like one of those thousands of other ideas you never finished :-)

Regarding the results, if I manage to pull this off, I will make a Danish website with statistics and other material from the DB, and if there is interest it should be possible to translate this site into English. I will of course post a message to this list when finished. :-)

Yours faithfully
Oscar Eg Gensmann

----- Original Message -----
From: Olle Olsson
To: Oscar Eg Gensmann
Cc: www-validator@w3.org ; O. Olsson
Sent: Tuesday, April 09, 2002 8:51 AM
Subject: Re: Crawling Through w3c Validator

Sounds interesting.

Received on Tuesday, 9 April 2002 04:37:09 UTC