- From: Terje Bless <link@pobox.com>
- Date: Tue, 20 May 2003 19:27:34 +0200
- To: W3C Validator <www-validator@w3.org>
- cc: Andrew Walker <K982145@kingston.ac.uk>
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Andrew Walker <K982145@kingston.ac.uk> wrote:

>I am carrying out a small scale web crawl (for an MSc research project)
>and was looking to attempt to validate those pages found that have a
>doctype (currently this stands at around 10000 pages)
>
>I was wanting to check if there was any limit imposed on the number of
>queries made per day to the validator and if automated querying (albeit
>slowly) was frowned upon.

The Validator imposes no restrictions on the number of queries per day, and
automated querying is allowed provided one applies a healthy dose of common
sense.

For your example, 10000 URIs works out to about one request per 8 seconds to
check all the documents over a 24 hour period (10k * 8 = 80k vs. 86400
seconds in a day), one per 4 seconds to finish in 12 hours, and one per 2
seconds to finish in 6 hours.

Given the normal performance of a Validation run, and the response times of
web sites, I would suggest you simply run all your URLs sequentially (in
other words, not parallelized). That should work out to somewhere between 2
and 8 seconds per request on average (6 to 24 hours to complete). That should
not place an unacceptable load on the Validator or on the sites checked, and
you could then either batch process the results afterwards or use some
queuing scheme to process the results in a separate thread/process.

(But for God's sake, test your code on a small subset of the URIs before
unleashing it on the world! ;D)

If you are looking to perform this operation multiple times for all 10k URLs,
you should investigate installing the Validator on a local machine and/or
building your own front-end to OpenSP to perform the Validation. For the
latter, the code for the Validator is available and much of it can be used
directly in a custom wrapper class.

Please feel free to ask on the list, or stop by #validator on
irc.freenode.net, if you need any help.
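The pacing arithmetic and the "one request at a time" suggestion above can be
sketched in a few lines of Python. This is only an illustration, not part of
the Validator: the names hours_to_complete and run_sequentially are made up
here, and check stands for whatever function you write yourself to submit one
URI and return its result. The 2 second minimum interval is an assumption
taken from the low end of the range mentioned above.

```python
import time
from typing import Callable, Iterable, List, Tuple


def hours_to_complete(n_uris: int, seconds_per_request: float) -> float:
    """Wall-clock hours needed to check n_uris one after another."""
    return n_uris * seconds_per_request / 3600.0


def run_sequentially(
    uris: Iterable[str],
    check: Callable[[str], object],  # hypothetical: your own per-URI checker
    min_interval: float = 2.0,       # assumed floor on seconds per request
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> List[Tuple[str, object]]:
    """Check URIs strictly one at a time, never faster than min_interval."""
    results = []
    for uri in uris:
        started = clock()
        try:
            outcome = check(uri)
        except Exception as exc:
            # Keep going on failure; store the error for later batch processing.
            outcome = exc
        results.append((uri, outcome))
        # Sleep only for whatever is left of the interval after the check.
        remaining = min_interval - (clock() - started)
        if remaining > 0:
            sleep(remaining)
    return results


if __name__ == "__main__":
    for pace in (2, 4, 8):
        print(f"{pace} s/request -> {hours_to_complete(10_000, pace):.1f} h")
    # Dry run on a tiny subset with a stand-in checker and no delay.
    sample = ["http://example.org/a", "http://example.org/b"]
    print(run_sequentially(sample, check=lambda uri: "not checked", min_interval=0.0))
```

For 10000 URIs the first loop prints roughly 5.6, 11.1 and 22.2 hours, which
is where the rounded 6, 12 and 24 hour figures come from. Collecting the
results in a list matches the "batch process afterwards" option; a queue
feeding a separate thread or process would be the alternative.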
- -- 
I have to admit that I'm hoping the current situation with regard to XML
Namespaces and W3C XML Schemas is a giant practical joke, but I see no signs
of pranksters coming forward with a gleeful smile to announce that they were
just kidding.
                                                       -- Simon St.Laurent

-----BEGIN PGP SIGNATURE-----
Version: PGP SDK 3.0

iQA/AwUBPsplhjfq/Hm0uHR3EQKQmACgq6xPphJDKHSD18jzQzAZRAvwgZEAoOLt
SvH8MSklUzMG/K01drmUKxFA
=4kS7
-----END PGP SIGNATURE-----
Received on Tuesday, 20 May 2003 13:27:54 UTC