Re: Automated queries and query limits

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Andrew Walker <K982145@kingston.ac.uk> wrote:

>I am carrying out a small scale web crawl (for an MSc research project)
>and was looking to attempt to validate those pages found that have a
>doctype (currently this stands at around 10000 pages)
>
>I was wanting to check if there was any limit imposed on the number of
>queries made per day to the validator and if automated querying (albeit
>slowly) was frowned upon.

The Validator imposes no restrictions on the number of queries per day and
automated querying is allowed provided one applies a healthy dose of common
sense.

For your example, checking 10000 URIs over a 24 hour period works out to
about one request every 8 seconds (10k * 8 = 80k vs. 86400 seconds in a day),
one every 4 seconds to finish in about 12 hours, and one every 2 seconds to
finish in about 6 hours.
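
To make the arithmetic above easy to redo for a different number of URIs,
here is a minimal sketch in Python. The function name is made up for
illustration; it only divides the time available by the number of requests.

```python
def seconds_per_request(total_uris: int, hours: float) -> float:
    """Average gap between requests needed to finish within `hours`."""
    return hours * 3600 / total_uris


if __name__ == "__main__":
    # The 10000 URIs from the question, over the three time budgets above.
    for hours in (24, 12, 6):
        gap = seconds_per_request(10_000, hours)
        print(f"{hours:>2} hours: one request every {gap:.2f} seconds")
```

The exact figures come out slightly above the rounded 8, 4 and 2 seconds
quoted above, since a day has 86400 seconds rather than 80000.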

Given the normal performance of a Validation run, and the response times of
web sites, I would suggest you simply run all your URLs sequentially (in other
words, not parallelized); it should work out to somewhere between 2 and 8
seconds per URL (6 to 24 hours to complete) on average. That should not place
an unacceptable load on the Validator or on the sites checked, and you could
then either batch process the results afterwards or use some queuing scheme to
process the results in a separate thread/process.
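
As a rough illustration of the sequential approach, here is a minimal sketch
in Python. Everything in it is an assumption for the sake of the example:
`check` stands for whatever function you write to submit one URI to the
Validator and return its result, and the 2 second `min_interval` is just the
low end of the range above. It is not code from the Validator itself.

```python
import time
from typing import Callable, Iterable, Iterator, Tuple


def check_sequentially(
    uris: Iterable[str],
    check: Callable[[str], str],          # hypothetical: your own request code
    min_interval: float = 2.0,            # assumed lower bound, in seconds
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Iterator[Tuple[str, str]]:
    """Yield (uri, result) pairs, issuing one request at a time."""
    for uri in uris:
        started = clock()
        try:
            result = check(uri)
        except Exception as exc:
            # Record the failure and carry on with the remaining URIs.
            result = f"error: {exc}"
        yield uri, result
        # Pad fast responses so that consecutive requests never start
        # less than min_interval seconds apart.
        remaining = min_interval - (clock() - started)
        if remaining > 0:
            sleep(remaining)


if __name__ == "__main__":
    # Dry run with a stand-in check function and no real network traffic.
    sample = ["http://example.org/a", "http://example.org/b"]
    for uri, result in check_sequentially(sample, lambda u: "not checked",
                                          min_interval=0.1):
        print(uri, result)
```

Because the function yields results one by one, the caller can either collect
them all for batch processing afterwards or hand each one to a queue that a
separate thread or process drains.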

(but for god's sake test your code on a small subset of the URIs before
unleashing it on the world! ;D)


If you are looking to perform this operation multiple times for all 10k URLs
you should investigate installing the Validator on a local machine and/or
building your own front-end to OpenSP to perform the Validation. For the
latter the code for the Validator is available and much of it can be used
directly in a custom wrapper class.

Please feel free to ask on the list, or stop by #validator on
irc.freenode.net, if you need any help.



- -- 
I have to admit that I'm hoping the current situation with regard to XML
Namespaces and W3C XML Schemas is a giant practical joke,   but I see no
signs of pranksters coming forward with a gleeful smile to announce that
they were just kidding.                              -- Simon St.Laurent

-----BEGIN PGP SIGNATURE-----
Version: PGP SDK 3.0

iQA/AwUBPsplhjfq/Hm0uHR3EQKQmACgq6xPphJDKHSD18jzQzAZRAvwgZEAoOLt
SvH8MSklUzMG/K01drmUKxFA
=4kS7
-----END PGP SIGNATURE-----

Received on Tuesday, 20 May 2003 13:27:54 UTC