Re: Bug 85/4494 (keeping track of validation statistics for various purposes)

[this got lost in the shuffle; my apologies for the delay]

Nikita The Spider wrote:
> On Feb 6, 2008 12:17 PM, Brian Wilson <bloo@blooberry.com> wrote:
>> On Wed, 6 Feb 2008, olivier Thereaux wrote:
>>
>>> * stats on the documents themselves. Doctype, mime type, charset.
>>> Ideally, whether charset is in HTTP, XML decl, meta. There are
>>> existing studies about these, but another study made on a different
>>> sample would bring more perspective.
> 
> Out of curiosity, where do you see these statistics being published?
> Time permitting, I'd be happy to contribute results from my validator.
> I've already been collecting statistics on robots.txt files (an
> obscure hobby to be sure).
> 
> If anyone else is interested in the robots.txt files, the most recent
> data is here:
> http://NikitaTheSpider.com/articles/RobotsTxt2007.html
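
On the charset point Olivier raises above: for the kind of per-document 
stats he describes, something like the rough Python sketch below could 
tally where each document declares its charset (HTTP header, XML 
declaration, or meta element). This is purely illustrative stdlib code, 
not anyone's actual validator, and the regexes are deliberately naive:

    import re
    import urllib.request

    def charset_sources(url):
        """Return each place the document at `url` declares a charset."""
        found = {}
        with urllib.request.urlopen(url, timeout=10) as resp:
            # 1. HTTP: Content-Type header, e.g. "text/html; charset=utf-8"
            http_charset = resp.headers.get_content_charset()
            if http_charset:
                found["http"] = http_charset
            # Only the first few KB are needed to find the declarations.
            head = resp.read(4096).decode("latin-1", "replace")
        # 2. XML declaration: <?xml version="1.0" encoding="utf-8"?>
        m = re.search(r'<\?xml[^>]*\bencoding=["\']([^"\']+)', head, re.I)
        if m:
            found["xml_decl"] = m.group(1)
        # 3. meta element: <meta charset=...> or http-equiv/content=...charset=...
        m = re.search(r'<meta[^>]*\bcharset=["\']?([\w.:-]+)', head, re.I)
        if m:
            found["meta"] = m.group(1)
        return found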

The stats will live somewhere on opera.com (I work in QA at Opera).

I found the robots.txt data very interesting, but it may not intersect 
that well with what I was looking at: I didn't respect robots.txt in my 
crawling. [Maybe for that reason, the two studies complement each other 
=)] Not consulting robots.txt was an oversight on my part at first, but 
when I considered the issue, I decided to keep the process I already had 
in place, for two reasons:

- The entire set of URLs was randomized, so the chance of violating a 
robots.txt crawl delay was pretty low.

- The crawl used the DMoz URL set with domain limiting (a cap of 30 
URLs per domain), which avoided hammering any single server. A rough 
sketch of this sampling follows below.
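
For concreteness, the domain cap plus randomization amounts to 
something like this (illustrative Python, not the actual crawler code):

    import random
    from collections import defaultdict
    from urllib.parse import urlsplit

    def sample_urls(urls, per_domain_cap=30):
        """Keep at most `per_domain_cap` URLs per host, then shuffle
        so consecutive requests rarely hit the same server."""
        buckets = defaultdict(list)
        for url in urls:
            host = urlsplit(url).hostname or ""
            if len(buckets[host]) < per_domain_cap:
                buckets[host].append(url)
        sample = [u for bucket in buckets.values() for u in bucket]
        random.shuffle(sample)
        return sample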

I'd love to discuss any potential overlap between the two studies, 
though.
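
For what it's worth, if a future run of either crawl did want to honour 
robots.txt, the check itself is cheap. A minimal sketch using Python's 
stdlib parser (the bot name and URLs are placeholders):

    from urllib import robotparser

    def allowed(agent, url, robots_url):
        """Consult robots.txt before fetching; returns (ok, crawl_delay)."""
        rp = robotparser.RobotFileParser()
        rp.set_url(robots_url)
        rp.read()  # fetches and parses robots.txt
        # crawl_delay() returns None when no Crawl-delay line is given
        return rp.can_fetch(agent, url), rp.crawl_delay(agent)

    ok, delay = allowed("ExampleBot/1.0", "http://example.com/page",
                        "http://example.com/robots.txt")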

-Brian
