- From: Brian Wilson <bloo@blooberry.com>
- Date: Thu, 28 Feb 2008 04:09:38 +0100
- To: Nikita The Spider The Spider <nikitathespider@gmail.com>
- CC: www-validator@w3.org
[this got lost in the shuffle, many sorries for the delay]

Nikita The Spider The Spider wrote:
> On Feb 6, 2008 12:17 PM, Brian Wilson <bloo@blooberry.com> wrote:
>> On Wed, 6 Feb 2008, olivier Thereaux wrote:
>>
>>> * stats on the documents themselves. Doctype, mime type, charset.
>>> Ideally, whether charset is in HTTP, XML decl, meta. There are
>>> existing studies about these, but another study made on a different
>>> sample would bring more perspective.
>
> Out of curiosity, where do you see these statistics being published?
> Time permitting, I'd be happy to contribute results from my validator.
> I've already been collecting statistics on robots.txt files (an
> obscure hobby, to be sure).
>
> If anyone else is interested in the robots.txt files, the most recent
> data is here:
> http://NikitaTheSpider.com/articles/RobotsTxt2007.html

It will live somewhere on opera.com (I work in QA at Opera).

I found this data very interesting, but it may not intersect all that
well with what I was looking at... actually, I didn't respect
robots.txt in my crawling. [maybe for that reason, the two studies
complement each other =)]

Not consulting robots.txt was an omission on my part at first, but when
I considered the issue, I decided to keep the process I already had in
place:

- The entire set of URLs was randomized, so the chance of violating a
  robots.txt crawl delay was pretty low.
- The crawl used the DMoz URL set with domain limiting (a cap of 30
  URLs per domain), which avoided hammering any one server.

I'd love to discuss any potential cross-talk between these studies,
though.

-Brian
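As a rough illustration of the charset bookkeeping olivier proposes
above (recording whether a document's charset comes from the HTTP
header, the XML declaration, or a meta tag), here is a minimal Python
sketch. The function name and regexes are assumptions for illustration,
not code from either study; real detection would also need to apply the
HTML/XML precedence rules between the three sources.

    # Minimal sketch: classify where a document declares its charset.
    # Regexes are simplified for illustration; production code would
    # need fuller parsing and the HTML/XML precedence rules.
    import re

    def charset_sources(content_type_header, body):
        """Return {source: charset} for each place a charset appears."""
        found = {}

        # 1. HTTP header: Content-Type: text/html; charset=utf-8
        m = re.search(r'charset=([^;\s]+)', content_type_header or '', re.I)
        if m:
            found['http'] = m.group(1)

        # 2. XML declaration: <?xml version="1.0" encoding="utf-8"?>
        m = re.match(r'<\?xml[^>]*encoding=["\']([^"\']+)', body)
        if m:
            found['xml-decl'] = m.group(1)

        # 3. meta tag, either http-equiv Content-Type or <meta charset=...>
        m = re.search(r'<meta[^>]*charset=["\']?([\w.-]+)', body, re.I)
        if m:
            found['meta'] = m.group(1)

        return found

    # Example:
    print(charset_sources(
        'text/html; charset=ISO-8859-1',
        '<?xml version="1.0" encoding="utf-8"?><html>...</html>'))
    # -> {'http': 'ISO-8859-1', 'xml-decl': 'utf-8'}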
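The randomize-and-cap process Brian describes is easy to picture in
code. Below is a minimal sketch under assumed names (build_crawl_queue,
MAX_URLS_PER_DOMAIN); it is not the actual Opera crawler, just the two
politeness measures from the list above.

    # Minimal sketch of the two politeness measures described above:
    # cap URLs per domain, then shuffle so no single server sees a
    # burst of requests. Names and structure are illustrative only.
    import random
    from collections import defaultdict
    from urllib.parse import urlsplit

    MAX_URLS_PER_DOMAIN = 30  # the cap mentioned above

    def build_crawl_queue(urls):
        """Keep at most MAX_URLS_PER_DOMAIN URLs per domain, shuffled."""
        per_domain = defaultdict(list)
        for url in urls:
            domain = urlsplit(url).netloc.lower()
            if len(per_domain[domain]) < MAX_URLS_PER_DOMAIN:
                per_domain[domain].append(url)

        queue = [u for bucket in per_domain.values() for u in bucket]
        random.shuffle(queue)  # spreads each domain's URLs over the crawl
        return queue

With a DMoz-sized list spanning a very large number of domains,
shuffling makes consecutive hits to the same host rare, which is the
basis of the claim that a crawl-delay violation was unlikely even
without consulting robots.txt.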
Received on Thursday, 28 February 2008 03:10:10 UTC