Re: Bug 85/4494 (keeping track of validation statistics for various purposes)

On Wed, Feb 27, 2008 at 10:09 PM, Brian Wilson <bloo@blooberry.com> wrote:
>
>  [this got lost in the shuffle, many sorries for the delay]
>
>  Nikita The Spider wrote:
>  > On Feb 6, 2008 12:17 PM, Brian Wilson <bloo@blooberry.com> wrote:
>  >> On Wed, 6 Feb 2008, olivier Thereaux wrote:
>  >>
>  >>> * stats on the documents themselves. Doctype, mime type, charset.
>  >>> Ideally, whether charset is in HTTP, XML decl, meta. There are
>  >>> existing studies about these, but another study made on a different
>  >>> sample would bring more perspective.
>  >
>  > Out of curiosity, where do you see these statistics being published?
>  > Time permitting, I'd be happy to contribute results from my validator.
>  > I've already been collecting statistics on robots.txt files (an
>  > obscure hobby to be sure).
>  >
>  > If anyone else is interested in the robots.txt files, the most recent
>  > data is here:
>  > http://NikitaTheSpider.com/articles/RobotsTxt2007.html
>
>  It will live somewhere on opera.com (I work in QA at Opera)
>
>  I found this data very interesting, but it might not intersect that well
>  with what I was looking at: I didn't respect robots.txt in my
>  crawling. [Maybe for that reason, the two studies complement each other
>  =)] Not consulting robots.txt was an omission on my part at first, but
>  when I considered the issue, I decided to keep using the process I
>  already had in place.
>
>  - The entire set of URLs was randomized, so the chance of violating a
>  robots.txt crawl delay was pretty low.
>
>  - The crawl used the DMoz URL set, with domain limiting (a cap of 30
>  URLs per domain). This would avoid hammering any server.
>
>  I'd love to discuss any potential cross-talk between these
>  studies further, though.

Brian,
You're welcome to copy the robots.txt articles and/or data. They're
under a Creative Commons license specified at the bottom of each
article. Obviously you can also just link to them. Is there
anything else you'd need?

With regard to the robots.txt and validation data, it is pretty easy
for me to gather statistics: I already have all of the data
(validation messages, doctypes, source of charset info, etc.) stored
in a database. Lately I've done a little Google advertising, and it has
been interesting to see the crowd it draws -- the incidence of pages
with no doctype has gone up, which suggests to me that the keywords
I'm using are finding newcomers to validation.
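
To give you an idea of what I mean by "easy": the doctype breakdown, for
instance, is just a single GROUP BY over the stored results. Roughly like
this (a sketch only; the table and column names below are made up for the
example rather than taken from my real schema):

import sqlite3

def doctype_breakdown(db_path):
    """Count validated pages per doctype, most common first."""
    # Illustrative: assumes a table named 'validation_results' with a
    # 'doctype' column, standing in for the real schema.
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute(
            "SELECT doctype, COUNT(*) AS pages "
            "FROM validation_results "
            "GROUP BY doctype "
            "ORDER BY pages DESC"
        ).fetchall()
    finally:
        conn.close()
    return rows

if __name__ == "__main__":
    for doctype, pages in doctype_breakdown("validator_stats.db"):
        print("%8d  %s" % (pages, doctype or "(no doctype)"))

The charset-source numbers would come out of the same sort of query, just
grouped on a different column.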

I still don't know how to reach my target market, which isn't really
novices (although they're welcome) but rather webmasters of large sites
who want their pages to be valid yet can't possibly hand-check them
all. Once I start attracting them, my statistics will change.

Cheers

-- 
Philip
http://NikitaTheSpider.com/
Whole-site HTML validation, link checking and more

Received on Saturday, 1 March 2008 15:36:00 UTC