Re: Web Dev Data from Marcos Caceres on 2013-05-17 (public-closingthegap@w3.org from May 2013)

From: Marcos Caceres <marcos@marcosc.com>
Date: Fri, 17 May 2013 17:57:22 +0100
To: Alex Russell <slightlyoff@google.com>
Cc: Marcos Caceres <w3c@marcosc.com>, Steve Faulkner <faulkner.steve@gmail.com>, "public-closingthegap@w3.org" <public-closingthegap@w3.org>, "dom@w3.org" <dom@w3.org>, Pieters Simon <simonp@opera.com>, "yoav@yoav.ws" <yoav@yoav.ws>
Message-ID: <E5BCF627A59E4568A0636EDE7861AFDA@marcosc.com>

On Friday, 17 May 2013 at 15:45, Alex Russell wrote:

> Have we reached out to them to ask how we might collaborate on a simpler interface to their already-rich data set?

No. Of course not! We were too busy duplicating data that already exists :)
> I'm all for simpler tools that fill the "I need to quickly scan N URLs" gap -- I've written such things myself in the past, but it feels like it would perhaps be even easier if we could just ask an existing corpus.

Yeah, why not? It's worth a shot.   
> On Friday, May 17, 2013, Marcos Caceres wrote:
> > 
> > 
> > On Friday, May 17, 2013, Steve Faulkner wrote:
> > > Hi Alex,
> > > 
> > > My immediate thought when looking at both of those URLs is that as an ordinary person I don't know where to start with either of those. What I want (which is why I did it myself) is access to a manageable set of data that I can use to study HTML feature usage. What I would like is to easily query up-to-date data on usage of feature x without having to be a rocket scientist to do it. 
> > 
> > 
> > Agree. I have gone to both of those sites previously and have no idea how to do a query. The data sets seem impossibly large, which makes me wonder when the law of diminishing returns kicks in on those sets (vs our top 50k set, which is large enough to be statistically significant and hopefully representative enough without being too western-biased). 
> > 
> > So the value proposition: We want to make something that makes sense for the spec community and is easy to search and use (or even d/l for local searching)... A caniuse.com (http://caniuse.com) for tags, attributes, and HTTP headers, etc. 
> > 
> > The other guys don't provide that. 
> > 
> > > 
> > > --
> > > 
> > > Regards
> > > 
> > > SteveF
> > > HTML 5.1 (http://www.w3.org/html/wg/drafts/html/master/)
> > > 
> > > 
> > > On 17 May 2013 14:31, Alex Russell <slightlyoff@google.com> wrote:
> > > > What's the core value proposition for this work vs. http://commoncrawl.org/ and http://www.webdatacommons.org/ ?
> > > > 
> > > > 
> > > > On Friday, May 17, 2013, Marcos Caceres wrote:
> > > > > A few of us have a small project (http://webdevdata.org/) that we've been using to inform the development of specifications over the last few months. It actually started with Steve's research into <main>, for which he used some software to crawl a large number of sites, and then grep'd that data to get stats that helped support his argument for <main>.
> > > > > 
> > > > > This data set has become increasingly useful to a number of people: the RICG has been making extensive use of it, and so have some members of the HTMLWG (e.g., [1]).
> > > > > 
> > > > > Anyway, as the headlights activity has the potential to result in the allocation of resources for projects, I think it would be good if webdevdata.org (http://webdevdata.org) could be considered as something that can help "close the gap" (in that it provides data to help us make informed technical decisions about the platform).
> > > > > 
> > > > > What we would like to see:
> > > > > 
> > > > > * monthly or quarterly crawls.
> > > > > * hosting and archiving of the data.
> > > > > * the ability to search the index through the web.
> > > > > * the ability to download the data.
> > > > > 
> > > > > Maybe the W3C could speak to its members in the academic sector for help with different ways of searching the data and performing statistical analysis on it (in a way that helps both Web developers and spec folks).
> > > > > 
> > > > > [1] https://www.w3.org/Bugs/Public/show_bug.cgi?id=19619#c21
> > > > > --
> > > > > Marcos Caceres
> > > > 
> > > 
> > 
>

Received on Friday, 17 May 2013 16:57:57 UTC