Re: Web Dev Data

On Friday, May 17, 2013, Steve Faulkner wrote:

> Hi Alex,
>
> My immediate thought when looking at both of those URLs is that, as an
> ordinary person, I don't know where to start with either of them. What I
> want (which is why I did it myself) is access to a manageable set of data
> that I can use to study HTML feature usage. I would like to be able to
> easily query up-to-date data on usage of feature x without having to be a
> rocket scientist to do it.
>


Agreed. I have visited both of those sites before and still have no idea how
to run a query. The data sets seem impossibly large, which makes me wonder
when the law of diminishing returns kicks in on those sets (vs. our top 50k
set, which is large enough to be statistically significant and hopefully
representative enough without being too Western-biased).

So the value proposition: we want to make something that makes sense for
the spec community and is easy to search and use (or even download for local
searching): a caniuse.com for tags, attributes, HTTP headers, etc.

Neither of the other two projects provides that.


>
> --
>
> Regards
>
> SteveF
> HTML 5.1 <http://www.w3.org/html/wg/drafts/html/master/>
>
>
> On 17 May 2013 14:31, Alex Russell <slightlyoff@google.com> wrote:
>
>> What's the core value proposition for this work vs.
>> http://commoncrawl.org/ and http://www.webdatacommons.org/ ?
>>
>>
>> On Friday, May 17, 2013, Marcos Caceres wrote:
>>
>>> A few of us have a small project (http://webdevdata.org/) that we've
>>> been using to inform the development of specifications over the last few
>>> months. It actually started with Steve's research into <main>, for which he
>>> used some software to crawl a large number of sites, and then grep'd that
>>> data to get stats that helped support his argument for <main>.
>>>
>>> This data set has become increasingly useful to a number of people: the
>>> RICG has been making extensive use of it, as have some members of the
>>> HTMLWG (e.g., [1]).
>>>
>>> Anyway, as the headlights activity has the potential to result in the
>>> allocation of resources for projects, I think it would be good if
>>> webdevdata.org could be considered as something that can help "close
>>> the gap" (in that it provides data to help us make informed technical
>>> decisions about the platform).
>>>
>>> What we would like to see:
>>>
>>> * monthly or quarterly crawls.
>>> * hosting and archiving of the data.
>>> * the ability to search the index through the web.
>>> * the ability to download the data.
>>>
>>> Maybe the W3C could ask its members in the academic sector for help
>>> with different ways of searching the data and performing statistical
>>> analysis on it (in a way that helps both Web developers and spec folks).
>>>
>>> [1] https://www.w3.org/Bugs/Public/show_bug.cgi?id=19619#c21
>>> --
>>> Marcos Caceres
>>>
>>>
>>>
>>>
>

Received on Friday, 17 May 2013 14:22:53 UTC