Re: Web Dev Data

Have we reached out to them to ask how we might collaborate on a simpler
interface to their already-rich data set?

I'm all for simpler tools that fill the "I need to quickly scan N URLs" gap
-- I've written such things myself in the past, but it feels like it would
perhaps be even easier if we could just ask an existing corpus.
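For the record, the kind of "scan N pages for a feature" tool I mean is only a few lines. Here is a minimal, hypothetical sketch (Python standard library only; the names and the inline sample are mine, not part of webdevdata.org or any existing tool) that counts start tags of a given element in already-fetched HTML, which is the core of the grep-style studies Steve describes:

```python
# Minimal sketch of a "scan pages for a feature" tool: count how often
# a given element appears in HTML documents that have already been
# fetched (e.g. from a crawl dump). Names here are illustrative.
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Counts start tags matching one element name in a document."""

    def __init__(self, tag):
        super().__init__()
        self.tag = tag.lower()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports tag names already lowercased.
        if tag == self.tag:
            self.count += 1


def count_tag(html, tag):
    """Return how many <tag> start tags occur in an HTML string."""
    parser = TagCounter(tag)
    parser.feed(html)
    return parser.count


if __name__ == "__main__":
    # In practice you would loop over files from a crawl; a tiny
    # inline sample keeps the sketch self-contained.
    sample = "<body><MAIN><p>hi</p></MAIN></body>"
    print(count_tag(sample, "main"))  # 1
```

The point being: per-page scanning is trivial; the hard part (and the value a shared corpus provides) is acquiring, hosting, and refreshing the pages themselves.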

On Friday, May 17, 2013, Marcos Caceres wrote:

>
>
> On Friday, May 17, 2013, Steve Faulkner wrote:
>
>> Hi Alex,
>>
>> My immediate thought when looking at both of those URLs is that, as an
>> ordinary person, I don't know where to start with either of them. What I
>> wanted (which is why I did it myself) was access to a manageable set of
>> data that I can use to study HTML feature usage. What I would like is to
>> easily query up-to-date data on the usage of feature X without having to
>> be a rocket scientist to do it.
>>
>
>
> Agree. I have gone to both of those sites previously and have no idea how
> to do a query. The data sets seem impossibly large, which makes me wonder
> when the law of diminishing returns kicks in on those sets (vs. our top 50k
> set, which is large enough to be statistically significant and hopefully
> representative enough without being too Western-biased).
>
> So the value proposition: We want to make something that makes sense for
> the spec community and is easy to search and use (or even d/l for local
> searching)... A caniuse.com for tags, attributes, and HTTP headers, etc.
>
> The other guys don't provide that.
>
>
>>
>> --
>>
>> Regards
>>
>> SteveF
>> HTML 5.1 <http://www.w3.org/html/wg/drafts/html/master/>
>>
>>
>> On 17 May 2013 14:31, Alex Russell <slightlyoff@google.com> wrote:
>>
>>> What's the core value proposition for this work vs.
>>> http://commoncrawl.org/ and http://www.webdatacommons.org/ ?
>>>
>>>
>>> On Friday, May 17, 2013, Marcos Caceres wrote:
>>>
>>>> A few of us have a small project (http://webdevdata.org/) that we've
>>>> been using to inform the development of specifications over the last few
>>>> months. It actually started with Steve's research into <main>, for which he
>>>> used some software to crawl a large number of sites, and then grep'd that
>>>> data to get stats that helped support his argument for <main>.
>>>>
>>>> This data set has become increasingly useful to a number of people (the
>>>> RICG has been making extensive use of it), and so have some members of the
>>>> HTMLWG (e.g., [1]).
>>>>
>>>> Anyway, as the Headlights activity has the potential to result in the
>>>> allocation of resources for projects, I think it would be good if
>>>> webdevdata.org could be considered as something that can help "close
>>>> the gap" (in that it provides data to help us make informed technical
>>>> decisions about the platform).
>>>>
>>>> What we would like to see:
>>>>
>>>> * monthly or quarterly crawls.
>>>> * hosting and archiving of the data.
>>>> * the ability to search the index through the web.
>>>> * the ability to download the data.
>>>>
>>>> Maybe the W3C could speak to its members in the academic sector for
>>>> help with different ways of searching the data and performing statistical
>>>> analysis on it (in a way that helps both Web developers and spec folks).
>>>>
>>>> [1] https://www.w3.org/Bugs/Public/show_bug.cgi?id=19619#c21
>>>> --
>>>> Marcos Caceres
>>>>
>>>>
>>>>
>>>>
>>

Received on Friday, 17 May 2013 14:46:07 UTC