
Re: Vocabulary Usage on Web Pages - Analysis Results

From: Tom Morris <tfmorris@gmail.com>
Date: Mon, 2 Jul 2012 12:20:02 -0400
Message-ID: <CAE9vqEE0rUd+4TDQchYhnAFe3XXEm4q3wAX1rCegJo7VkNUWFw@mail.gmail.com>
To: Dan Brickley <danbri@danbri.org>
Cc: Martin Hepp <martin.hepp@ebusiness-unibw.org>, Hannes Mühleisen <muehleis@inf.fu-berlin.de>, public-vocabs@w3.org
On Mon, Jul 2, 2012 at 6:10 AM, Dan Brickley <danbri@danbri.org> wrote:
> On 2 July 2012 11:43, Martin Hepp <martin.hepp@ebusiness-unibw.org> wrote:
>> Dear Hannes:
>>
>> I would like to stress again what was discussed on various mailing lists in April 2012, i.e. that the data basis for webdatacommons.org is highly problematic, since the underlying CommonCrawl corpus does not include the majority of deep links into dynamic Web applications and thus misses the core of where RDFa and Microdata typically sit.
>>
>> See
>>
>> http://yfrog.com/h3z75np
>>
>> http://lists.w3.org/Archives/Public/public-lod/2012Apr/0117.html
>> http://lists.w3.org/Archives/Public/public-lod/2012Apr/0103.html
>
> Yes, this is a real issue - it's hard to interpret the counts without
> knowing which sites (and subsections of sites) are included, and what
> the selection criteria were. e.g. What do we know about the
> CommonCrawl crawling strategy?

Personally, I'd be a lot more impressed with additional data showing
just how popular these supposedly underrepresented vocabularies are,
rather than just complaints about the sampling strategy.

Tom
Received on Monday, 2 July 2012 16:20:34 GMT
