Re: Vocabulary Usage on Web Pages - Analysis Results from Tom Morris on 2012-07-02 (public-vocabs@w3.org from July 2012)

From: Tom Morris <tfmorris@gmail.com>
Date: Mon, 2 Jul 2012 12:20:02 -0400
To: Dan Brickley <danbri@danbri.org>
Cc: Martin Hepp <martin.hepp@ebusiness-unibw.org>, Hannes Mühleisen <muehleis@inf.fu-berlin.de>, public-vocabs@w3.org
Message-ID: <CAE9vqEE0rUd+4TDQchYhnAFe3XXEm4q3wAX1rCegJo7VkNUWFw@mail.gmail.com>

On Mon, Jul 2, 2012 at 6:10 AM, Dan Brickley <danbri@danbri.org> wrote:
> On 2 July 2012 11:43, Martin Hepp <martin.hepp@ebusiness-unibw.org> wrote:
>> Dear Hannes:
>>
>> I would like to stress again what was discussed on various mailing lists in April 2012, i.e. that the data basis for webdatacommons.org is highly problematic, since the underlying CommonCrawl corpus does not include the majority of deep links into dynamic Web applications and thus misses the core of where RDFa and Microdata typically sits.
>>
>> See
>>
>> http://yfrog.com/h3z75np
>>
>> http://lists.w3.org/Archives/Public/public-lod/2012Apr/0117.html
>> http://lists.w3.org/Archives/Public/public-lod/2012Apr/0103.html
>
> Yes, this is a real issue - it's hard to interpret the counts without
> knowing which sites (and subsections of sites) are included, and what
> the selection criteria were. e.g. What do we know about the
> CommonCrawl crawling strategy?

Personally, I'd be a lot more impressed with additional data showing
just how popular these, supposedly underrepresented, vocabularies are
rather than just complaints about the sampling strategy.

Tom

Received on Monday, 2 July 2012 16:20:34 UTC