- From: Tom Morris <tfmorris@gmail.com>
- Date: Mon, 2 Jul 2012 12:20:02 -0400
- To: Dan Brickley <danbri@danbri.org>
- Cc: Martin Hepp <martin.hepp@ebusiness-unibw.org>, Hannes Mühleisen <muehleis@inf.fu-berlin.de>, public-vocabs@w3.org
On Mon, Jul 2, 2012 at 6:10 AM, Dan Brickley <danbri@danbri.org> wrote:
> On 2 July 2012 11:43, Martin Hepp <martin.hepp@ebusiness-unibw.org> wrote:
>> Dear Hannes:
>>
>> I would like to stress again what was discussed on various mailing lists in April 2012, i.e. that the data basis for webdatacommons.org is highly problematic, since the underlying CommonCrawl corpus does not include the majority of deep links into dynamic Web applications and thus misses the core of where RDFa and Microdata typically sits.
>>
>> See
>>
>> http://yfrog.com/h3z75np
>>
>> http://lists.w3.org/Archives/Public/public-lod/2012Apr/0117.html
>> http://lists.w3.org/Archives/Public/public-lod/2012Apr/0103.html
>
> Yes, this is a real issue - it's hard to interpret the counts without
> knowing which sites (and subsections of sites) are included, and what
> the selection criteria were. e.g. What do we know about the
> CommonCrawl crawling strategy?

Personally, I'd be a lot more impressed with additional data showing just how popular these supposedly underrepresented vocabularies are, rather than just complaints about the sampling strategy.

Tom
Received on Monday, 2 July 2012 16:20:34 UTC