- From: Dan Brickley <danbri@danbri.org>
- Date: Mon, 2 Jul 2012 12:10:50 +0200
- To: Martin Hepp <martin.hepp@ebusiness-unibw.org>
- Cc: Hannes Mühleisen <muehleis@inf.fu-berlin.de>, public-vocabs@w3.org
On 2 July 2012 11:43, Martin Hepp <martin.hepp@ebusiness-unibw.org> wrote:

> Dear Hannes:
>
> I would like to stress again what was discussed on various mailing lists in April 2012, i.e. that the data basis for webdatacommons.org is highly problematic, since the underlying CommonCrawl corpus does not include the majority of deep links into dynamic Web applications and thus misses the core of where RDFa and Microdata typically sit.
>
> See
>
> http://yfrog.com/h3z75np
>
> http://lists.w3.org/Archives/Public/public-lod/2012Apr/0117.html
> http://lists.w3.org/Archives/Public/public-lod/2012Apr/0103.html

Yes, this is a real issue - it's hard to interpret the counts without knowing which sites (and subsections of sites) are included, and what the selection criteria were. For example, what do we know about the CommonCrawl crawling strategy?

On the other hand, some members of the Semantic Web community have been known to speak out against over-enthusiastic crawlers, and to argue "there is rarely a need to crawl the complete dataset" since the same data is available via SPARQL -
http://lists.w3.org/Archives/Public/public-lod/2010Jun/0117.html

By the same logic, there is also a lot of important data available only in non-HTML RDF notations such as Turtle and RDF/XML (or, as you say, via SPARQL). For some purposes these aren't relevant to studies of in-HTML structured data, but care needs to be taken when generalising from the HTML-extracted data to the broader landscape.

cheers,

Dan
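[A minimal sketch of that last point, not part of the original thread: it assumes the rdflib and SPARQLWrapper libraries and uses DBpedia's public endpoint and an invented example resource purely for illustration. It shows why data published as plain Turtle, or reachable only via a SPARQL endpoint, never appears in a pipeline that extracts RDFa/Microdata from crawled HTML.]

```python
# Illustrative sketch: the same kind of structured data can live outside HTML
# entirely, so HTML-extraction statistics cannot see it.
# The endpoint URL, query, and example resource below are assumptions for
# demonstration only.
from rdflib import Graph
from SPARQLWrapper import SPARQLWrapper, JSON

# 1) Data published as plain Turtle - no HTML page (and hence no RDFa or
#    Microdata) is involved at all.
turtle_doc = """
@prefix schema: <http://schema.org/> .
<http://example.org/product/42> a schema:Product ;
    schema:name "Example widget" .
"""
g = Graph()
g.parse(data=turtle_doc, format="turtle")
print(len(g), "triples parsed from Turtle")  # 2

# 2) Data reachable only through a SPARQL endpoint (DBpedia used here as a
#    stand-in public endpoint).
sparql = SPARQLWrapper("http://dbpedia.org/sparql")
sparql.setQuery("""
    SELECT (COUNT(*) AS ?n) WHERE { ?s a <http://schema.org/Product> }
""")
sparql.setReturnFormat(JSON)
results = sparql.query().convert()
print("products visible via SPARQL:",
      results["results"]["bindings"][0]["n"]["value"])
```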
Received on Monday, 2 July 2012 10:11:22 UTC