Re: Vocabulary Usage on Web Pages - Analysis Results

On 2 July 2012 11:43, Martin Hepp <martin.hepp@ebusiness-unibw.org> wrote:
> Dear Hannes:
>
> I would like to stress again what was discussed on various mailing lists in April 2012, i.e. that the data basis for webdatacommons.org is highly problematic, since the underlying CommonCrawl corpus does not include the majority of deep links into dynamic Web applications and thus misses the core of where RDFa and Microdata typically sits.
>
> See
>
> http://yfrog.com/h3z75np
>
> http://lists.w3.org/Archives/Public/public-lod/2012Apr/0117.html
> http://lists.w3.org/Archives/Public/public-lod/2012Apr/0103.html

Yes, this is a real issue - it's hard to interpret the counts without
knowing which sites (and which subsections of those sites) are
included, and what the selection criteria were. For example, what do
we know about the CommonCrawl crawling strategy?

On the other hand, some members of the Semantic Web community have
been known to speak out against over-enthusiastic crawlers, and to
argue "there is rarely a need to crawl the complete dataset" since the
same data is available via SPARQL -
http://lists.w3.org/Archives/Public/public-lod/2010Jun/0117.html

By the same logic, there is also a lot of important data available
only in non-HTML RDF notations - Turtle, RDF/XML, etc. (or, as you
say, via SPARQL). For some purposes those sources aren't relevant to
studies of in-HTML structured data, but care needs to be taken when
generalising from the HTML-extracted figures to the broader landscape.
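
To make that concrete, here's a rough Python/rdflib sketch (the sample
Turtle, URIs and names are purely illustrative, not real data) of counting
vocabulary usage over plain Turtle - i.e. the kind of data an HTML-focused
extraction pipeline would never see:

  from collections import Counter
  from rdflib import Graph
  from rdflib.namespace import split_uri

  # Illustrative Turtle snippet: data published only in a non-HTML
  # notation, which an HTML-only extractor would never encounter.
  turtle_data = """
  @prefix gr:   <http://purl.org/goodrelations/v1#> .
  @prefix foaf: <http://xmlns.com/foaf/0.1/> .

  <http://example.org/offer/1> a gr:Offering ;
      gr:name "Example offering" ;
      foaf:page <http://example.org/shop> .
  """

  g = Graph()
  g.parse(data=turtle_data, format="turtle")

  # Tally triples per predicate namespace - the same sort of
  # vocabulary-usage statistic under discussion, but computed
  # over non-HTML RDF rather than HTML-embedded markup.
  usage = Counter()
  for s, p, o in g:
      namespace, _local = split_uri(p)
      usage[namespace] += 1

  for namespace, count in usage.most_common():
      print(namespace, count)
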

cheers,

Dan
