Re: Vocabulary Usage on Web Pages - Analysis Results

Dear Martin,

On 02.07.2012, at 11:43, Martin Hepp wrote:
> I would like to stress again what was discussed on various mailing lists in April 2012, i.e. that the data basis for webdatacommons.org is highly problematic, since the underlying CommonCrawl corpus does not include the majority of deep links into dynamic Web applications and thus misses the core of where RDFa and Microdata typically sits.

As previously stated, we are aware of your viewpoint and the issues you mention. However, we maintain that the Common Crawl data set is so far the best available source for studying Web developments, mainly because of its size: around five percent of the indexed Web. Of course, any selection mechanism that reduces the size of the resulting data set is problematic. As already explained, PageRank is used to determine which domains are crawled more thoroughly. We consider this compromise acceptable for the time being. For example, we found 133,541 URLs using the GoodRelations vocabulary across 1,638 domains, roughly 82 URLs per domain on average, which indicates that at least some domains are indeed crawled deeply.

We welcome further analysis of vocabulary usage, regardless of the data corpus used, and are always eager to compare and discuss results.

Best,

Hannes

Received on Monday, 2 July 2012 10:06:11 UTC