Re: Vocabulary Usage on Web Pages - Analysis Results

Dear Hannes:

I would like to stress again what was discussed on various mailing lists in April 2012: the data basis for webdatacommons.org is highly problematic, because the underlying CommonCrawl corpus does not include the majority of deep links into dynamic Web applications and thus misses the core of where RDFa and Microdata markup typically sits.

See

http://yfrog.com/h3z75np

http://lists.w3.org/Archives/Public/public-lod/2012Apr/0117.html
http://lists.w3.org/Archives/Public/public-lod/2012Apr/0103.html

Best

Martin

On Jul 2, 2012, at 9:19 AM, Hannes Mühleisen wrote:

> Hello Vocabulary Enthusiasts,
> 
> we have recently completed a study on vocabulary usage on Web pages using the Microdata and RDFa encodings. We analyzed the usage frequencies of vocabularies, classes, and properties, as well as property co-occurrence, for two web crawls. These crawls contained 93 million URLs with data using both encodings from 2012, and 14 million URLs from 2009/2010. The results are available at http://webdatacommons.org/vocabulary-usage-analysis/index.html .
> 
> We hope our findings are useful in giving some insight into which vocabularies (or parts thereof) are used to annotate entities within HTML pages.
> 
> Regards,
> 
> Hannes Mühleisen
> 

--------------------------------------------------------
martin hepp
e-business & web science research group
universitaet der bundeswehr muenchen

e-mail:  hepp@ebusiness-unibw.org
phone:   +49-(0)89-6004-4217
fax:     +49-(0)89-6004-4620
www:     http://www.unibw.de/ebusiness/ (group)
         http://www.heppnetz.de/ (personal)
skype:   mfhepp 
twitter: mfhepp

Check out GoodRelations for E-Commerce on the Web of Linked Data!
=================================================================
* Project Main Page: http://purl.org/goodrelations/

Received on Monday, 2 July 2012 09:43:35 UTC