WebDataCommons releases 44.2 billion quads Microdata, Embedded JSON-LD, RDFa and Microformat data originating from 5.6 million pay-level-domains from Anna Primpeli on 2017-01-20 (semantic-web@w3.org from January 2017)

From: Anna Primpeli <anna@informatik.uni-mannheim.de>
Date: Fri, 20 Jan 2017 13:23:46 +0100
To: <semantic-web@w3.org>, <public-vocabs@w3.org>
Message-ID: <008701d27318$0be65720$23b30560$@informatik.uni-mannheim.de>
Hi all,

we are happy to announce a new release of the WebDataCommons Microdata,
Embedded JSON-LD, RDFa and Microformat data corpus.

The data has been extracted from the October 2016 version of the CommonCrawl
covering 3.2 billion HTML pages which originate from 34 million websites
(pay-level domains).

Altogether we discovered structured data within 1.2 billion HTML pages out
of the 3.2 billion pages contained in the crawl (38%). These pages originate
from 5.6 million different pay-level domains out of the 34 million pay-level
domains covered by the crawl (16.5%).

Approximately 2.5 million of these websites use Microdata, 2.1 million
websites employ JSON-LD, and 938 thousand websites use RDFa. Microformats
are used by over 1.6 million websites within the crawl.
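
As a quick sanity check, the shares above can be recomputed from the rounded
figures quoted in this announcement. The sketch below only uses those rounded
numbers (the variable and function names are illustrative), so the results
are approximate and may differ slightly from the exact statistics pages.

```python
# Rounded figures quoted in the announcement (websites = pay-level domains).
TOTAL_PAGES = 3.2e9
PAGES_WITH_DATA = 1.2e9
TOTAL_PLDS = 34e6
PLDS_WITH_DATA = 5.6e6

PLDS_BY_FORMAT = {
    "Microdata": 2.5e6,
    "JSON-LD": 2.1e6,
    "RDFa": 938e3,
    "Microformats": 1.6e6,  # announced as "over 1.6 million"
}


def share(part, whole):
    """Return part/whole as a percentage, rounded to one decimal place."""
    return round(100.0 * part / whole, 1)


page_share = share(PAGES_WITH_DATA, TOTAL_PAGES)  # share of pages with data
pld_share = share(PLDS_WITH_DATA, TOTAL_PLDS)     # share of domains with data

# Share of each format among the domains that contain any markup data.
format_shares = {
    name: share(count, PLDS_WITH_DATA)
    for name, count in PLDS_BY_FORMAT.items()
}

if __name__ == "__main__":
    print(f"pages with structured data:  {page_share}%")
    print(f"domains with structured data: {pld_share}%")
    for name, value in format_shares.items():
        print(f"  {name}: {value}% of domains with markup")
```

The per-format counts add up to more than 5.6 million, so some websites
necessarily use more than one markup format.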

 

Background: 

More and more websites annotate structured data within their HTML pages
using markup formats such as RDFa, Microdata, embedded JSON-LD and
Microformats. The annotations cover topics such as products, reviews,
people, organizations, places, events, and cooking recipes.

The WebDataCommons project extracts all Microdata, RDFa data, and
Microformat data, and since 2015 also embedded JSON-LD data from the Common
Crawl web corpus, the largest and most up-to-date web corpus that is
available to the public, and provides the extracted data for download. In
addition, we publish statistics about the adoption of the different markup
formats as well as the vocabularies that are used together with each format.


Besides the markup data, the WebDataCommons project also provides large web
table corpora and web graphs for download. General information about the
WebDataCommons project can be found at

http://webdatacommons.org/


Data Set Statistics: 

Basic statistics about the October 2016 Microdata, Embedded JSON-LD, RDFa
and Microformat data sets, as well as the vocabularies that are used together
with each markup format, can be found at:

http://webdatacommons.org/structureddata/2016-10/stats/stats.html

Comparing these statistics with those of the November 2015 release of the
data sets

http://webdatacommons.org/structureddata/2015-11/stats/stats.html

we see that Microdata remains the dominant annotation format. It is hard to
compare the adoption of the syntax between the two years in absolute
numbers, as the October 2016 crawl corpus is almost double the size of the
November 2015 one, but a relative increase can be observed: in the October
2016 corpus over 44% of the pay-level domains containing markup data make
use of the Microdata syntax, compared to 40% one year earlier. Although the
absolute number of websites using RDFa has risen, the increase does not keep
pace with the growth of the corpus, indicating that the relative adoption of
RDFa is declining. As in the 2015 release, the adoption of embedded JSON-LD
has increased considerably, although the main focus of the annotations
remains the search action offered by the websites (70%).

As already observed in previous years, the schema.org vocabulary
(http://schema.org/) is most frequently used in the context of Microdata,
while the adoption of its predecessor, the data vocabulary, continues to
decrease. In the context of RDFa, we still find the Open Graph Protocol
recommended by Facebook to be the most widely used vocabulary.

Topic-wise, the trends identified in the former extractions continue. We see
that besides navigational, blog and CMS related meta-information, many
websites annotate e-commerce related data (Products, Offers, and Reviews) as
well as contact information (LocalBusiness, Organization, PostalAddress).
More concretely, the October 2016 corpus includes more than 682 million
product records originating from 249 thousand websites which use the
schema.org vocabulary. The new release contains postal address data for more
than 291 million entities originating from 338 thousand websites.
Furthermore, the content describing hotels has doubled in size in this
release, with a total of 61 million hotel descriptions.

Visualizations of the main adoption trends concerning the different
annotation formats as well as popular schema.org and RDFa classes within the
time span 2012 to 2016 can be found at

http://webdatacommons.org/structureddata/#toc8

 

Download:

The overall size of the October 2016 Microdata, RDFa, Embedded JSON-LD, and
Microformat data sets is 44.2 billion RDF quads. For download, we split the
data into 9,661 files with a total size of 987 GB.

http://webdatacommons.org/structureddata/2016-10/stats/how_to_get_the_data.html
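
For readers who want a first look at a downloaded file, here is a minimal
sketch that counts quads per host of the page they were extracted from. It
assumes that each file is gzip-compressed N-Quads with one quad per line and
the page URL as graph name; the function names are illustrative and not part
of any WebDataCommons tooling. The line splitting is deliberately naive and
is not a full N-Quads parser, and a host name is not the same as a pay-level
domain (mapping one to the other needs a public suffix list).

```python
import gzip
from collections import Counter
from urllib.parse import urlsplit


def parse_quad(line):
    """Split one N-Quads line into (subject, predicate, object, graph).

    Returns None for blank lines, comments, and lines without a graph IRI.
    Naive sketch: relies on the graph IRI being the last token before " .".
    """
    line = line.strip()
    if not line or line.startswith("#") or not line.endswith("."):
        return None
    body = line[:-1].rstrip()
    try:
        subject, predicate, rest = body.split(None, 2)
        obj, graph = rest.rsplit(None, 1)
    except ValueError:
        return None
    if not (graph.startswith("<") and graph.endswith(">")):
        return None
    return subject, predicate, obj, graph[1:-1]


def count_quads_per_host(lines):
    """Count quads per host name of the page (graph) they came from."""
    counts = Counter()
    for line in lines:
        quad = parse_quad(line)
        if quad is None:
            continue
        try:
            host = urlsplit(quad[3]).hostname
        except ValueError:  # malformed URL in the graph position
            continue
        if host:
            counts[host] += 1
    return counts


def read_lines(path):
    """Stream text lines from a gzip-compressed file (assumed layout)."""
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as handle:
        yield from handle
```

With a local file, `count_quads_per_host(read_lines(path)).most_common(10)`
would list the ten hosts contributing the most quads in that file.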

In addition, we have created separate files for over 40 different schema.org
classes. Each file includes all quads from the pages that use the specific
class at least once.

http://webdatacommons.org/structureddata/2016-10/stats/schema_org_subsets.html
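
To illustrate the selection rule for these class-specific files, the sketch
below keeps every quad of a page as soon as that page types at least one
entity with the given class. It is only an illustration of the rule as
described above, not the code used to build the subsets; the tuple layout
(subject, predicate, object, page URL) and the function name are assumptions,
and holding all quads in memory would not scale to the full corpus.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def class_subset(quads, class_iri):
    """Return all quads from pages that use class_iri at least once.

    quads: iterable of (subject, predicate, object, page_url) string tuples.
    """
    quads = list(quads)
    # First pass: find the pages that type some entity with the class.
    pages = {
        page
        for (_subject, predicate, obj, page) in quads
        if predicate == RDF_TYPE and obj == class_iri
    }
    # Second pass: keep every quad of those pages, not only the typed entity.
    return [quad for quad in quads if quad[3] in pages]


if __name__ == "__main__":
    example = [
        ("_:a", RDF_TYPE, "http://schema.org/Product", "http://shop.example.com/p/1"),
        ("_:a", "http://schema.org/name", "Red shoe", "http://shop.example.com/p/1"),
        ("_:b", RDF_TYPE, "http://schema.org/Hotel", "http://hotels.example.org/a"),
    ]
    for quad in class_subset(example, "http://schema.org/Product"):
        print(quad)
```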

 

Lots of thanks to: 

+ the Common Crawl project for providing their great web crawl and thus
enabling the WebDataCommons project. 
+ the Any23 project for providing their great library of structured data
parsers. 
+ Amazon Web Services in Education Grant for supporting WebDataCommons. 
+ the Ministry of Economy, Research and Arts of Baden-Württemberg, which
supported the extraction and analysis of the October 2016 corpus by means of
the ViCe project.


Have fun with the new data set. 

Cheers, 
Anna Primpeli, Robert Meusel, and Chris Bizer
Received on Friday, 20 January 2017 12:24:17 UTC