WebDataCommons releases 31.5 billion quads of Microdata, Embedded JSON-LD, RDFa, and Microformat data originating from 9.6 million websites

Hi all,

we are happy to announce the new release of the WebDataCommons Microdata,
JSON-LD, RDFa and Microformat data corpus.

The data has been extracted from the November 2018 version of the Common
Crawl covering 2.5 billion HTML pages which originate from 32.8 million
websites (pay-level domains).

In summary, we found structured data within 900 million HTML pages out of
the 2.5 billion pages contained in the crawl (37.1%). These pages originate
from 9.6 million different pay-level domains out of the 32.8 million
pay-level domains covered by the crawl (29.3%).

Approximately 5.1 million of these websites use Microdata, 3.8 million
websites use JSON-LD, and 1.3 million websites make use of RDFa.
Microformats are used by more than 3.3 million websites within the crawl.
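
As a quick sanity check on the figures above, the following Python sketch
recomputes the share of pay-level domains (PLDs) with structured data and the
share of those PLDs that use each format. It relies only on the rounded
numbers quoted in this mail, so the results are approximate; the variable and
function names are ours and purely illustrative.

```python
# Rounded counts (in millions of pay-level domains) quoted in this announcement.
PLDS_IN_CRAWL = 32.8
PLDS_WITH_STRUCTURED_DATA = 9.6
PLDS_PER_FORMAT = {
    "Microdata": 5.1,
    "JSON-LD": 3.8,
    "RDFa": 1.3,
    "Microformats": 3.3,  # quoted as "more than 3.3 million"
}


def percent(part, whole):
    """Return part/whole expressed as a percentage."""
    return 100.0 * part / whole


# Share of all crawled PLDs that contain any structured data.
coverage = percent(PLDS_WITH_STRUCTURED_DATA, PLDS_IN_CRAWL)

# Share of the PLDs with structured data that use each markup format.
format_shares = {
    name: percent(count, PLDS_WITH_STRUCTURED_DATA)
    for name, count in PLDS_PER_FORMAT.items()
}

if __name__ == "__main__":
    print(f"PLDs with structured data: {coverage:.1f}% of the crawl")
    for name, share in format_shares.items():
        print(f"{name}: {share:.1f}% of PLDs with structured data")
```

Note that the per-format counts add up to more than 9.6 million, which
suggests that many websites use more than one markup format.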

 

Background: 

More and more websites annotate data describing, for instance, products,
people, organizations, places, events, reviews, and cooking recipes within
their HTML pages using markup formats such as Microdata, embedded JSON-LD,
RDFa, and Microformats.

The WebDataCommons project extracts all Microdata, JSON-LD, RDFa, and
Microformat data from the Common Crawl web corpus, the largest web corpus
that is available to the public, and provides the extracted data for
download. In addition, we publish statistics about the adoption of the
different markup formats as well as the vocabularies that are used together
with each format. We have run yearly extractions since 2012 and provide the
dataset series as well as the related statistics at:

http://webdatacommons.org/structureddata/

 

Statistics about the November 2018 Release:

Basic statistics about the November 2018 Microdata, JSON-LD, RDFa, and
Microformat data sets as well as the vocabularies that are used together
with each markup format are found at: 

http://webdatacommons.org/structureddata/2018-12/stats/stats.html

 

Markup Format Adoption

The page below provides an overview of the increase in the adoption of the
different markup formats as well as widely used schema.org classes from 2012
to 2018:

http://webdatacommons.org/structureddata/#toc3 

Comparing the statistics from the new 2018 release to the statistics about
the November 2017 release of the data sets

http://webdatacommons.org/structureddata/2017-12/stats/stats.html

we see that the adoption of structured data continues to increase while
Microdata remains the dominant markup syntax. Differences in the
crawling strategies that were used for the two crawls make it difficult to
directly compare absolute as well as certain relative numbers. More
concretely, we observe that the November 2018 Common Crawl corpus is
shallower but wider, as fewer URLs from more PLDs are crawled compared to
the November 2017 Common Crawl corpus. Nevertheless, it is clear that the
growth rates of Microdata and embedded JSON-LD are much higher than that
of RDFa. Comparing the number of PLDs per markup format for certain classes,
we observe a tendency to prefer specific annotation formats for some
topical domains. For example, for annotating data about organizations and
persons, the JSON-LD format is more widely used, whereas the Microdata
format is preferred for annotating product and event data.

 

Vocabulary Adoption

Concerning vocabulary adoption, schema.org, the vocabulary recommended
by Google, Microsoft, Yahoo!, and Yandex, continues to be the dominant
vocabulary in the context of Microdata: 75% of the webmasters use it,
whereas its predecessor, data-vocabulary, is used by only 13% of the
websites containing Microdata. In the context of RDFa, the Open
Graph Protocol recommended by Facebook remains the most widely used
vocabulary. The file below analyzes the adoption of schema.org terms that
have been newly introduced in the last two years. The file also provides
statistics on how many websites use specific schema.org classes together
with the JSON-LD and Microdata syntax. 

http://webdatacommons.org/structureddata/2018-12/stats/md-jsonld-comparison.xlsx

 

Download 

The overall size of the November 2018 RDFa, Microdata, Embedded JSON-LD and
Microformat data sets is 31.5 billion RDF quads. For download, we split the
data into 7,263 files with a total size of 728 GB.

http://webdatacommons.org/structureddata/2018-12/stats/how_to_get_the_data.html
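
For readers who want to take a first look at a downloaded file, here is a
minimal Python sketch. It assumes (this is not stated in this mail) that each
file is a gzip-compressed N-Quads file with one quad per line, and that the
fourth term of each quad is the URL of the page the data was extracted from.
The regular expression is a simplification rather than a full N-Quads parser,
and all function names are hypothetical.

```python
import gzip
import re
from collections import Counter
from urllib.parse import urlparse

# One N-Quads term: an IRI, a blank node, or a literal with an optional
# language tag or datatype. Simplified; not the complete N-Quads grammar.
_TERM = r'(<[^>]*>|_:\S+|"(?:[^"\\]|\\.)*"(?:@[A-Za-z0-9-]+|\^\^<[^>]*>)?)'
_QUAD = re.compile(r"^\s*" + r"\s+".join([_TERM] * 4) + r"\s*\.\s*$")


def parse_quad(line):
    """Return (subject, predicate, object, graph), or None if the line does not match."""
    match = _QUAD.match(line)
    return match.groups() if match else None


def count_quads_per_host(lines):
    """Count quads per host name of the graph IRI (assumed to be the page URL)."""
    counts = Counter()
    for line in lines:
        quad = parse_quad(line)
        if quad is None:
            continue  # skip comments, blank lines, and lines this sketch cannot parse
        graph_iri = quad[3].strip("<>")
        counts[urlparse(graph_iri).hostname] += 1
    return counts


def count_quads_in_file(path):
    """Stream a gzip-compressed N-Quads file line by line instead of loading it whole."""
    with gzip.open(path, mode="rt", encoding="utf-8", errors="replace") as handle:
        return count_quads_per_host(handle)
```

Calling count_quads_in_file() on the path of one downloaded file would return
a Counter mapping host names to quad counts. Note that a host name is not the
same as a pay-level domain; mapping hosts to PLDs would additionally require a
public suffix list.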

In addition, we have created separate files for over 40 different
schema.org classes. Each file includes all quads extracted from pages
that use the specific schema.org class.

http://webdatacommons.org/structureddata/2018-12/stats/schema_org_subsets.html
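
To illustrate what such a class-specific subset contains, the following Python
sketch selects all quads from pages that contain at least one entity typed
with a given class. It is only an illustration of the idea, not the procedure
used to build the published files. It assumes quads are already available as
(subject, predicate, object, graph) tuples with IRIs in angle brackets and the
page URL as graph; the function name is hypothetical.

```python
RDF_TYPE = "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>"


def quads_for_class(quads, class_iri):
    """Keep every quad from pages (graphs) that use class_iri at least once."""
    quads = list(quads)  # materialized because the data is scanned twice
    target = f"<{class_iri}>"
    # First pass: find the pages containing an rdf:type statement for the class.
    pages = {g for (s, p, o, g) in quads if p == RDF_TYPE and o == target}
    # Second pass: keep all quads of those pages, not just the typed entities.
    return [quad for quad in quads if quad[3] in pages]


if __name__ == "__main__":
    example = [
        ("_:b0", RDF_TYPE, "<http://schema.org/Product>", "<http://example.com/p1>"),
        ("_:b0", "<http://schema.org/name>", '"Widget"', "<http://example.com/p1>"),
        ("_:b1", RDF_TYPE, "<http://schema.org/Event>", "<http://example.com/p2>"),
    ]
    for quad in quads_for_class(example, "http://schema.org/Product"):
        print(quad)
```

For files that do not fit into memory, the same two passes could be run as
two streaming reads over the file instead of over an in-memory list.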

Many thanks to:

+ the Common Crawl project for providing their great web crawl and thus
enabling the WebDataCommons project. 
+ the Any23 project for providing their great library of structured data
parsers. 
+ the Amazon Web Services in Education Grant for supporting WebDataCommons.

Training Dataset and Gold Standard for Large-Scale Product Matching

As a side note on what else is happening in the Web Data Commons project
around schema.org data: Using the November 2017 schema.org Product data
corpus, we created a training dataset and gold standard for large-scale
product matching. The training dataset consists of more than 26 million
product offers originating from 79 thousand websites that use schema.org
annotations. Using annotated identifiers such as MPNs and GTINs, we grouped
the offers into 16 million clusters, with each cluster referring to the same
real-world product. The gold standard consists of 2,000 pairs of offers which
were manually verified as matches or non-matches. We provide the training
dataset and gold standard for public download, hoping to contribute to
improving the evaluation and comparison of different entity matching
algorithms.

http://webdatacommons.org/largescaleproductcorpus/index.html
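
The general idea of grouping offers by shared identifiers can be sketched in
Python as follows. This is a simplified illustration and not necessarily the
procedure used to build the corpus: it assumes each offer comes with a set of
already normalized identifier strings and puts two offers into the same
cluster whenever they are connected through a shared identifier (using a
union-find structure). The function name and the input format are
hypothetical.

```python
from collections import defaultdict


def cluster_offers(offers):
    """Group offers that are linked by shared identifiers.

    offers: dict mapping an offer id to a set of identifier strings.
    Returns a list of sets of offer ids, one set per cluster.
    """
    parent = {offer_id: offer_id for offer_id in offers}

    def find(x):
        # Follow parent links to the cluster root, shortening the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Link each offer to the first offer seen with the same identifier.
    first_offer_with = {}
    for offer_id, identifiers in offers.items():
        for identifier in identifiers:
            if identifier in first_offer_with:
                union(first_offer_with[identifier], offer_id)
            else:
                first_offer_with[identifier] = offer_id

    clusters = defaultdict(set)
    for offer_id in offers:
        clusters[find(offer_id)].add(offer_id)
    return list(clusters.values())
```

In this sketch, offers without any identifier end up in singleton clusters,
and a single wrongly annotated identifier can merge two unrelated clusters, so
a real pipeline would likely need additional cleansing steps.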

 

General Information about the WebDataCommons Project

The WebDataCommons project extracts structured data from the Common Crawl,
the largest web corpus available to the public, and provides the extracted
data for public download in order to support researchers and companies in
exploiting the wealth of information that is available on the Web. Besides
the yearly extractions of semantic annotations from webpages, the
WebDataCommons project also provides large hyperlink graphs, the largest
public corpus of web tables, two corpora of product data, as well as a
collection of hypernyms extracted from billions of web pages for public
download. General information about the WebDataCommons project is found at 

http://webdatacommons.org/


Have fun with the new data set. 

Cheers, 


Anna Primpeli, Robert Meusel and Chris Bizer

 

Received on Thursday, 17 January 2019 08:23:37 UTC