
Fwd: WebDataCommons releases 31.5 billion quads Microdata, Embedded JSON-LD, RDFa, and Microformat data originating from 9.6 million websites

From: Gray, Alasdair J G <A.J.G.Gray@hw.ac.uk>
Date: Tue, 22 Jan 2019 20:40:10 +0000
To: "public-bioschemas@w3.org" <public-bioschemas@w3.org>
Message-ID: <6C464F60-1C62-4809-A0DC-6F1A5F75BAC7@hw.ac.uk>
Of interest to the Bioschemas community.

Begin forwarded message:

From: Anna Primpeli <anna@informatik.uni-mannheim.de>
Subject: WebDataCommons releases 31.5 billion quads Microdata, Embedded JSON-LD, RDFa, and Microformat data originating from 9.6 million websites
Date: 17 January 2019 at 09:23:08 CET
To: <semantic-web@w3.org>, <public-schemaorg@w3.org>, <public-vocabs@w3.org>
Resent-From: <semantic-web@w3.org>

Hi all,
we are happy to announce the new release of the WebDataCommons Microdata, JSON-LD, RDFa and Microformat data corpus.
The data has been extracted from the November 2018 version of the Common Crawl, covering 2.5 billion HTML pages that originate from 32.8 million websites (pay-level domains).
In summary, we found structured data within 900 million HTML pages out of the 2.5 billion pages contained in the crawl (37.1%). These pages originate from 9.6 million different pay-level domains out of the 32.8 million pay-level domains covered by the crawl (29.3%).
Approximately 5.1 million of these websites use Microdata, 3.8 million websites use JSON-LD, and 1.3 million websites make use of RDFa. Microformats are used by more than 3.3 million websites within the crawl.


More and more websites annotate data describing, for instance, products, people, organizations, places, events, reviews, and cooking recipes within their HTML pages using markup formats such as Microdata, embedded JSON-LD, RDFa, and Microformats.

The WebDataCommons project extracts all Microdata, JSON-LD, RDFa, and Microformat data from the Common Crawl web corpus, the largest web corpus that is available to the public, and provides the extracted data for download. In addition, we publish statistics about the adoption of the different markup formats as well as the vocabularies that are used together with each format. We have run yearly extractions since 2012, and we provide the dataset series as well as the related statistics at:

Statistics about the November 2018 Release:
Basic statistics about the November 2018 Microdata, JSON-LD, RDFa, and Microformat data sets as well as the vocabularies that are used together with each markup format are found at:


Markup Format Adoption
The page below provides an overview of the increase in the adoption of the different markup formats as well as widely used schema.org classes from 2012 to 2018:
Comparing the statistics from the new 2018 release to the statistics about the November 2017 release of the data sets, we see that the adoption of structured data keeps increasing, while Microdata remains the dominant markup syntax. Differences in the crawling strategies that were used for the two crawls make it difficult to directly compare absolute as well as certain relative numbers. More concretely, we observe that the November 2018 Common Crawl corpus is shallower but wider, as fewer URLs from more PLDs are crawled compared to the November 2017 Common Crawl corpus. Nevertheless, it is clear that the growth rates of Microdata and embedded JSON-LD are much higher than that of RDFa. Comparing the number of PLDs per markup format for certain classes, we observe a tendency to prefer specific annotation formats in some domains. For example, for annotating data about organizations and persons, the JSON-LD format is more widely used, whereas the Microdata format is preferred for annotating product and event data.

Vocabulary Adoption
Concerning vocabulary adoption, schema.org, the vocabulary recommended by Google, Microsoft, Yahoo!, and Yandex, continues to be the most dominant in the context of Microdata: 75% of the webmasters use it, whereas its predecessor, the data-vocabulary, is used by only 13% of the websites containing Microdata. In the context of RDFa, the Open Graph Protocol recommended by Facebook remains the most widely used vocabulary. The file below analyzes the adoption of schema.org terms that have been newly introduced in the last two years. The file also provides statistics on how many websites use specific schema.org classes together with the JSON-LD and Microdata syntax.

The overall size of the November 2018 RDFa, Microdata, Embedded JSON-LD and Microformat data sets is 31.5 billion RDF quads. For download, we split the data into 7,263 files with a total size of 728 GB.
In addition, we have created separate files for over 40 different schema.org classes, each including all quads extracted from pages that use a specific schema.org class.
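
For readers who want to get a feel for the quad format before downloading anything, here is a minimal sketch, not taken from the announcement. It assumes that each line of a data file is one N-Quads statement whose fourth element (the graph) is the URL of the page the statement was extracted from. The helper names parse_quad and count_types_per_host and the sample URLs are hypothetical, and the regular expression covers only the common term shapes rather than the full N-Quads grammar.

```python
import re
from collections import Counter
from urllib.parse import urlparse

# Simplified term patterns; NOT a complete N-Quads grammar.
IRI = r"<[^>]*>"
BNODE = r"_:\S+"
LITERAL = r'"(?:[^"\\]|\\.)*"(?:\^\^<[^>]*>|@[A-Za-z0-9-]+)?'
QUAD = re.compile(
    rf"^\s*({IRI}|{BNODE})\s+({IRI})\s+({IRI}|{BNODE}|{LITERAL})\s+({IRI})\s*\.\s*$"
)

RDF_TYPE = "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>"


def parse_quad(line):
    """Return (subject, predicate, object, graph) or None if the line does not match."""
    match = QUAD.match(line)
    return match.groups() if match else None


def count_types_per_host(lines):
    """Count rdf:type statements per (host of the graph URL, class IRI)."""
    counts = Counter()
    for line in lines:
        quad = parse_quad(line)
        if quad is None:
            continue  # skip lines the simplified pattern cannot handle
        _subject, predicate, obj, graph = quad
        if predicate == RDF_TYPE and obj.startswith("<"):
            host = urlparse(graph[1:-1]).netloc
            counts[(host, obj[1:-1])] += 1
    return counts


# Hypothetical sample lines, made up for illustration.
sample = [
    '_:n1 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> '
    '<http://schema.org/Product> <http://shop.example.com/item/1> .',
    '_:n1 <http://schema.org/name> "Blue mug"@en <http://shop.example.com/item/1> .',
    '_:n2 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> '
    '<http://schema.org/Product> <http://shop.example.com/item/2> .',
    'not a quad',
]

if __name__ == "__main__":
    for key, n in count_types_per_host(sample).items():
        print(key, n)
```

Assuming the download files are gzip-compressed, a real file could be streamed line by line with gzip.open(path, "rt") and passed to count_types_per_host without loading it into memory. Note that this sketch groups by host name, which is coarser to compute but not the same thing as the pay-level domains used in the statistics above.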

Lots of thanks to:

+ the Common Crawl project for providing their great web crawl and thus enabling the WebDataCommons project.
+ the Any23 project for providing their great library of structured data parsers.
+ Amazon Web Services in Education Grant for supporting WebDataCommons.

Training Dataset and Gold Standard for Large-Scale Product Matching
As a side note on what else is happening in the Web Data Commons project around schema.org data: Using the November 2017 schema.org Product data corpus, we created a training dataset and gold standard for large-scale product matching. The training dataset consists of more than 26 million product offers originating from 79 thousand websites that use schema.org annotations. Using annotated identifiers such as MPNs and GTINs, we grouped the offers into 16 million clusters, with each cluster referring to the same real-world product. The gold standard consists of 2,000 pairs of offers which were manually verified as matches or non-matches. We provide the training dataset and gold standard for public download, hoping to contribute to improving the evaluation and comparison of different entity matching algorithms.
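
To illustrate the general idea of grouping offers by shared identifiers, here is a minimal sketch. It is not the procedure used to build the training dataset, which the announcement does not describe in detail. It assumes that two offers belong to the same cluster whenever they share at least one identifier value, and it merges them transitively with a union-find structure. The function name cluster_offers, the offer ids, and the identifier strings are all hypothetical.

```python
from collections import defaultdict


def cluster_offers(offers):
    """Group offers that share an identifier, transitively.

    offers: dict mapping an offer id to an iterable of identifier strings
            (for example "gtin:0001" or "mpn:X9"; the format is an assumption).
    Returns a sorted list of clusters, each a sorted list of offer ids.
    """
    parent = {offer_id: offer_id for offer_id in offers}

    def find(x):
        # Follow parent links to the root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Remember the first offer seen for each identifier and merge later ones into it.
    first_seen = {}
    for offer_id, identifiers in offers.items():
        for identifier in identifiers:
            if identifier in first_seen:
                union(first_seen[identifier], offer_id)
            else:
                first_seen[identifier] = offer_id

    clusters = defaultdict(set)
    for offer_id in offers:
        clusters[find(offer_id)].add(offer_id)
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical offers, made up for illustration.
offers = {
    "a": ["gtin:0001"],
    "b": ["gtin:0001", "mpn:X9"],
    "c": ["mpn:X9"],
    "d": ["gtin:0002"],
    "e": [],
}

if __name__ == "__main__":
    print(cluster_offers(offers))
```

In this toy input, offers a and c share no identifier directly but end up in the same cluster through b. Real identifier annotations are noisy, so a production pipeline would likely need normalization and filtering steps that this sketch leaves out.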

General Information about the WebDataCommons Project
The WebDataCommons project extracts structured data from the Common Crawl, the largest web corpus available to the public, and provides the extracted data for public download in order to support researchers and companies in exploiting the wealth of information that is available on the Web. Besides the yearly extractions of semantic annotations from webpages, the WebDataCommons project also provides large hyperlink graphs, the largest public corpus of web tables, two corpora of product data, as well as a collection of hypernyms extracted from billions of web pages for public download. General information about the WebDataCommons project is found at

Have fun with the new data set.


Anna Primpeli, Robert Meusel and Chris Bizer

Alasdair J G Gray
Associate Professor in Computer Science,
School of Mathematical and Computer Sciences
Heriot-Watt University, Edinburgh, UK.

Email: A.J.G.Gray@hw.ac.uk
Web: http://www.macs.hw.ac.uk/~ajg33
ORCID: http://orcid.org/0000-0002-5711-4872
Office: Earl Mountbatten Building 1.39
Twitter: @gray_alasdair

To arrange a meeting: http://doodle.com/ajggray


Received on Tuesday, 22 January 2019 20:40:39 UTC
