Re: Fwd: WebDataCommons releases 38.7 billion quads Microdata, Embedded JSON-LD, RDFa and Microformat data originating from 7.4 million pay-level-domains

Hi,

Thanks for forwarding, Alasdair; this looks really nice. The stats at
http://webdatacommons.org/structureddata/2017-12/stats/schema_org_subsets.html
are interesting for seeing the usage per class.

Is it worth looking at their code to see whether we can reuse anything?
They seem to be using standard tools such as Any23, and the bottom of 
the page above links to an SVN repository.

Cheers,
Melanie

On 12/01/2018 09:05, Gray, Alasdair J G wrote:
> An interesting analysis of the markup available on the common crawl.
>
> Alasdair
>
>> Begin forwarded message:
>>
>> *From: *Anna Primpeli <anna@informatik.uni-mannheim.de 
>> <mailto:anna@informatik.uni-mannheim.de>>
>> *Subject: **WebDataCommons releases 38.7 billion quads Microdata, 
>> Embedded JSON-LD, RDFa and Microformat data originating from 7.4 
>> million pay-level-domains*
>> *Date: *11 January 2018 at 09:35:20 GMT
>> *To: *<semantic-web@w3.org <mailto:semantic-web@w3.org>>, 
>> <public-schemaorg@w3.org <mailto:public-schemaorg@w3.org>>, 
>> <public-vocabs@w3.org <mailto:public-vocabs@w3.org>>
>> *Resent-From: *<semantic-web@w3.org <mailto:semantic-web@w3.org>>
>>
>> Hi All,
>>
>> we are happy to announce the new release of the WebDataCommons 
>> Microdata, JSON-LD, RDFa and Microformat data corpus.
>>
>> The data has been extracted from the November 2017 version of the 
>> Common Crawl covering 3.2 billion HTML pages which originate from 26 
>> million websites (pay-level domains).
>>
>> In summary, we found structured data within 1.2 billion HTML pages 
>> out of the 3.2 billion pages contained in the crawl (38.9%). These 
>> pages originate from 7.4 million different pay-level domains out of 
>> the 26 million pay-level-domains covered by the crawl (28.4%).
>>
>> Approximately 3.7 million of these websites use Microdata, 2.6 
>> million websites use JSON-LD, and 1.2 million websites make use of 
>> RDFa. Microformats are used by more than 3.3 million websites within 
>> the crawl.
>>
>> *Background:*
>>
>> More and more websites annotate data describing, for instance,
>> products, people, organizations, places, events, reviews, and
>> cooking recipes within their HTML pages using markup formats
>> such as Microdata, embedded JSON-LD, RDFa and Microformats.
>>
>> The WebDataCommons project extracts all Microdata, JSON-LD, RDFa, and 
>> Microformat data from the Common Crawl web corpus, the largest 
>> web corpus that is available to the public, and provides the 
>> extracted data for download. In addition, we publish statistics about 
>> the adoption of the different markup formats as well as 
>> the vocabularies that are used together with each format. We have run
>> yearly extractions since 2012 and provide the dataset series as
>> well as the related statistics at:
>>
>> http://webdatacommons.org/structureddata/
>>
>> *Statistics about the November 2017 Release:*
>>
>> Basic statistics about the November 2017 Microdata, JSON-LD, RDFa, 
>> and Microformat data sets as well as the vocabularies that are used 
>> together with each markup format are found at:
>>
>> http://webdatacommons.org/structureddata/2017-12/stats/stats.html
>>
>> *Markup Format Adoption*
>>
>> The page below provides an overview of the increase in the adoption
>> of the different markup formats as well as widely used schema.org
>> <http://schema.org/> classes from 2012 to 2017:
>>
>> http://webdatacommons.org/structureddata/#toc10
>>
>> Comparing the statistics from the new 2017 release to the statistics 
>> about the October 2016 release of the data sets
>>
>> http://webdatacommons.org/structureddata/2016-10/stats/stats.html
>>
>> we see that the adoption of structured data keeps on increasing while 
>> Microdata remains the most dominant markup syntax. The different 
>> nature of the crawling strategy that was used makes it hard to 
>> compare absolute as well as certain relative numbers between the two 
>> releases. More concretely, we observe that the November 2017 Common 
>> Crawl corpus is much deeper for certain domains like blogspot.com
>> <http://blogspot.com/> and wordpress.com <http://wordpress.com/> while
>> other domains are covered in a shallower way, with fewer URLs crawled 
>> in comparison to the October 2016 Common Crawl corpus. Nevertheless, 
>> it is clear that the growth rate of Microdata and Microformats is
>> much higher than that of RDFa and embedded JSON-LD. Although the
>> latter format is widespread, it is mainly used to annotate
>> metadata for search actions (80% of the domains using JSON-LD) while 
>> only a few domains use it for annotating content information such as 
>> Organizations (25% of the domains using JSON-LD), Persons (4% of the 
>> domains using JSON-LD) or Offers (0.1% of the domains using JSON-LD).
>>
>> *Vocabulary Adoption*
>>
>> Concerning the vocabulary adoption, schema.org <http://schema.org/>,
>> the vocabulary recommended by Google, Microsoft, Yahoo!, and Yandex,
>> continues to be the most dominant in the context of Microdata with 
>> 78% of the webmasters using it in comparison to its predecessor, the 
>> data-vocabulary, which is only used by 14% of the websites containing 
>> Microdata. In the context of RDFa, the Open Graph Protocol 
>> recommended by Facebook remains the most widely used vocabulary.
>>
>> *Parallel Usage of Multiple Formats*
>>
>> Analyzing topic-specific subsets, we discover some interesting 
>> trends. As observed in the previous extractions, content-related
>> information is mostly described either with the Microdata format or,
>> less frequently, with the JSON-LD format, in both cases using
>> the schema.org <http://schema.org/> vocabulary. However, we find
>> that 30% of the websites that use JSON-LD annotations to describe
>> product-related information make use of Microdata as well as JSON-LD
>> to cover the same topic. This is not the case for other topics, such
>> as Hotels or Job Postings, for which webmasters use only one format 
>> to annotate their content.
>>
>> *Richer Descriptions of Job Postings*
>>
>> Following the release of the “Google for Jobs” search vertical and 
>> the more detailed guidance by Google on how to annotate job postings 
>> (https://developers.google.com/search/docs/data-types/job-posting), we
>> see an increase in the number of websites annotating job postings 
>> (2017: 7,023, 2016: 6,352). In addition, the job posting annotations 
>> tend to become richer in comparison to the previous years as the 
>> number of Job Posting related properties adopted by at least 30% of 
>> the websites containing job offers has increased from 4 (2016) to 7 
>> (2017). The newly adopted properties are JobPosting/url, 
>> JobPosting/datePosted, and JobPosting/employmentType.
>>
>> You can find a more extended analysis concerning specific topics, 
>> like Job Posting and Product, here
>>
>> http://webdatacommons.org/structureddata/2017-12/stats/schema_org_subsets.html#extendedanalysis
>>
>> *Download*
>>
>> The overall size of the November 2017 RDFa, Microdata, Embedded 
>> JSON-LD and Microformat data sets is 38.7 billion RDF quads. For 
>> download, we split the data into 8,433 files with a total size of 858 GB.
>>
>> http://webdatacommons.org/structureddata/2017-12/stats/how_to_get_the_data.html
>>
>> In addition, we have created separate files for over 40 different
>> schema.org <http://schema.org/> classes, each including all quads
>> extracted from pages that use a specific schema.org
>> <http://schema.org/> class.
>>
>> http://webdatacommons.org/structureddata/2017-12/stats/schema_org_subsets.html
>>
>> *Lots of thanks to:*
>>
>> + the Common Crawl project for providing their great web crawl and 
>> thus enabling the WebDataCommons project.
>> + the Any23 project for providing their great library of structured 
>> data parsers.
>> + Amazon Web Services in Education Grant for supporting WebDataCommons.
>> + the Ministry of Economy, Research and Arts of Baden-Württemberg,
>> which supported the extraction and analysis of the November 2017
>> corpus through the ViCE project.
>>
>> *General Information about the WebDataCommons Project*
>>
>> The WebDataCommons project extracts structured data from the Common 
>> Crawl, the largest web corpus available to the public, and provides 
>> the extracted data for public download in order to support 
>> researchers and companies in exploiting the wealth of information 
>> that is available on the Web. Besides the yearly extractions of
>> semantic annotations from webpages, the WebDataCommons project also
>> provides large hyperlink graphs, the largest public corpus of 
>> WebTables, a corpus of product data, as well as a collection of 
>> hypernyms extracted from billions of web pages for public download. 
>> General information about the WebDataCommons project is found at
>>
>> http://webdatacommons.org/
>>
>>
>> Have fun with the new data set.
>>
>> Cheers,
>> Anna Primpeli, Robert Meusel and Chris Bizer
>>
>
> Alasdair J G Gray
> Fellow of the Higher Education Academy
> Assistant Professor in Computer Science,
> School of Mathematical and Computer Sciences
> (Athena SWAN Bronze Award)
> Heriot-Watt University, Edinburgh UK.
>
> Email: A.J.G.Gray@hw.ac.uk <mailto:A.J.G.Gray@hw.ac.uk>
> Web: http://www.macs.hw.ac.uk/~ajg33 <http://www.macs.hw.ac.uk/%7Eajg33>
> ORCID: http://orcid.org/0000-0002-5711-4872
> Office: Earl Mountbatten Building 1.39
> Twitter: @gray_alasdair
>
>

Received on Monday, 15 January 2018 10:26:14 UTC