FYI Re: WebDataCommons releases 38.7 billion quads Microdata, Embedded JSON-LD, RDFa and Microformat data originating from 7.4 million pay-level-domains

On 1/11/18 4:35 AM, Anna Primpeli wrote:
>
> Hi All,
>
> we are happy to announce the new release of the WebDataCommons
> Microdata, JSON-LD, RDFa and Microformat data corpus.
>
> The data has been extracted from the November 2017 version of the
> Common Crawl covering 3.2 billion HTML pages which originate from 26
> million websites (pay-level domains).
>
> In summary, we found structured data within 1.2 billion HTML pages out
> of the 3.2 billion pages contained in the crawl (38.9%). These pages
> originate from 7.4 million different pay-level domains out of the 26
> million pay-level-domains covered by the crawl (28.4%).
>
> Approximately 3.7 million of these websites use Microdata, 2.6 million
> websites use JSON-LD, and 1.2 million websites make use of RDFa.
> Microformats are used by more than 3.3 million websites within the crawl.
>
> *Background:* 
>
> More and more websites annotate data describing, for instance,
> products, people, organizations, places, events, reviews, and cooking
> recipes within their HTML pages using markup formats such as
> Microdata, embedded JSON-LD, RDFa and Microformats.
>
> The WebDataCommons project extracts all Microdata, JSON-LD, RDFa, and
> Microformat data from the Common Crawl web corpus, the largest
> web corpus that is available to the public, and provides the extracted
> data for download. In addition, we publish statistics about the
> adoption of the different markup formats as well as the vocabularies
> that are used together with each format. We have run yearly
> extractions since 2012 and provide the dataset series as well as the
> related statistics at:
>
> http://webdatacommons.org/structureddata/
>
> *Statistics about the November 2017 Release:*
>
> Basic statistics about the November 2017 Microdata, JSON-LD, RDFa, and
> Microformat data sets as well as the vocabularies that are used
> together with each markup format are found at: 
>
> http://webdatacommons.org/structureddata/2017-12/stats/stats.html
>
> *Markup Format Adoption*
>
> The page below provides an overview of the increase in the adoption of
> the different markup formats as well as widely used schema.org classes
> from 2012 to 2017:
>
> http://webdatacommons.org/structureddata/#toc10
>
> Comparing the statistics from the new 2017 release to the statistics
> about the October 2016 release of the data sets
>
> http://webdatacommons.org/structureddata/2016-10/stats/stats.html
>
> we see that the adoption of structured data keeps on increasing while
> Microdata remains the most dominant markup syntax. The different
> nature of the crawling strategy that was used makes it hard to compare
> absolute as well as certain relative numbers between the two releases.
> More concretely, we observe that the November 2017 Common Crawl corpus
> is much deeper for certain domains like blogspot.com and wordpress.com
> while other domains are covered in a shallower way, with fewer URLs
> crawled in comparison to the October 2016 Common Crawl corpus.
> Nevertheless, it is clear that the growth rate of Microdata and
> Microformats is much higher than that of RDFa and embedded
> JSON-LD. Although the latter format is widespread, it is mainly
> used to annotate metadata for search actions (80% of the domains using
> JSON-LD) while only a few domains use it for annotating content
> information such as Organizations (25% of the domains using JSON-LD),
> Persons (4% of the domains using JSON-LD) or Offers (0.1% of the
> domains using JSON-LD).
>
> *Vocabulary Adoption*
>
> Concerning vocabulary adoption, schema.org, the vocabulary
> recommended by Google, Microsoft, Yahoo!, and Yandex, continues to be
> the most dominant in the context of Microdata, with 78% of the
> webmasters using it, whereas its predecessor, data-vocabulary, is
> used by only 14% of the websites containing Microdata. In the
> context of RDFa, the Open Graph Protocol recommended by Facebook
> remains the most widely used vocabulary.
>
> *Parallel Usage of Multiple Formats*
>
> Analyzing topic-specific subsets, we discover some interesting trends.
> As observed in the previous extractions, content-related information
> is mostly described either with the Microdata format or, less
> frequently, with the JSON-LD format, in both cases using the
> schema.org vocabulary. However, we find that 30% of the websites that
> use JSON-LD annotations to describe product-related information use
> Microdata as well as JSON-LD to cover the same topic. This is not
> the case for other topics, such as Hotels or Job Postings, for which
> webmasters use only one format to annotate their content.
>
> *Richer Descriptions of Job Postings*
>
> Following the release of the “Google for Jobs” search vertical and the
> more detailed guidance by Google on how to annotate job postings
> (https://developers.google.com/search/docs/data-types/job-posting), we
> see an increase in the number of websites annotating job postings
> (2017: 7,023; 2016: 6,352). In addition, job posting annotations
> have become richer than in previous years: the number of
> JobPosting-related properties adopted by at least 30% of the
> websites containing job offers has increased from 4 (2016) to 7
> (2017). The newly adopted properties are JobPosting/url,
> JobPosting/datePosted, and JobPosting/employmentType.
>
> You can find a more extended analysis concerning specific topics, like
> Job Posting and Product, here:
>
> http://webdatacommons.org/structureddata/2017-12/stats/schema_org_subsets.html#extendedanalysis
>
> *Download*
>
> The overall size of the November 2017 RDFa, Microdata, Embedded
> JSON-LD and Microformat data sets is 38.7 billion RDF quads. For
> download, we split the data into 8,433 files with a total size of 858 GB.
>
> http://webdatacommons.org/structureddata/2017-12/stats/how_to_get_the_data.html
>
> In addition, we have created separate files for over 40 different
> schema.org classes, each including all quads extracted from pages
> that use the specific schema.org class.
>
> http://webdatacommons.org/structureddata/2017-12/stats/schema_org_subsets.html
>
> *Lots of thanks to:* 
>
> + the Common Crawl project for providing their great web crawl and
> thus enabling the WebDataCommons project. 
> + the Any23 project for providing their great library of structured
> data parsers. 
> + Amazon Web Services in Education Grant for supporting WebDataCommons. 
> + the Ministry of Economy, Research and Arts of Baden-Württemberg,
> which supported the extraction and analysis of the November 2017
> corpus through the ViCE project.
>
> *General Information about the WebDataCommons Project*
>
> The WebDataCommons project extracts structured data from the Common
> Crawl, the largest web corpus available to the public, and provides
> the extracted data for public download in order to support researchers
> and companies in exploiting the wealth of information that is
> available on the Web. Besides the yearly extractions of semantic
> annotations from webpages, the WebDataCommons project also provides
> large hyperlink graphs, the largest public corpus of WebTables, a
> corpus of product data, as well as a collection of hypernyms extracted
> from billions of web pages for public download. General information
> about the WebDataCommons project is found at 
>
> http://webdatacommons.org/
>
>
> Have fun with the new data set. 
>
> Cheers, 
> Anna Primpeli, Robert Meusel and Chris Bizer
>
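
A quick sanity check on the two headline ratios, as a minimal Python
sketch. It uses only the rounded counts quoted above (3.2 billion and
1.2 billion pages, 26 million and 7.4 million pay-level domains); the
helper name `share` is my own. The rounded inputs give 37.5% and 28.5%,
so the reported 38.9% and 28.4% were presumably computed from the exact,
unrounded counts, which the announcement does not list.

```python
# Rounded figures as quoted in the announcement (not the exact counts).
PAGES_TOTAL = 3.2e9        # HTML pages in the November 2017 crawl
PAGES_WITH_DATA = 1.2e9    # pages containing structured data
PLDS_TOTAL = 26e6          # pay-level domains covered by the crawl
PLDS_WITH_DATA = 7.4e6     # pay-level domains with structured data


def share(part, whole):
    """Return part/whole as a percentage, rounded to one decimal place."""
    return round(100.0 * part / whole, 1)


print(share(PAGES_WITH_DATA, PAGES_TOTAL))  # 37.5 (announcement: 38.9)
print(share(PLDS_WITH_DATA, PLDS_TOTAL))    # 28.5 (announcement: 28.4)
```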

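For anyone planning to work with the download, here is a minimal Python
sketch of tallying quads per host. It assumes the files are
gzip-compressed N-Quads in which the fourth term of every line is the
IRI of the page the data was extracted from. The function names, the
sample lines and the example hosts are illustrative and not taken from
the WebDataCommons documentation. Because the graph IRI is the last term
and IRIs contain no whitespace, splitting once from the right avoids
having to parse literals. A plain triple whose object is an IRI would be
miscounted, so this only suits input where every line is a quad.

```python
import gzip
from collections import Counter
from urllib.parse import urlsplit


def graph_host(line):
    """Return the host of the last term of an N-Quads line, or None."""
    line = line.strip()
    if not line or line.startswith("#") or not line.endswith("."):
        return None
    parts = line[:-1].rsplit(None, 1)  # split off the last term only
    if not parts:
        return None
    last = parts[-1]
    if not (last.startswith("<") and last.endswith(">")):
        return None  # last term is not an IRI, so no usable graph label
    return urlsplit(last[1:-1]).hostname


def count_quads_per_host(lines):
    """Count quads per host of the graph IRI over an iterable of lines."""
    counts = Counter()
    for line in lines:
        host = graph_host(line)
        if host is not None:
            counts[host] += 1
    return counts


def count_file(path):
    """Count quads per host in one gzip-compressed N-Quads file."""
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as fh:
        return count_quads_per_host(fh)


# Made-up sample quads (hypothetical hosts) to show the expected shape.
sample = [
    '_:b0 <http://schema.org/name> "Acme Widget" <http://shop.example.com/p/1> .',
    '_:b0 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> '
    '<http://schema.org/Product> <http://shop.example.com/p/1> .',
    '_:b1 <http://schema.org/name> "Jane" <http://blog.example.org/about> .',
]
print(count_quads_per_host(sample).most_common())
# [('shop.example.com', 2), ('blog.example.org', 1)]
```

Note that this counts hosts, not pay-level domains; mapping hosts to
pay-level domains would additionally need a public suffix list.
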
-- 
Regards,

Kingsley Idehen	      
Founder & CEO 
OpenLink Software   (Home Page: http://www.openlinksw.com)

Received on Thursday, 11 January 2018 18:17:20 UTC