- From: Kingsley Idehen <kidehen@openlinksw.com>
- Date: Thu, 11 Jan 2018 13:16:44 -0500
- To: public-vocabs@w3.org, "public-rww@w3.org" <public-rww@w3.org>
- Cc: Virtuoso-users <Virtuoso-users@lists.sourceforge.net>, "dbpedia-discussion@lists.sourceforge.net" <dbpedia-discussion@lists.sourceforge.net>
- Message-ID: <a3628350-b08d-9bd0-1ae8-1dbd5241721a@openlinksw.com>
On 1/11/18 4:35 AM, Anna Primpeli wrote:
>
> Hi All,
>
> we are happy to announce the new release of the WebDataCommons
> Microdata, JSON-LD, RDFa and Microformat data corpus.
>
> The data has been extracted from the November 2017 version of the
> Common Crawl covering 3.2 billion HTML pages which originate from 26
> million websites (pay-level domains).
>
> In summary, we found structured data within 1.2 billion HTML pages out
> of the 3.2 billion pages contained in the crawl (38.9%). These pages
> originate from 7.4 million different pay-level domains out of the 26
> million pay-level domains covered by the crawl (28.4%).
>
> Approximately 3.7 million of these websites use Microdata, 2.6 million
> websites use JSON-LD, and 1.2 million websites make use of RDFa.
> Microformats are used by more than 3.3 million websites within the crawl.
>
> *Background:*
>
> More and more websites annotate data describing for instance products,
> people, organizations, places, events, reviews, and cooking recipes
> within their HTML pages using markup formats such as Microdata,
> embedded JSON-LD, RDFa and Microformat.
>
> The WebDataCommons project extracts all Microdata, JSON-LD, RDFa, and
> Microformat data from the Common Crawl web corpus, the largest
> web corpus that is available to the public, and provides the extracted
> data for download. In addition, we publish statistics about the
> adoption of the different markup formats as well as the vocabularies
> that are used together with each format.
> We have run yearly extractions since 2012 and we provide the dataset
> series as well as the related statistics at:
>
> http://webdatacommons.org/structureddata/
>
> *Statistics about the November 2017 Release:*
>
> Basic statistics about the November 2017 Microdata, JSON-LD, RDFa, and
> Microformat data sets as well as the vocabularies that are used
> together with each markup format are found at:
>
> http://webdatacommons.org/structureddata/2017-12/stats/stats.html
>
> *Markup Format Adoption*
>
> The page below provides an overview of the increase in the adoption of
> the different markup formats as well as widely used schema.org classes
> from 2012 to 2017:
>
> http://webdatacommons.org/structureddata/#toc10
>
> Comparing the statistics from the new 2017 release to the statistics
> about the October 2016 release of the data sets
>
> http://webdatacommons.org/structureddata/2016-10/stats/stats.html
>
> we see that the adoption of structured data keeps on increasing while
> Microdata remains the dominant markup syntax. The different
> nature of the crawling strategy that was used makes it hard to compare
> absolute as well as certain relative numbers between the two releases.
> More concretely, we observe that the November 2017 Common Crawl corpus
> is much deeper for certain domains like blogspot.com and wordpress.com
> while other domains are covered in a shallower way, with fewer URLs
> crawled in comparison to the October 2016 Common Crawl corpus.
> Nevertheless, it is clear that the growth rate of Microdata and
> Microformats is much higher than that of RDFa and embedded
> JSON-LD. Although the latter format is widespread, it is mainly
> used to annotate metadata for search actions (80% of the domains using
> JSON-LD) while only a few domains use it for annotating content
> information such as Organizations (25% of the domains using JSON-LD),
> Persons (4% of the domains using JSON-LD) or Offers (0.1% of the
> domains using JSON-LD).
>
> *Vocabulary Adoption*
>
> Concerning vocabulary adoption, schema.org, the vocabulary
> recommended by Google, Microsoft, Yahoo!, and Yandex, continues to be
> the dominant vocabulary in the context of Microdata, with 78% of the
> webmasters using it, in comparison to its predecessor, the
> data-vocabulary, which is used by only 14% of the websites containing
> Microdata. In the context of RDFa, the Open Graph Protocol recommended
> by Facebook remains the most widely used vocabulary.
>
> *Parallel Usage of Multiple Formats*
>
> Analyzing topic-specific subsets, we discover some interesting trends.
> As observed in the previous extractions, content-related information
> is mostly described either with the Microdata format or, less
> frequently, with the JSON-LD format, in both cases using the schema.org
> vocabulary. However, we find that 30% of the websites that use
> JSON-LD annotations to describe product-related information make use
> of Microdata as well as JSON-LD to cover the same topic. This is not
> the case for other topics, such as Hotels or Job Postings, for which
> webmasters use only one format to annotate their content.
>
> *Richer Descriptions of Job Postings*
>
> Following the release of the “Google for Jobs” search vertical and the
> more detailed guidance by Google on how to annotate job postings
> (https://developers.google.com/search/docs/data-types/job-posting), we
> see an increase in the number of websites annotating job postings
> (2017: 7,023, 2016: 6,352). In addition, the job posting annotations
> tend to become richer in comparison to the previous years, as the
> number of Job Posting related properties adopted by at least 30% of
> the websites containing job offers has increased from 4 (2016) to 7
> (2017). The newly adopted properties are JobPosting/url,
> JobPosting/datePosted, and JobPosting/employmentType.
>
> You can find a more extended analysis concerning specific topics, like
> Job Posting and Product, here:
>
> http://webdatacommons.org/structureddata/2017-12/stats/schema_org_subsets.html#extendedanalysis
>
> *Download*
>
> The overall size of the November 2017 RDFa, Microdata, Embedded
> JSON-LD and Microformat data sets is 38.7 billion RDF quads. For
> download, we split the data into 8,433 files with a total size of 858 GB.
>
> http://webdatacommons.org/structureddata/2017-12/stats/how_to_get_the_data.html
>
> In addition, we have created separate files for over 40 different
> schema.org classes, each including all quads extracted from pages
> that use a specific schema.org class.
>
> http://webdatacommons.org/structureddata/2017-12/stats/schema_org_subsets.html
>
> *Lots of thanks to:*
>
> + the Common Crawl project for providing their great web crawl and
>   thus enabling the WebDataCommons project.
> + the Any23 project for providing their great library of structured
>   data parsers.
> + Amazon Web Services in Education Grant for supporting WebDataCommons.
> + the Ministry of Economy, Research and Arts of Baden-Württemberg,
>   which supported through the ViCE project the extraction and analysis
>   of the November 2017 corpus.
>
> *General Information about the WebDataCommons Project*
>
> The WebDataCommons project extracts structured data from the Common
> Crawl, the largest web corpus available to the public, and provides
> the extracted data for public download in order to support researchers
> and companies in exploiting the wealth of information that is
> available on the Web. Besides the yearly extractions of semantic
> annotations from webpages, the WebDataCommons project also provides
> large hyperlink graphs, the largest public corpus of WebTables, a
> corpus of product data, as well as a collection of hypernyms extracted
> from billions of web pages for public download.
> General information about the WebDataCommons project is found at
>
> http://webdatacommons.org/
>
> Have fun with the new data set.
>
> Cheers,
> Anna Primpeli, Robert Meusel and Chris Bizer

--
Regards,

Kingsley Idehen
Founder & CEO
OpenLink Software (Home Page: http://www.openlinksw.com)

Weblogs (Blogs):
Legacy Blog: http://www.openlinksw.com/blog/~kidehen/
Blogspot Blog: http://kidehen.blogspot.com
Medium Blog: https://medium.com/@kidehen

Profile Pages:
Pinterest: https://www.pinterest.com/kidehen/
Quora: https://www.quora.com/profile/Kingsley-Uyi-Idehen
Twitter: https://twitter.com/kidehen
Google+: https://plus.google.com/+KingsleyIdehen/about
LinkedIn: http://www.linkedin.com/in/kidehen

Web Identities (WebID):
Personal: http://kingsley.idehen.net/public_home/kidehen/profile.ttl#i
        : http://id.myopenlink.net/DAV/home/KingsleyUyiIdehen/Public/kingsley.ttl#this
Received on Thursday, 11 January 2018 18:17:20 UTC