
Re: WebDataCommons releases 38.7 billion quads Microdata, Embedded JSON-LD, RDFa and Microformat data originating from 7.4 million pay-level-domains

From: Gautam Kishore Shahi <gautamshahi16@gmail.com>
Date: Fri, 12 Jan 2018 07:03:02 +0530
Message-ID: <CAAAHJp_tw8JO7A=sVs_294uXefRRT4+UR9iV-HTsS-wwuo1X3w@mail.gmail.com>
To: Anna Primpeli <anna@informatik.uni-mannheim.de>
Cc: semantic-web@w3.org, public-schemaorg@w3.org, public-vocabs@w3.org
Dear Ms. Anna,

Thank you for the update.

Some of the class-specific data is not accessible, for instance the data for
http://schema.org/GeoCoordinates. Could you please update the
class-specific data?


On Thu, Jan 11, 2018 at 3:05 PM, Anna Primpeli <anna@informatik.uni-mannheim.de> wrote:

> Hi All,
> we are happy to announce the new release of the WebDataCommons Microdata,
> JSON-LD, RDFa and Microformat data corpus.
> The data has been extracted from the November 2017 version of the Common
> Crawl covering 3.2 billion HTML pages which originate from 26 million
> websites (pay-level domains).
> In summary, we found structured data within 1.2 billion HTML pages out of
> the 3.2 billion pages contained in the crawl (38.9%). These pages originate
> from 7.4 million different pay-level domains out of the 26 million
> pay-level-domains covered by the crawl (28.4%).
> Approximately 3.7 million of these websites use Microdata, 2.6 million
> websites use JSON-LD, and 1.2 million websites make use of RDFa.
> Microformats are used by more than 3.3 million websites within the crawl.
> *Background:*
> More and more websites annotate data describing, for instance, products,
> people, organizations, places, events, reviews, and cooking recipes within
> their HTML pages using markup formats such as Microdata, embedded JSON-LD,
> RDFa, and Microformats.
> The WebDataCommons project extracts all Microdata, JSON-LD, RDFa, and
> Microformat data from the Common Crawl web corpus, the largest web corpus
> that is available to the public, and provides the extracted data for
> download. In addition, we publish statistics about the adoption of the
> different markup formats as well as the vocabularies that are used together
> with each format. We have run yearly extractions since 2012 and provide the
> dataset series as well as the related statistics at:
> http://webdatacommons.org/structureddata/
> *Statistics about the November 2017 Release:*
> Basic statistics about the November 2017 Microdata, JSON-LD, RDFa, and
> Microformat data sets as well as the vocabularies that are used together
> with each markup format are found at:
> http://webdatacommons.org/structureddata/2017-12/stats/stats.html
> *Markup Format Adoption*
> The page below provides an overview of the increase in the adoption of the
> different markup formats as well as widely used schema.org classes from
> 2012 to 2017:
> http://webdatacommons.org/structureddata/#toc10
> Comparing the statistics from the new 2017 release to the statistics about
> the October 2016 release of the data sets
> http://webdatacommons.org/structureddata/2016-10/stats/stats.html
> we see that the adoption of structured data keeps on increasing while
> Microdata remains the most dominant markup syntax. The different nature of
> the crawling strategy that was used makes it hard to compare absolute as
> well as certain relative numbers between the two releases. More concretely,
> we observe that the November 2017 Common Crawl corpus is much deeper for
> certain domains like blogspot.com and wordpress.com while other domains
> are covered in a shallower way, with fewer URLs crawled in comparison to
> the October 2016 Common Crawl corpus. Nevertheless, it is clear that the
> growth rate of Microdata and Microformats is much higher than that of
> RDFa and embedded JSON-LD. Although the latter format is widespread,
> it is mainly used to annotate metadata for search actions (80% of the
> domains using JSON-LD), while only a few domains use it for annotating
> content information such as Organizations (25% of the domains using
> JSON-LD), Persons (4% of the domains using JSON-LD) or Offers (0.1% of the
> domains using JSON-LD).
> *Vocabulary Adoption*
> Concerning vocabulary adoption, schema.org, the vocabulary
> recommended by Google, Microsoft, Yahoo!, and Yandex, continues to be the
> dominant vocabulary in the context of Microdata: 78% of the webmasters use
> it, whereas its predecessor, data-vocabulary, is used by only
> 14% of the websites containing Microdata. In the context of RDFa,
> the Open Graph Protocol recommended by Facebook remains the most widely
> used vocabulary.
> *Parallel Usage of Multiple Formats*
> Analyzing topic-specific subsets, we discover some interesting trends. As
> observed in the previous extractions, content-related information is mostly
> described either with the Microdata format or, less frequently, with the
> JSON-LD format, in both cases using the schema.org vocabulary. However,
> we find that 30% of the websites that use JSON-LD annotations to
> describe product-related information use both Microdata and
> JSON-LD to cover the same topic. This is not the case for other topics,
> such as Hotels or Job Postings, for which webmasters use only one format to
> annotate their content.
> *Richer Descriptions of Job Postings*
> Following the release of the “Google for Jobs” search vertical and the
> more detailed guidance by Google on how to annotate job postings (
> https://developers.google.com/search/docs/data-types/job-posting), we see
> an increase in the number of websites annotating job postings (2017: 7,023,
> 2016: 6,352). In addition, the job posting annotations tend to become
> richer in comparison to the previous years as the number of Job Posting
> related properties adopted by at least 30% of the websites containing job
> offers has increased from 4 (2016) to 7 (2017). The newly adopted
> properties are JobPosting/url, JobPosting/datePosted, and
> JobPosting/employmentType.
> You can find a more extended analysis concerning specific topics, like Job
> Posting and Product, here:
> http://webdatacommons.org/structureddata/2017-12/stats/schema_org_subsets.html#extendedanalysis
> *Download*
> The overall size of the November 2017 RDFa, Microdata, Embedded JSON-LD
> and Microformat data sets is 38.7 billion RDF quads. For download, we split
> the data into 8,433 files with a total size of 858 GB.
> http://webdatacommons.org/structureddata/2017-12/stats/how_to_get_the_data.html
> In addition, we have created separate files for over 40 different
> schema.org classes, each including all quads extracted from pages that use
> the specific schema.org class.
> http://webdatacommons.org/structureddata/2017-12/stats/schema_org_subsets.html
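
For readers who want a class-specific subset that is not available for
download, the idea described above can be approximated directly on the full
N-Quads files. The following is only a rough sketch, not the procedure
WebDataCommons itself uses. It assumes that each line is one N-Quads
statement whose fourth element (the graph) is the URL of the page the quad
was extracted from; the function names are hypothetical, and the naive line
splitting is not a full N-Quads parser.

```python
import re

RDF_TYPE = "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>"
# Assumption: the last <...> before the closing " ." is the graph (page URL).
GRAPH_RE = re.compile(r"<([^<>]*)>\s*\.\s*$")


def graph_of(line):
    """Return the graph IRI (fourth element) of an N-Quads line, or None."""
    match = GRAPH_RE.search(line)
    return match.group(1) if match else None


def pages_using_class(lines, class_iri):
    """First pass: collect the pages that type some entity as class_iri."""
    target = "<" + class_iri + ">"
    pages = set()
    for line in lines:
        parts = line.split(None, 2)  # subject, predicate, remainder
        if len(parts) == 3 and parts[1] == RDF_TYPE and parts[2].startswith(target):
            graph = graph_of(line)
            if graph is not None:
                pages.add(graph)
    return pages


def class_subset(lines, class_iri):
    """Second pass: keep every quad that comes from one of those pages."""
    lines = list(lines)
    pages = pages_using_class(lines, class_iri)
    return [line for line in lines if graph_of(line) in pages]
```

Calling class_subset(lines, "http://schema.org/GeoCoordinates") on the lines
of one file returns all quads from pages that use that class. For files that
do not fit in memory, the two passes would have to read the file twice
instead of holding the lines in a list.
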
> *Lots of thanks to:*
> + the Common Crawl project for providing their great web crawl and
> thus enabling the WebDataCommons project.
> + the Any23 project for providing their great library of structured
> data parsers.
> + Amazon Web Services in Education Grant for supporting WebDataCommons.
> + the Ministry of Economy, Research and Arts of Baden-Württemberg, which
> supported the extraction and analysis of the November 2017 corpus through
> the ViCE project.
> *General Information about the WebDataCommons Project*
> The WebDataCommons project extracts structured data from the Common Crawl,
> the largest web corpus available to the public, and provides the extracted
> data for public download in order to support researchers and companies in
> exploiting the wealth of information that is available on the Web. Besides
> the yearly extractions of semantic annotations from webpages, the
> WebDataCommons project also provides large hyperlink graphs, the largest
> public corpus of WebTables, a corpus of product data, as well as a
> collection of hypernyms extracted from billions of web pages for public
> download. General information about the WebDataCommons project is found at
> http://webdatacommons.org/
> Have fun with the new data set.
> Cheers,
> Anna Primpeli, Robert Meusel and Chris Bizer

Gautam Kishore Shahi
Master's Student
DISI - University of Trento
Received on Friday, 12 January 2018 02:14:55 UTC
