RE: WebDataCommons releases 38.7 billion quads Microdata, Embedded JSON-LD, RDFa and Microformat data originating from 7.4 million pay-level-domains

Thanks, I'll check them out.

On Jan 16, 2018 22:48, "Anna Primpeli" <anna@informatik.uni-mannheim.de>
wrote:

> Dear Mr. Shahi,
>
>
>
> Thank you very much for the notification.
>
> I updated the links. All the files should be accessible now.
>
>
>
> Please let us know in case you have any other issues accessing the WDC
> datasets.
>
>
>
> Best Regards,
>
> Anna
>
>
>
> *From:* Gautam Kishore Shahi [mailto:gautamshahi16@gmail.com]
> *Sent:* Friday, January 12, 2018 2:33 AM
> *To:* Anna Primpeli <anna@informatik.uni-mannheim.de>
> *Cc:* semantic-web@w3.org; public-schemaorg@w3.org; public-vocabs@w3.org
> *Subject:* Re: WebDataCommons releases 38.7 billion quads Microdata,
> Embedded JSON-LD, RDFa and Microformat data originating from 7.4 million
> pay-level-domains
>
>
>
> Dear Ms. Anna,
>
>
>
> Thank you for giving the update.
>
>
>
> Some of the class-specific data files are not accessible, for instance
> the one for http://schema.org/GeoCoordinates. Could you please update the
> class-specific data?
>
>
>
> Regards,
>
>
>
> On Thu, Jan 11, 2018 at 3:05 PM, Anna Primpeli <
> anna@informatik.uni-mannheim.de> wrote:
>
> Hi All,
>
> we are happy to announce the new release of the WebDataCommons Microdata,
> JSON-LD, RDFa and Microformat data corpus.
>
> The data has been extracted from the November 2017 version of the Common
> Crawl covering 3.2 billion HTML pages which originate from 26 million
> websites (pay-level domains).
>
> In summary, we found structured data within 1.2 billion HTML pages out of
> the 3.2 billion pages contained in the crawl (38.9%). These pages originate
> from 7.4 million different pay-level domains out of the 26 million
> pay-level-domains covered by the crawl (28.4%).
>
> Approximately 3.7 million of these websites use Microdata, 2.6 million
> websites use JSON-LD, and 1.2 million websites make use of RDFa.
> Microformats are used by more than 3.3 million websites within the crawl.
>
>
>
> *Background:*
>
> More and more websites annotate data describing, for instance, products,
> people, organizations, places, events, reviews, and cooking recipes within
> their HTML pages using markup formats such as Microdata, embedded JSON-LD,
> RDFa and Microformats.
>
> The WebDataCommons project extracts all Microdata, JSON-LD, RDFa, and
> Microformat data from the Common Crawl web corpus, the largest web corpus
> that is available to the public, and provides the extracted data for
> download. In addition, we publish statistics about the adoption of the
> different markup formats as well as the vocabularies that are used together
> with each format. We have run yearly extractions since 2012 and provide the
> dataset series as well as the related statistics at:
>
> http://webdatacommons.org/structureddata/
>
>
>
> *Statistics about the November 2017 Release:*
>
> Basic statistics about the November 2017 Microdata, JSON-LD, RDFa, and
> Microformat data sets as well as the vocabularies that are used together
> with each markup format are found at:
>
> http://webdatacommons.org/structureddata/2017-12/stats/stats.html
>
>
>
> *Markup Format Adoption*
>
> The page below provides an overview of the increase in the adoption of the
> different markup formats as well as widely used schema.org classes from
> 2012 to 2017:
>
> http://webdatacommons.org/structureddata/#toc10
>
> Comparing the statistics from the new 2017 release to the statistics about
> the October 2016 release of the data sets
>
> http://webdatacommons.org/structureddata/2016-10/stats/stats.html
>
> we see that the adoption of structured data keeps on increasing while
> Microdata remains the most dominant markup syntax. The different nature of
> the crawling strategy that was used makes it hard to compare absolute as
> well as certain relative numbers between the two releases. More concretely,
> we observe that the November 2017 Common Crawl corpus is much deeper for
> certain domains like blogspot.com and wordpress.com while other domains
> are covered in a shallower way, with fewer URLs crawled in comparison to
> the October 2016 Common Crawl corpus. Nevertheless, it is clear that the
> growth rate of Microdata and Microformats is much higher than the one of
> RDFa and embedded JSON-LD. Although the latter format is widespread,
> it is mainly used to annotate metadata for search actions (80% of the
> domains using JSON-LD) while only a few domains use it for annotating
> content information such as Organizations (25% of the domains using
> JSON-LD), Persons (4% of the domains using JSON-LD) or Offers (0.1% of the
> domains using JSON-LD).
>
>
>
> *Vocabulary Adoption*
>
> Concerning the vocabulary adoption, schema.org, the vocabulary
> recommended by Google, Microsoft, Yahoo!, and Yandex, continues to be the
> most dominant in the context of Microdata with 78% of the webmasters using
> it in comparison to its predecessor, the data-vocabulary, which is only
> used by 14% of the websites containing Microdata. In the context of RDFa,
> the Open Graph Protocol recommended by Facebook remains the most widely
> used vocabulary.
>
>
>
> *Parallel Usage of Multiple Formats*
>
> Analyzing topic-specific subsets, we discover some interesting trends. As
> observed in the previous extractions, content-related information is mostly
> described either with the Microdata format or less frequently with the
> JSON-LD format, in both cases using the schema.org vocabulary. However,
> we find that 30% of the websites that use JSON-LD annotations to
> describe product-related information use Microdata as well as
> JSON-LD to cover the same topic. This is not the case for other topics,
> such as Hotels or Job Postings, for which webmasters use only one format to
> annotate their content.
>
>
>
> *Richer Descriptions of Job Postings*
>
> Following the release of the “Google for Jobs” search vertical and the
> more detailed guidance by Google on how to annotate job postings (
> https://developers.google.com/search/docs/data-types/job-posting), we see
> an increase in the number of websites annotating job postings (2017: 7,023,
> 2016: 6,352). In addition, the job posting annotations tend to become
> richer in comparison to the previous years as the number of Job Posting
> related properties adopted by at least 30% of the websites containing job
> offers has increased from 4 (2016) to 7 (2017). The newly adopted
> properties are JobPosting/url, JobPosting/datePosted, and
> JobPosting/employmentType.
>
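The "adopted by at least 30% of the websites" criterion above can be illustrated with a minimal sketch. The input structure (a mapping from website to the set of JobPosting properties it uses), the function name, and the example sites are hypothetical; this is not the WebDataCommons analysis code.

```python
from collections import defaultdict


def widely_adopted_properties(site_properties, threshold=0.30):
    """Return the properties used by at least `threshold` of all sites.

    site_properties: dict mapping a site name to the set of property
    names found on that site (hypothetical input format).
    """
    total_sites = len(site_properties)
    if total_sites == 0:
        return []

    # Count each property once per site, no matter how often it occurs there.
    sites_using = defaultdict(int)
    for properties in site_properties.values():
        for prop in set(properties):
            sites_using[prop] += 1

    return sorted(
        prop for prop, count in sites_using.items()
        if count / total_sites >= threshold
    )


# Made-up example: four sites annotating job postings.
example = {
    "a.example": {"JobPosting/title", "JobPosting/url"},
    "b.example": {"JobPosting/title", "JobPosting/datePosted"},
    "c.example": {"JobPosting/title", "JobPosting/url"},
    "d.example": {"JobPosting/title"},
}

print(widely_adopted_properties(example))
# ['JobPosting/title', 'JobPosting/url']
```

In this made-up example, JobPosting/datePosted is used by 1 of 4 sites (25%) and therefore falls below the 30% threshold.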
> You can find a more extended analysis concerning specific topics, like Job
> Posting and Product, here:
>
> http://webdatacommons.org/structureddata/2017-12/stats/schema_org_subsets.html#extendedanalysis
>
>
>
> *Download*
>
> The overall size of the November 2017 RDFa, Microdata, Embedded JSON-LD
> and Microformat data sets is 38.7 billion RDF quads. For download, we split
> the data into 8,433 files with a total size of 858 GB.
>
> http://webdatacommons.org/structureddata/2017-12/stats/how_to_get_the_data.html
>
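Once a file has been downloaded, the quads can be read line by line. The sketch below assumes that the files are gzip-compressed N-Quads with one quad per line and that the fourth element is the URL of the page the quad was extracted from. Both are assumptions not stated in this announcement, so please check the download page. The regular expression is a simplified pattern, not a full N-Quads parser, and all function names are hypothetical.

```python
import gzip
import re
from collections import Counter
from urllib.parse import urlsplit

# Simplified pattern for one RDF term: IRI, blank node, or literal
# (optionally followed by a language tag or a datatype IRI).
_TERM = r'(<[^>]*>|_:\S+|"(?:[^"\\]|\\.)*"(?:@[A-Za-z0-9-]+|\^\^<[^>]*>)?)'
_QUAD = re.compile(r"^\s*" + r"\s+".join([_TERM] * 4) + r"\s*\.\s*$")


def parse_quad(line):
    """Return (subject, predicate, object, graph) or None if the line does not match."""
    match = _QUAD.match(line)
    return match.groups() if match else None


def quads_per_host(lines):
    """Count quads per host name of the graph IRI (assumed to be the page URL)."""
    counts = Counter()
    for line in lines:
        quad = parse_quad(line)
        if quad is None:
            continue  # skip lines the simplified pattern cannot handle
        graph = quad[3]
        if graph.startswith("<"):
            counts[urlsplit(graph[1:-1]).hostname] += 1
    return counts


def quads_per_host_in_file(path):
    """Same as quads_per_host, reading from a gzip-compressed file (assumed format)."""
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as handle:
        return quads_per_host(handle)


# Made-up sample lines for illustration.
SAMPLE = [
    '_:n1 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/JobPosting> <http://jobs.example.com/1> .',
    '_:n1 <http://schema.org/title> "Data \\"Engineer\\""@en <http://jobs.example.com/1> .',
    'not a quad',
]

print(quads_per_host(SAMPLE))
# Counter({'jobs.example.com': 2})
```

Note that this groups by full host name, which is not the same as the pay-level domain used in the statistics; mapping hosts to pay-level domains would additionally require a public suffix list.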
> In addition, we have created separate files for over 40 different
> schema.org classes. Each file includes all quads extracted from pages that
> use the specific schema.org class.
>
> http://webdatacommons.org/structureddata/2017-12/stats/schema_org_subsets.html
>
>
>
> *Lots of thanks to:*
>
> + the Common Crawl project for providing their great web crawl and
> thus enabling the WebDataCommons project.
> + the Any23 project for providing their great library of structured
> data parsers.
> + Amazon Web Services in Education Grant for supporting WebDataCommons.
> + the Ministry of Economy, Research and Arts of Baden-Württemberg, which
> supported the extraction and analysis of the November 2017 corpus through
> the ViCE project.
>
> *General Information about the WebDataCommons Project*
>
> The WebDataCommons project extracts structured data from the Common Crawl,
> the largest web corpus available to the public, and provides the extracted
> data for public download in order to support researchers and companies in
> exploiting the wealth of information that is available on the Web. Besides
> the yearly extractions of semantic annotations from webpages, the
> WebDataCommons project also provides large hyperlink graphs, the largest
> public corpus of WebTables, a corpus of product data, as well as a
> collection of hypernyms extracted from billions of web pages for public
> download. General information about the WebDataCommons project is found at
>
> http://webdatacommons.org/
>
>
> Have fun with the new data set.
>
> Cheers,
> Anna Primpeli, Robert Meusel and Chris Bizer
>
>
>
>
>
>
>
> --
>
> Gautam Kishore Shahi,
>
> Master Student,
>
> DISI- University of Trento,
>
> Italy
>

Received on Tuesday, 16 January 2018 10:33:29 UTC