- From: Gautam Kishore Shahi <gautamshahi16@gmail.com>
- Date: Fri, 12 Jan 2018 07:03:02 +0530
- To: Anna Primpeli <anna@informatik.uni-mannheim.de>
- Cc: semantic-web@w3.org, public-schemaorg@w3.org, public-vocabs@w3.org
- Message-ID: <CAAAHJp_tw8JO7A=sVs_294uXefRRT4+UR9iV-HTsS-wwuo1X3w@mail.gmail.com>
Dear Ms. Anna,

Thank you for the update. Some of the class-specific data is not accessible, for instance the data for http://schema.org/GeoCoordinates. Could you please update the class-specific data?

Regards,

On Thu, Jan 11, 2018 at 3:05 PM, Anna Primpeli <anna@informatik.uni-mannheim.de> wrote:

> Hi All,
>
> we are happy to announce the new release of the WebDataCommons Microdata,
> JSON-LD, RDFa and Microformat data corpus.
>
> The data has been extracted from the November 2017 version of the Common
> Crawl covering 3.2 billion HTML pages which originate from 26 million
> websites (pay-level domains).
>
> In summary, we found structured data within 1.2 billion HTML pages out of
> the 3.2 billion pages contained in the crawl (38.9%). These pages originate
> from 7.4 million different pay-level domains out of the 26 million
> pay-level domains covered by the crawl (28.4%).
>
> Approximately 3.7 million of these websites use Microdata, 2.6 million
> websites use JSON-LD, and 1.2 million websites make use of RDFa.
> Microformats are used by more than 3.3 million websites within the crawl.
>
> *Background:*
>
> More and more websites annotate data describing for instance products,
> people, organizations, places, events, reviews, and cooking recipes within
> their HTML pages using markup formats such as Microdata, embedded JSON-LD,
> RDFa and Microformat.
>
> The WebDataCommons project extracts all Microdata, JSON-LD, RDFa, and
> Microformat data from the Common Crawl web corpus, the largest web corpus
> that is available to the public, and provides the extracted data for
> download. In addition, we publish statistics about the adoption of the
> different markup formats as well as the vocabularies that are used together
> with each format.
> We have run yearly extractions since 2012 and we provide the
> dataset series as well as the related statistics at:
>
> http://webdatacommons.org/structureddata/
>
> *Statistics about the November 2017 Release:*
>
> Basic statistics about the November 2017 Microdata, JSON-LD, RDFa, and
> Microformat data sets as well as the vocabularies that are used together
> with each markup format are found at:
>
> http://webdatacommons.org/structureddata/2017-12/stats/stats.html
>
> *Markup Format Adoption*
>
> The page below provides an overview of the increase in the adoption of the
> different markup formats as well as widely used schema.org classes from
> 2012 to 2017:
>
> http://webdatacommons.org/structureddata/#toc10
>
> Comparing the statistics from the new 2017 release to the statistics about
> the October 2016 release of the data sets
>
> http://webdatacommons.org/structureddata/2016-10/stats/stats.html
>
> we see that the adoption of structured data keeps on increasing while
> Microdata remains the most dominant markup syntax. The different nature of
> the crawling strategy that was used makes it hard to compare absolute as
> well as certain relative numbers between the two releases. More concretely,
> we observe that the November 2017 Common Crawl corpus is much deeper for
> certain domains like blogspot.com and wordpress.com while other domains
> are covered in a shallower way, with fewer URLs crawled in comparison to
> the October 2016 Common Crawl corpus. Nevertheless, it is clear that the
> growth rate of Microdata and Microformats is much higher than that of
> RDFa and embedded JSON-LD. Although the latter format is widespread,
> it is mainly used to annotate metadata for search actions (80% of the
> domains using JSON-LD) while only a few domains use it for annotating
> content information such as Organizations (25% of the domains using
> JSON-LD), Persons (4% of the domains using JSON-LD) or Offers (0.1% of the
> domains using JSON-LD).
>
> *Vocabulary Adoption*
>
> Concerning the vocabulary adoption, schema.org, the vocabulary
> recommended by Google, Microsoft, Yahoo!, and Yandex, continues to be the
> most dominant in the context of Microdata with 78% of the webmasters using
> it in comparison to its predecessor, the data-vocabulary, which is only
> used by 14% of the websites containing Microdata. In the context of RDFa,
> the Open Graph Protocol recommended by Facebook remains the most widely
> used vocabulary.
>
> *Parallel Usage of Multiple Formats*
>
> Analyzing topic-specific subsets, we discover some interesting trends. As
> observed in the previous extractions, content-related information is mostly
> described either with the Microdata format or less frequently with the
> JSON-LD format, in both cases using the schema.org vocabulary. However,
> we find that 30% of the websites that use JSON-LD annotations to
> describe product-related information make use of Microdata as well as
> JSON-LD to cover the same topic. This is not the case for other topics,
> such as Hotels or Job Postings, for which webmasters use only one format to
> annotate their content.
>
> *Richer Descriptions of Job Postings*
>
> Following the release of the “Google for Jobs” search vertical and the
> more detailed guidance by Google on how to annotate job postings
> (https://developers.google.com/search/docs/data-types/job-posting), we see
> an increase in the number of websites annotating job postings (2017: 7,023,
> 2016: 6,352). In addition, the job posting annotations tend to become
> richer in comparison to the previous years as the number of Job Posting
> related properties adopted by at least 30% of the websites containing job
> offers has increased from 4 (2016) to 7 (2017). The newly adopted
> properties are JobPosting/url, JobPosting/datePosted, and
> JobPosting/employmentType.
>
> You can find a more extended analysis concerning specific topics, like Job
> Posting and Product, here:
>
> http://webdatacommons.org/structureddata/2017-12/stats/schema_org_subsets.html#extendedanalysis
>
> *Download*
>
> The overall size of the November 2017 RDFa, Microdata, Embedded JSON-LD
> and Microformat data sets is 38.7 billion RDF quads. For download, we split
> the data into 8,433 files with a total size of 858 GB.
>
> http://webdatacommons.org/structureddata/2017-12/stats/how_to_get_the_data.html
>
> In addition, we have created separate files for over 40 different
> schema.org classes, each including all quads extracted from pages that
> use the specific schema.org class.
>
> http://webdatacommons.org/structureddata/2017-12/stats/schema_org_subsets.html
>
> *Lots of thanks to:*
>
> + the Common Crawl project for providing their great web crawl and
> thus enabling the WebDataCommons project.
> + the Any23 project for providing their great library of structured
> data parsers.
> + Amazon Web Services in Education Grant for supporting WebDataCommons.
> + the Ministry of Economy, Research and Arts of Baden-Württemberg, which
> supported the extraction and analysis of the November 2017 corpus
> through the ViCE project.
>
> *General Information about the WebDataCommons Project*
>
> The WebDataCommons project extracts structured data from the Common Crawl,
> the largest web corpus available to the public, and provides the extracted
> data for public download in order to support researchers and companies in
> exploiting the wealth of information that is available on the Web. Besides
> the yearly extractions of semantic annotations from webpages, the
> WebDataCommons project also provides large hyperlink graphs, the largest
> public corpus of WebTables, a corpus of product data, as well as a
> collection of hypernyms extracted from billions of web pages for public
> download.
> General information about the WebDataCommons project is found at
>
> http://webdatacommons.org/
>
> Have fun with the new data set.
>
> Cheers,
> Anna Primpeli, Robert Meusel and Chris Bizer

--
Gautam Kishore Shahi,
Master Student,
DISI - University of Trento, Italy
Received on Friday, 12 January 2018 01:40:09 UTC