- From: Thad Guidry <thadguidry@gmail.com>
- Date: Thu, 19 Jan 2017 15:01:18 +0000
- To: "public-schemaorg@w3.org <public-schemaorg@w3.org>" <public-schemaorg@w3.org>
- Message-ID: <CAChbWaNDATbiZceK5BfrwZpidRhmULt+hzYqoX5ifWGWfPqiJg@mail.gmail.com>
One thing that I would love to see from this extracted...

schema.org/Organization topped the charts as the main entity. (no surprise)

But for us and future forward...
What were the millions of schema.org/Thing's that folks wired up, that we don't have classes for yet? I bet someone could find that out and cluster them into some chart.

From that chart, we could then actually see the next set of Types that we should work on.

-Thad

On Thu, Jan 19, 2017 at 3:08 AM <anna@informatik.uni-mannheim.de> wrote:

> Hi all,
>
> we are happy to announce a new release of the WebDataCommons Microdata,
> Embedded JSON-LD, RDFa and Microformat data corpus.
>
> The data has been extracted from the October 2016 version of the
> Common Crawl covering 3.2 billion HTML pages which originate from
> 34 million websites (pay-level domains).
>
> Altogether we discovered structured data within 1.2 billion HTML pages
> out of the 3.2 billion pages contained in the crawl (38%). These pages
> originate from 5.6 million different pay-level domains out of the
> 34 million pay-level domains covered by the crawl (16.5%).
>
> Approximately 2.5 million of these websites use Microdata, 2.1 million
> websites employ JSON-LD, and 938 thousand websites use RDFa.
> Microformats are used by over 1.6 million websites within the crawl.
>
> *Background:*
>
> More and more websites annotate structured data within their HTML pages
> using markup formats such as RDFa, Microdata, embedded JSON-LD and
> Microformats. The annotations cover topics such as products, reviews,
> people, organizations, places, events, and cooking recipes.
>
> The WebDataCommons project extracts all Microdata, RDFa data, and
> Microformat data, and since 2015 also embedded JSON-LD data from the
> Common Crawl web corpus, the largest and most up-to-date web corpus
> that is available to the public, and provides the extracted data for
> download. In addition, we publish statistics about the adoption of the
> different markup formats as well as the vocabularies that are used
> together with each format.
>
> Besides the markup data, the WebDataCommons project also provides large
> web table corpora and web graphs for download. General information
> about the WebDataCommons project is found at
>
> http://webdatacommons.org/
>
> *Data Set Statistics:*
>
> Basic statistics about the October 2016 Microdata, Embedded JSON-LD,
> RDFa and Microformat data sets as well as the vocabularies that are
> used together with each markup format are found at:
>
> http://webdatacommons.org/structureddata/2016-10/stats/stats.html
>
> Comparing the statistics to the statistics about the November 2015
> release of the data sets
>
> http://webdatacommons.org/structureddata/2015-11/stats/stats.html
>
> we see that the Microdata syntax remains the most dominant annotation
> format. Although it is hard to compare the adoption of the syntax
> between the two years in absolute numbers, as the October 2016 crawl
> corpus is almost double the size of the November 2015 one, a relative
> increase can be observed: In the October 2016 corpus over 44% of the
> pay-level domains containing markup data make use of the Microdata
> syntax in comparison to 40% one year earlier. Even though the absolute
> numbers concerning the RDFa markup syntax adoption rise, the relative
> increase does not keep pace with the increase in corpus size,
> indicating that RDFa is used less by websites. Similar to the 2015
> release, the adoption of embedded JSON-LD has considerably increased,
> even though the main focus of the annotation remains the search action
> offered by the websites (70%).
>
> As already observed in the previous years, the schema.org vocabulary is
> most frequently used in the context of Microdata while the adoption of
> its predecessor, the data vocabulary, continues to decrease. In the
> context of RDFa, we still find the Open Graph Protocol recommended by
> Facebook to be the most widely used vocabulary.
>
> Topic-wise the trends identified in the former extractions continue. We
> see that besides navigational, blog and CMS related meta-information,
> many websites annotate e-commerce related data (Products, Offers, and
> Reviews) as well as contact information (LocalBusiness, Organization,
> PostalAddress). More concretely, the October 2016 corpus includes more
> than 682 million product records originating from 249 thousand websites
> which use the schema.org vocabulary. The new release contains postal
> address data for more than 291 million entities originating from
> 338 thousand websites. Furthermore, the content describing hotels has
> doubled in size in this release, with a total of 61 million hotel
> descriptions.
>
> Visualizations of the main adoption trends concerning the different
> annotation formats, popular schema.org, as well as RDFa classes within
> the time span 2012 to 2016 are found at
>
> http://webdatacommons.org/structureddata/#toc8
>
> *Download:*
>
> The overall size of the October 2016 Microdata, RDFa, Embedded JSON-LD,
> and Microformat data sets is 44.2 billion RDF quads. For download, we
> split the data into 9,661 files with a total size of 987 GB.
>
> http://webdatacommons.org/structureddata/2016-10/stats/how_to_get_the_data.html
>
> In addition, we have created separate files for over 40 different
> schema.org classes, each including all quads from pages that deploy
> the specific class at least once.
>
> http://webdatacommons.org/structureddata/2016-10/stats/schema_org_subsets.html
>
> *Lots of thanks to:*
>
> + the Common Crawl project for providing their great web crawl and
>   thus enabling the WebDataCommons project.
> + the Any23 project for providing their great library of structured
>   data parsers.
> + Amazon Web Services in Education Grant for supporting WebDataCommons.
> + the Ministry of Economy, Research and Arts of Baden-Württemberg which
>   supported by means of the ViCe project the extraction and analysis of
>   the October 2016 corpus.
>
> Have fun with the new data set.
>
> Cheers,
> Anna Primpeli, Robert Meusel, and Chris Bizer
Received on Thursday, 19 January 2017 15:02:03 UTC