Re: ANN: WebDataCommons releases 44.2 billion quads Microdata, Embedded JSON-LD, RDFa and Microformat data originating from 5.6 million pay-level-domains

One thing that I would love to see from this extracted data...

schema.org/Organization topped the charts as the main entity. (no surprise)

But for us and future forward...
What were the millions of schema.org/Thing items that folks wired up and
that we don't have classes for yet?

I bet someone could find that out and cluster them into some chart.
From that chart, we could then actually see the next set of Types that we
should work on.
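
A rough sketch of how someone might start on that, assuming the N-Quads
download files and the http://schema.org/ namespace: for every entity
whose only type is schema.org/Thing, collect the set of properties used
on it and count how often each property combination occurs. The function
name and the simplified regex are illustrative assumptions, not a full
N-Quads parser and not part of any WebDataCommons tooling.

```python
import re
from collections import Counter, defaultdict

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
SCHEMA_THING = "http://schema.org/Thing"  # assumed namespace form

# Simplified pattern: subject, <predicate>, object, <graph> followed by " ."
# Assumption: one quad per line with a graph IRI in fourth position.
QUAD = re.compile(r'^(\S+)\s+<([^>]+)>\s+(.+?)\s+<([^>]*)>\s*\.\s*$')

def thing_property_profiles(lines):
    """Count property sets of entities typed only as schema.org/Thing."""
    types = defaultdict(set)  # (graph, subject) -> rdf:type IRIs
    props = defaultdict(set)  # (graph, subject) -> other predicates
    for line in lines:
        m = QUAD.match(line.strip())
        if not m:
            continue  # skip anything the simplified pattern cannot read
        subj, pred, obj, graph = m.groups()
        key = (graph, subj)  # blank node labels are only unique per page
        if pred == RDF_TYPE:
            types[key].add(obj.strip("<>"))
        else:
            props[key].add(pred)
    profiles = Counter()
    for key, type_set in types.items():
        if type_set == {SCHEMA_THING}:
            profiles[frozenset(props[key])] += 1
    return profiles
```

Calling `thing_property_profiles(open(path, encoding="utf-8"))` on one
file and looking at `.most_common(20)` would give a first list of
property combinations to cluster into candidate types.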

-Thad

On Thu, Jan 19, 2017 at 3:08 AM <anna@informatik.uni-mannheim.de> wrote:

>
> Hi all,
>
> we are happy to announce a new release of the WebDataCommons Microdata,
> Embedded JSON-LD, RDFa and Microformat data corpus.
>
> The data has been extracted from the October 2016 version of the
> CommonCrawl covering 3.2 billion HTML pages which originate from 34 million
> websites (pay-level domains).
>
> Altogether we discovered structured data within 1.2 billion HTML pages out
> of the 3.2 billion pages contained in the crawl (38%). These pages
> originate from 5.6 million different pay-level domains out of the 34
> million pay-level domains covered by the crawl (16.5%).
>
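
As a quick sanity check of the shares quoted above, here is a minimal
sketch that recomputes them from the rounded figures in the announcement,
taking the total as the 34 million pay-level domains named in the first
paragraph. The variable names are illustrative only.

```python
# Rounded figures taken from the announcement text.
pages_total = 3.2e9
pages_with_data = 1.2e9
plds_total = 34e6
plds_with_data = 5.6e6
plds_microdata = 2.5e6

page_share = 100 * pages_with_data / pages_total
pld_share = 100 * plds_with_data / plds_total
microdata_share = 100 * plds_microdata / plds_with_data

print(f"{page_share:.1f}% of pages contain structured data")
print(f"{pld_share:.1f}% of pay-level domains contain structured data")
print(f"{microdata_share:.1f}% of those domains use Microdata")
```

With these rounded inputs the page share comes out at 37.5%, which the
announcement reports as 38%; the other two values are in line with the
quoted 16.5% and "over 44%".
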
> Approximately 2.5 million of these websites use Microdata, 2.1 million
> websites employ JSON-LD, and 938 thousand websites use RDFa. Microformats
> are used by over 1.6 million websites within the crawl.
>
>
>
> *Background:*
>
> More and more websites annotate structured data within their HTML pages
> using markup formats such as RDFa, Microdata, embedded JSON-LD and
> Microformats. The annotations cover topics such as products, reviews,
> people, organizations, places, events, and cooking recipes.
>
> The WebDataCommons project extracts all Microdata, RDFa data, and
> Microformat data, and since 2015 also embedded JSON-LD data from the Common
> Crawl web corpus, the largest and most up-to-date web corpus that is
> available to the public, and provides the extracted data for download. In
> addition, we publish statistics about the adoption of the different markup
> formats as well as the vocabularies that are used together with each
> format.
>
> Besides the markup data, the WebDataCommons project also provides large
> web table corpora and web graphs for download. General information about
> the WebDataCommons project is found at
>
> http://webdatacommons.org/
>
>
>
> *Data Set Statistics:*
>
> Basic statistics about the October 2016 Microdata, Embedded JSON-LD, RDFa
> and Microformat data sets as well as the vocabularies that are used
> together with each markup format are found at:
>
> http://webdatacommons.org/structureddata/2016-10/stats/stats.html
>
> Comparing these statistics to those of the November 2015 release of the
> data sets
>
> http://webdatacommons.org/structureddata/2015-11/stats/stats.html
>
> we see that the Microdata syntax remains the dominant annotation format.
> Although it is hard to compare the adoption of the syntax between the two
> years in absolute numbers, as the October 2016 crawl corpus is almost
> double the size of the November 2015 one, a relative increase can be
> observed: in the October 2016 corpus, over 44% of the pay-level domains
> containing markup data make use of the Microdata syntax, compared to 40%
> one year earlier. Even though the absolute numbers for RDFa adoption have
> risen, the increase does not keep pace with the growth of the corpus,
> indicating that RDFa is used by a smaller share of websites. As in the
> 2015 release, the adoption of embedded JSON-LD has increased considerably,
> even though the main focus of the annotations remains the search action
> offered by the websites (70%).
>
> As already observed in the previous years, the schema.org vocabulary is
> most frequently used in the context of Microdata while the adoption of its
> predecessor, the data vocabulary, continues to decrease. In the context of
> RDFa, we still find the Open Graph Protocol recommended by Facebook to be
> the most widely used vocabulary.
>
> Topic-wise, the trends identified in the former extractions continue. We
> see that besides navigational, blog, and CMS-related meta-information,
> many websites annotate e-commerce related data (Products, Offers, and
> Reviews) as well as contact information (LocalBusiness, Organization,
> PostalAddress). More concretely, the October 2016 corpus includes more than
> 682 million product records originating from 249 thousand websites which
> use the schema.org vocabulary. The new release contains postal address
> data for more than 291 million entities originating from 338 thousand
> websites. Furthermore, the content describing hotels has doubled in size in
> this release, with a total of 61 million hotel descriptions.
>
> Visualizations of the main adoption trends concerning the different
> annotation formats and popular schema.org as well as RDFa classes within
> the time span 2012 to 2016 are found at
>
> http://webdatacommons.org/structureddata/#toc8
>
>
>
> *Download:*
>
> The overall size of the October 2016 Microdata, RDFa, Embedded JSON-LD,
> and Microformat data sets is 44.2 billion RDF quads. For download, we split
> the data into 9,661 files with a total size of 987 GB.
>
>
> http://webdatacommons.org/structureddata/2016-10/stats/how_to_get_the_data.html
>
> In addition, we have created separate files for over 40 different
> schema.org classes, each including all quads from pages that deploy the
> specific class at least once.
>
>
> http://webdatacommons.org/structureddata/2016-10/stats/schema_org_subsets.html
>
>
>
> *Lots of thanks to:*
>
> + the Common Crawl project for providing their great web crawl and
> thus enabling the WebDataCommons project.
> + the Any23 project for providing their great library of structured
> data parsers.
> + Amazon Web Services in Education Grant for supporting WebDataCommons.
> + the Ministry of Economy, Research and Arts of Baden-Württemberg, which
> supported the extraction and analysis of the October 2016 corpus by means
> of the ViCe project.
>
>
> Have fun with the new data set.
>
> Cheers,
> Anna Primpeli, Robert Meusel, and Chris Bizer
>

Received on Thursday, 19 January 2017 15:02:03 UTC