RE: WebDataCommons releases 44.2 billion quads Microdata, Embedded JSON-LD, RDFa and Microformat data originating from 5.6 million pay-level-domains

Hello all,

We added a disclaimer to the Web Data Commons website (http://webdatacommons.org/structureddata/#disclaimer) explaining the blank node issue brought up by Hugh.

The same blank node identifier can appear for more than one distinct blank node because the extraction framework runs in parallel. However, the uniqueness of the blank nodes can easily be recovered by combining the blank node identifier with the fourth element of the quad, i.e. the URL of the page.
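
As an illustration of the combination described above, here is a minimal Python sketch. It is not part of the extraction framework: the function names, the label layout and the choice of MD5 are assumptions made for the example. It also assumes that every line is an N-Quads statement whose fourth element is the page URL, and it uses a loose regular expression rather than a full N-Quads parser.

```python
import hashlib
import re

# Loose pattern for one N-Quads line: subject, predicate, object, graph IRI, final dot.
# Assumption: every line carries a graph IRI (the page URL) as its fourth element.
QUAD_RE = re.compile(r'^(\S+)\s+(\S+)\s+(.*)\s+(<[^>]*>)\s*\.\s*$')


def scope_label(label, graph):
    """Build a new blank node label from a hash of the page URL plus the old label."""
    digest = hashlib.md5(graph.encode("utf-8")).hexdigest()
    return "_:g" + digest + "x" + label[2:]


def rewrite_quad(line):
    """Return the quad with page-scoped blank node labels; other lines pass through."""
    line = line.rstrip("\n")
    match = QUAD_RE.match(line)
    if match is None:
        return line
    subj, pred, obj, graph = (part.strip() for part in match.groups())
    # Blank nodes can only occur in subject or object position.
    if subj.startswith("_:"):
        subj = scope_label(subj, graph)
    if obj.startswith("_:"):
        obj = scope_label(obj, graph)
    return f"{subj} {pred} {obj} {graph} ."
```

Applied line by line to a dump file, this gives the same label to every occurrence of a blank node within one page and different labels to blank nodes from different pages, so the files can be loaded into a single store without merging unrelated nodes.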

We would like to leave this issue open for discussion and we welcome you to send us any possible suggestions you have concerning this topic.

Thank you in advance!

Cheers,
Anna Primpeli, Robert Meusel and Chris Bizer



-----Original Message-----
From: Hugh Glaser [mailto:hugh@glasers.org] 
Sent: Sunday, January 22, 2017 3:10 PM
To: Anna Primpeli <anna@informatik.uni-mannheim.de>
Cc: Semantic Web IG <semantic-web@w3.org>; public-vocabs@w3.org
Subject: Re: WebDataCommons releases 44.2 billion quads Microdata, Embedded JSON-LD, RDFa and Microformat data originating from 5.6 million pay-level-domains

Hi,
So here's a couple of questions.

The html-rdfa data has very long bnode labels (MD5-like, plus some extra).
Can I assume that these are unique to the whole dataset?
That is, if I were to do the equivalent of loading all the quads from the whole dataset into the same store, I won't get any collisions on bnodes.
It would be great if I can.
And when you are thinking about what to do with the jsonld, it would be great if the solution also meant that each bnode was unique to the whole dataset.

Best
Hugh

> On 21 Jan 2017, at 11:16, Hugh Glaser <hugh@glasers.org> wrote:
> 
> Hi there,
> I'm really sorry I seem to have found a problem with the NQuads in the jsonld part of your data dumps.
> (I haven't looked at any of the others.) It's to do with the blank 
> node labelling, https://lists.w3.org/Archives/Public/public-lod/2017Jan/0051.html in case you didn't see it.
> It is, of course, with a heavy heart I report this.
> Very best
> Hugh
> 
>> On 20 Jan 2017, at 12:23, Anna Primpeli <anna@informatik.uni-mannheim.de> wrote:
>> 
>> Hi all,
>> 
>> 
>> we are happy to announce a new release of the WebDataCommons Microdata, Embedded JSON-LD, RDFa and Microformat data corpus.
>> The data has been extracted from the October 2016 version of the CommonCrawl covering 3.2 billion HTML pages which originate from 34 million websites (pay-level domains).
>> Altogether we discovered structured data within 1.2 billion HTML pages out of the 3.2 billion pages contained in the crawl (38%). These pages originate from 5.6 million different pay-level domains out of the 34 million pay-level domains covered by the crawl (16.5%).
>> Approximately 2.5 million of these websites use Microdata, 2.1 million websites employ JSON-LD, and 938 thousand websites use RDFa. Microformats are used by over 1.6 million websites within the crawl.
>> 
>> Background: 
>> 
>> More and more websites annotate structured data within their HTML pages using markup formats such as RDFa, Microdata, embedded JSON-LD and Microformats. The annotations cover topics such as products, reviews, people, organizations, places, events, and cooking recipes.
>> 
>> The WebDataCommons project extracts all Microdata, RDFa data, and Microformat data, and since 2015 also embedded JSON-LD data from the Common Crawl web corpus, the largest and most up-to-date web corpus that is available to the public, and provides the extracted data for download. In addition, we publish statistics about the adoption of the different markup formats as well as the vocabularies that are used together with each format. 
>> Besides the markup data, the WebDataCommons project also provides 
>> large web table corpora and web graphs for download. General 
>> information about the WebDataCommons project is found at
>> 
>> http://webdatacommons.org/
>> 
>> 
>> Data Set Statistics: 
>> 
>> Basic statistics about the October 2016 Microdata, Embedded JSON-LD, 
>> RDFa and Microformat data sets as well as the vocabularies that are 
>> used together with each markup format are found at:
>> 
>> http://webdatacommons.org/structureddata/2016-10/stats/stats.html
>> Comparing the statistics to the statistics about the November 2015 
>> release of the data sets
>> 
>> http://webdatacommons.org/structureddata/2015-11/stats/stats.html
>> we see that Microdata remains the dominant annotation format. Although it is hard to compare the adoption of the syntax between the two years in absolute numbers, as the October 2016 crawl corpus is almost double the size of the November 2015 one, a relative increase can be observed: in the October 2016 corpus over 44% of the pay-level domains containing markup data make use of the Microdata syntax, compared to 40% one year earlier. Even though the absolute numbers for RDFa adoption rise, the increase does not keep pace with the growth of the corpus, indicating that RDFa is used by a smaller share of websites. As in the 2015 release, the adoption of embedded JSON-LD has increased considerably, even though the main focus of the annotations remains the search action offered by the websites (70%).
>> As already observed in the previous years, the schema.org vocabulary is most frequently used in the context of Microdata while the adoption of its predecessor, the data vocabulary, continues to decrease. In the context of RDFa, we still find the Open Graph Protocol recommended by Facebook to be the most widely used vocabulary.
>> Topic-wise, the trends identified in the former extractions continue. We see that besides navigational, blog and CMS-related meta-information, many websites annotate e-commerce related data (Products, Offers, and Reviews) as well as contact information (LocalBusiness, Organization, PostalAddress). More concretely, the October 2016 corpus includes more than 682 million product records originating from 249 thousand websites which use the schema.org vocabulary. The new release contains postal address data for more than 291 million entities originating from 338 thousand websites. Furthermore, the content describing hotels has doubled in size in this release, with a total of 61 million hotel descriptions.
>> Visualizations of the main adoption trends concerning the different 
>> annotation formats, popular schema.org, as well as RDFa classes 
>> within the time span 2012 to 2016 are found at
>> http://webdatacommons.org/structureddata/#toc8
>> 
>> Download:
>> The overall size of the October 2016 Microdata, RDFa, Embedded JSON-LD, and Microformat data sets is 44.2 billion RDF quads. For download, we split the data into 9,661 files with a total size of 987 GB. 
>> 
>> http://webdatacommons.org/structureddata/2016-10/stats/how_to_get_the_data.html
>> In addition, we have created separate files for over 40 different schema.org classes, each including all quads from pages that use the specific class at least once.
>> 
>> http://webdatacommons.org/structureddata/2016-10/stats/schema_org_subsets.html
>> 
>> Lots of thanks to: 
>> 
>> + the Common Crawl project for providing their great web crawl and thus enabling the WebDataCommons project. 
>> + the Any23 project for providing their great library of structured data parsers. 
>> + Amazon Web Services in Education Grant for supporting WebDataCommons. 
>> + the Ministry of Economy, Research and Arts of Baden-Württemberg, which supported the extraction and analysis of the October 2016 corpus through the ViCe project.
>> 
>> 
>> Have fun with the new data set. 
>> 
>> Cheers,
>> Anna Primpeli, Robert Meusel, and Chris Bizer
> 
> 

Received on Monday, 23 January 2017 16:02:20 UTC