- From: Martin Hepp <mfhepp@gmail.com>
- Date: Fri, 24 Jun 2016 00:21:50 +0200
- To: Simon Spero <sesuncedu@gmail.com>, Harry Halpin <hhalpin@ibiblio.org>
- Cc: Semantic Web <semantic-web@w3.org>
I would be very careful about considering the WebDataCommons representative of the Web, because the underlying Common Crawl includes only a small portion of the detail pages of database-driven Web sites, which often account for a major part of structured data.

For an analysis in the e-commerce domain, see our COLD2015 paper:

http://www.heppnetz.de/files/commoncrawl-cold2015.pdf

See also the threads

http://lists.w3.org/Archives/Public/public-vocabs/2012Mar/0095.html

and

http://lists.w3.org/Archives/Public/public-vocabs/2012Apr/0016.html

But I suspect that the fundamental problem exists in other domains as well.

Martin

-----------------------------------
martin hepp
http://www.heppnetz.de
mhepp@computer.org
@mfhepp

> On 22 Jun 2016, at 17:47, Simon Spero <sesuncedu@gmail.com> wrote:
>
> This data series is derived from Common Crawl. The most recent extract is from April, using data from November 2015, so whether it is a realistic snapshot of 2016 is a nice question:
>
> http://webdatacommons.org/structureddata/
>
> Simon
>
> On Jun 22, 2016 11:07 AM, "Harry Halpin" <hhalpin@ibiblio.org> wrote:
> Are there any data-sets available that are realistic snapshots of the state of linked data in 2016?
>
> I used to search using Sindice, but it's been down a while.
>
> I see Swoogle is still up (http://swoogle.umbc.edu), but I'm not sure if it has updated its index.
>
> Also, a bunch of triples in the raw would also be fine, à la the BTC (but a recent data-set, at least 2014 or later).
>
> yours,
> harry
Received on Thursday, 23 June 2016 22:22:22 UTC