- From: Martin Hepp <mfhepp@gmail.com>
- Date: Fri, 24 Jun 2016 00:21:50 +0200
- To: Simon Spero <sesuncedu@gmail.com>, Harry Halpin <hhalpin@ibiblio.org>
- Cc: Semantic Web <semantic-web@w3.org>
I would be very careful about treating WebDataCommons as representative of the Web, because the underlying Common Crawl includes only a small portion of the detail pages of database-driven Web sites, and such pages often account for a major share of the structured data on the Web.
For an analysis in the e-commerce domain, see our COLD2015 paper:
http://www.heppnetz.de/files/commoncrawl-cold2015.pdf
See also these threads:
http://lists.w3.org/Archives/Public/public-vocabs/2012Mar/0095.html and
http://lists.w3.org/Archives/Public/public-vocabs/2012Apr/0016.html
But I suspect that the fundamental problem exists in other domains as well.
Martin
-----------------------------------
martin hepp http://www.heppnetz.de
mhepp@computer.org @mfhepp
> On 22 Jun 2016, at 17:47, Simon Spero <sesuncedu@gmail.com> wrote:
>
> This data series is derived from Common Crawl. The most recent extract was published in April, using crawl data from November 2015, so whether it is a realistic snapshot of 2016 is a nice question:
>
> http://webdatacommons.org/structureddata/
>
> Simon
>
> On Jun 22, 2016 11:07 AM, "Harry Halpin" <hhalpin@ibiblio.org> wrote:
> Are there any data-sets available that are realistic snapshots of the state of linked data in 2016?
>
> I used to search using Sindice, but it's been down a while.
>
> I see Swoogle is still up (http://swoogle.umbc.edu), but I'm not sure whether its index is still being updated.
>
> Also, a bunch of raw triples would be fine, à la the BTC (but a recent data set, from at least 2014 or later).
>
> yours,
> harry
>
Received on Thursday, 23 June 2016 22:22:22 UTC