
Re: Semantic Web Search engines/Billion Triple Challenge or other data-sets?

From: Martin Hepp <mfhepp@gmail.com>
Date: Fri, 24 Jun 2016 00:21:50 +0200
Cc: Semantic Web <semantic-web@w3.org>
Message-Id: <91E07038-5FC9-4FD9-BCCF-8FEBBEFD27F9@gmail.com>
To: Simon Spero <sesuncedu@gmail.com>, Harry Halpin <hhalpin@ibiblio.org>
I would be very careful about treating WebDataCommons as representative of the Web: the underlying Common Crawl includes only a small portion of the detail pages of database-driven Web sites, and those pages often account for a major part of the structured data.

For an analysis in the e-commerce domain, see our COLD2015 paper:


See also the threads

    http://lists.w3.org/Archives/Public/public-vocabs/2012Mar/0095.html and 

But I suspect that the fundamental problem exists in other domains as well.


martin hepp  http://www.heppnetz.de
mhepp@computer.org          @mfhepp

> On 22 Jun 2016, at 17:47, Simon Spero <sesuncedu@gmail.com> wrote:
> This data series is derived from Common Crawl. The most recent extract is from April, using data from November 2015, so whether it is a realistic snapshot of 2016 is a nice question :
> http://webdatacommons.org/structureddata/
> Simon
> On Jun 22, 2016 11:07 AM, "Harry Halpin" <hhalpin@ibiblio.org> wrote:
> Are there any data-sets available that are realistic snapshots of the state of linked data in 2016?
> I used to search using Sindice, but it's been down a while.
> I see Swoogle is still up (http://swoogle.umbc.edu) but not sure if it's updated its index.
> Also, a bunch of triples in the raw would also be fine, ala the BTC (but a recent data-set, at least 2014 or later).
>   yours,
>    harry
Received on Thursday, 23 June 2016 22:22:22 UTC
