- From: Martin Hepp <mfhepp@gmail.com>
- Date: Fri, 24 Jun 2016 00:21:50 +0200
- To: Simon Spero <sesuncedu@gmail.com>, Harry Halpin <hhalpin@ibiblio.org>
- Cc: Semantic Web <semantic-web@w3.org>
I would be very careful about treating WebDataCommons as representative of the Web, because the underlying Common Crawl includes only a small portion of the detail pages of database-driven Web sites, and such pages often account for a major share of the structured data on the Web.
For an analysis in the e-commerce domain, see our COLD2015 paper:
http://www.heppnetz.de/files/commoncrawl-cold2015.pdf
See also these threads:
http://lists.w3.org/Archives/Public/public-vocabs/2012Mar/0095.html and
http://lists.w3.org/Archives/Public/public-vocabs/2012Apr/0016.html
But I suspect that the fundamental problem exists in other domains as well.
Martin
-----------------------------------
martin hepp http://www.heppnetz.de
mhepp@computer.org @mfhepp
> On 22 Jun 2016, at 17:47, Simon Spero <sesuncedu@gmail.com> wrote:
>
> This data series is derived from Common Crawl. The most recent extract was published in April, using crawl data from November 2015, so whether it is a realistic snapshot of 2016 is a nice question:
>
> http://webdatacommons.org/structureddata/
>
> Simon
>
> On Jun 22, 2016 11:07 AM, "Harry Halpin" <hhalpin@ibiblio.org> wrote:
> Are there any data-sets available that are realistic snapshots of the state of linked data in 2016?
>
> I used to search using Sindice, but it's been down a while.
>
> I see Swoogle is still up (http://swoogle.umbc.edu), but I'm not sure whether its index is still being updated.
>
> Also, a bunch of raw triples would be fine, à la the BTC (but a recent data set, from at least 2014 or later).
>
> yours,
> harry
>
Received on Thursday, 23 June 2016 22:22:22 UTC