I would be very careful about considering the WebDataCommons data representative of the Web, because the underlying Common Crawl includes only a small portion of the detail pages of database-driven Web sites, which often account for a major part of structured data.

For an analysis in the e-commerce domain, see our COLD2015 paper:

http://www.heppnetz.de/files/commoncrawl-cold2015.pdf

See also the threads

http://lists.w3.org/Archives/Public/public-vocabs/2012Mar/0095.html

and

http://lists.w3.org/Archives/Public/public-vocabs/2012Apr/0016.html

But I suspect that the fundamental problem exists in other domains as well.

Martin

-----------------------------------
martin hepp
http://www.heppnetz.de
mhepp@computer.org
@mfhepp

> On 22 Jun 2016, at 17:47, Simon Spero <sesuncedu@gmail.com> wrote:
>
> This data series is derived from Common Crawl. The most recent extract is from April, using data from November 2015, so whether it is a realistic snapshot of 2016 is a nice question:
>
> http://webdatacommons.org/structureddata/
>
> Simon
>
> On Jun 22, 2016 11:07 AM, "Harry Halpin" <hhalpin@ibiblio.org> wrote:
> Are there any data sets available that are realistic snapshots of the state of linked data in 2016?
>
> I used to search using Sindice, but it has been down for a while.
>
> I see Swoogle is still up (http://swoogle.umbc.edu), but I am not sure whether its index is being updated.
>
> Also, a bunch of triples in the raw would also be fine, à la the BTC (but a recent data set, at least 2014 or later).
>
> yours,
> harry

Received on Thursday, 23 June 2016 22:22:22 UTC
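[Editor's note: the "triples in the raw, à la the BTC" that Harry asks for are typically distributed as N-Quads dumps, which is also the format WebDataCommons publishes. As a rough illustration, not part of the original thread, the sketch below parses individual N-Quads lines with a deliberately simplified regular expression. The pattern and example data are assumptions for demonstration only; it is not a conformant N-Quads parser, and a library such as rdflib is what one would actually use on a real dump.]

```python
import re

# Simplified N-Quads line pattern: subject, predicate, object, optional
# graph IRI, terminated by a dot. Handles IRIs, blank nodes, and plain,
# typed, or language-tagged literals; it ignores escaping and other
# corner cases a real parser must cover.
NQUAD = re.compile(
    r'^(<[^>]+>|_:\S+)\s+'                                   # subject
    r'(<[^>]+>)\s+'                                          # predicate
    r'(<[^>]+>|_:\S+|"[^"]*"(?:\^\^<[^>]+>|@[\w-]+)?)\s+'    # object
    r'(<[^>]+>)?\s*\.\s*$'                                   # optional graph
)

def parse_nquad(line):
    """Return (subject, predicate, object, graph-or-None), or None."""
    m = NQUAD.match(line.strip())
    return m.groups() if m else None

# Hypothetical example line, not taken from any actual dump:
example = ('<http://example.org/p1> <http://schema.org/name> '
           '"Acme Widget" <http://example.org/> .')
print(parse_nquad(example))
# -> ('<http://example.org/p1>', '<http://schema.org/name>',
#     '"Acme Widget"', '<http://example.org/>')
```

Lines that omit the fourth element are plain triples, so the graph slot comes back as None; streaming a dump through this line by line gives the "raw triples" view the thread is after.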