- From: Ed Summers <ehs@pobox.com>
- Date: Mon, 13 Sep 2010 21:04:45 -0400
- To: Antoine Isaac <aisaac@few.vu.nl>
- Cc: public-esw-thes@w3.org
On Sat, Sep 11, 2010 at 6:30 AM, Antoine Isaac <aisaac@few.vu.nl> wrote:
> That's really cool indeed! Yet it's quite puzzling: I don't know what kind
> of bias there is in this BTC dataset, but there seems to be a strange
> selection being made. To take a graph we know both quite well, it's just
> impossible that the full id.loc.gov contained so few as 27,392 SKOS triples.
> Or have they captured a state in which id.loc.gov did *not* contain LCSH?
> Do you have an idea?

I re-downloaded the Billion Triple Challenge data files, since some of the gz files were corrupted (thanks Antoine), and re-ran the stats, which you can find again at:

http://gist.github.com/574700

It's still not the ~2M triples from id.loc.gov, but it is up to 40,376 now. It's important to note that this number only includes assertions using SKOS predicates and classes, not other assertions about the same concepts that use vocabularies such as Dublin Core.

Oddly, the total number of SKOS triples for 2010 is 2,888,523, whereas for 2009 it was 21,883,510. I haven't really investigated what accounts for the drop-off.

I like Dan's idea of using the BTC SKOS data as a mechanism for doing a more targeted crawl of SKOS data on the web. I'm more than a bit curious how easy it would be in practice.

//Ed
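
For readers who want to reproduce this kind of count, here is a minimal sketch of tallying SKOS statements per host. It is not the script behind the gist. It assumes the BTC files are gzipped N-Quads with exactly four terms per line (subject, predicate, object, context), and it counts a statement when its predicate is in the SKOS namespace or when it is an rdf:type assertion whose object is a SKOS class. The function names and the simplified line splitting are assumptions made for illustration; a real run would likely want a proper N-Quads parser.

```python
import gzip
from collections import Counter
from urllib.parse import urlparse

SKOS = "http://www.w3.org/2004/02/skos/core#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def parse_quad(line):
    """Split one N-Quads line into (subject, predicate, object, context).

    Simplified on purpose: assumes four terms per line and no whitespace
    inside the subject, predicate, or context. Returns None otherwise.
    """
    line = line.strip()
    if not line or line.startswith("#") or not line.endswith("."):
        return None
    body = line[:-1].rstrip()
    parts = body.split(None, 2)          # subject, predicate, remainder
    if len(parts) != 3:
        return None
    subject, predicate, rest = parts
    obj_ctx = rest.rsplit(None, 1)       # object (may contain spaces), context
    if len(obj_ctx) != 2:
        return None
    obj, context = obj_ctx
    return subject, predicate.strip("<>"), obj, context.strip("<>")


def is_skos_statement(predicate, obj):
    """True for SKOS predicates, or rdf:type statements naming a SKOS class."""
    if predicate.startswith(SKOS):
        return True
    return predicate == RDF_TYPE and obj.startswith("<" + SKOS)


def count_skos_by_host(lines):
    """Count SKOS statements, grouped by the host of the context URL."""
    counts = Counter()
    for line in lines:
        quad = parse_quad(line)
        if quad is None:
            continue
        _, predicate, obj, context = quad
        if is_skos_statement(predicate, obj):
            counts[urlparse(context).netloc] += 1
    return counts


def count_skos_in_file(path):
    """Run the count over one gzipped N-Quads file (path is up to the caller)."""
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as handle:
        return count_skos_by_host(handle)
```

Summing the counters returned by `count_skos_in_file` over every file in a dump gives per-host totals. Because only SKOS predicates and classes are matched, a Dublin Core statement about the same concept is not counted, which is the distinction noted above.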
Received on Tuesday, 14 September 2010 01:05:13 UTC