Re: skos in billion-triple-challenge data from Ed Summers on 2010-09-14 (public-esw-thes@w3.org from September 2010)

From: Ed Summers <ehs@pobox.com>
Date: Mon, 13 Sep 2010 21:04:45 -0400
To: Antoine Isaac <aisaac@few.vu.nl>
Cc: public-esw-thes@w3.org
Message-ID: <AANLkTi=sje+EY_ofHf6KDeAa+gwx84htkqA1Escvh_xk@mail.gmail.com>

On Sat, Sep 11, 2010 at 6:30 AM, Antoine Isaac <aisaac@few.vu.nl> wrote:
> That's really cool indeed! Yet it's quite puzzling: I don't know what kind
> of bias there is in this BTC dataset, but there seems to be a strange
> selection being made. To take a graph we know both quite well, it's just
> impossible that the full id.loc.gov contained so few as 27,392 SKOS triples.
> Or have they captured a state in which id.loc.gov did *not* contain LCSH?
> Do you have an idea?

I re-downloaded the Billion Triple Challenge data files, since some of
the gz files were corrupted (thanks Antoine), and re-ran the stats,
which you can find again at:

  http://gist.github.com/574700

It's still not the ~2M triples from id.loc.gov, but it is up to 40,376
now. It's important to note that this number only includes assertions
using SKOS predicates and classes, not other assertions related to the
concept in vocabularies such as DublinCore. Oddly, the total number of
SKOS triples for 2010 is 2,888,523, whereas for 2009 it was
21,883,510. I haven't investigated yet to see what accounts for the
drop-off.
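
For anyone who wants to reproduce a count like this, here is a rough
sketch of the counting rule described above. It is not the script behind
the gist. It assumes the BTC dump is gzipped N-Quads (subject, predicate,
object, context on each line), that a "SKOS triple" is one whose predicate
is in the SKOS namespace or whose rdf:type object is a SKOS class, and that
triples are grouped by the host of the context URI. The function names and
the regex-based parsing are illustrative only; a proper N-Quads parser
would be more robust.

```python
import gzip
import re
from collections import Counter
from urllib.parse import urlparse

SKOS_NS = "http://www.w3.org/2004/02/skos/core#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# Loose N-Quads pattern: subject, <predicate>, object, <context>, final dot.
# Assumption: one statement per line; lines that do not fit are skipped.
LINE_RE = re.compile(r'^(\S+)\s+<([^>]+)>\s+(.*?)\s+<([^>]+)>\s*\.\s*$')


def is_skos_statement(predicate, obj):
    """True for SKOS predicates, or rdf:type statements naming a SKOS class."""
    if predicate.startswith(SKOS_NS):
        return True
    return predicate == RDF_TYPE and obj.startswith("<" + SKOS_NS)


def count_skos_by_host(lines):
    """Count SKOS statements per host, using the host of the context URI."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if match is None:
            continue  # malformed or non-quad line
        _subject, predicate, obj, context = match.groups()
        if is_skos_statement(predicate, obj):
            counts[urlparse(context).netloc] += 1
    return counts


def count_skos_in_file(path):
    """Run the count over one gzipped N-Quads file (path is hypothetical)."""
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as handle:
        return count_skos_by_host(handle)
```

Note that a Dublin Core statement about a concept is not counted under
this rule, which is one reason a per-host total can look smaller than the
full size of the graph.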

I like Dan's idea of using the BTC SKOS data as a mechanism for doing
a more targeted crawl of SKOS data on the web. I'm more than a bit
curious how easy it would be in practice.

//Ed

Received on Tuesday, 14 September 2010 01:05:13 UTC