Re: skos in billion-triple-challenge data from Antoine Isaac on 2010-09-15 (public-lld@w3.org from September 2010)

From: Antoine Isaac <aisaac@few.vu.nl>
Date: Wed, 15 Sep 2010 17:34:34 +0200
To: Ed Summers <ehs@pobox.com>
CC: public-esw-thes@w3.org, Dan Brickley <danbri@danbri.org>, public-lld <public-lld@w3.org>
Message-ID: <4C90E78A.1050600@few.vu.nl>

Hi Ed, Dan,

[Continuing the thread at http://lists.w3.org/Archives/Public/public-esw-thes/2010Sep/0004.html, and ccing the Library Linked Data list, as this may be of some relevance]

I agree that in principle this should be really useful, but I still have some doubts looking at the figures, even after your first fix.
As you say, we're still far from the full SKOS figures for the id.loc.gov dataset. As another example, at [1] I see 1401 SKOS triples for metadataregistry.org, which seems really small.

In the stats you extracted for mappings [2], there are also some very surprising figures, such as 1,143 links between id.loc.gov and stitch.cs.vu.nl. The number of mappings between these two sets has never been below 55K.
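
To make the comparison concrete, per-host figures like those in [1] and [2] could be recomputed roughly as follows. This is only a sketch and not the script Ed used: it assumes N-Quads input, attributes each triple to the host of its subject, and uses made-up sample IRIs and hypothetical function names.

```python
import re
from collections import Counter
from urllib.parse import urlparse

SKOS = "http://www.w3.org/2004/02/skos/core#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
# SKOS mapping properties (current SKOS namespace only).
MAPPING = {SKOS + name for name in (
    "mappingRelation", "exactMatch", "closeMatch",
    "broadMatch", "narrowMatch", "relatedMatch")}

# Naive line pattern: subject IRI, predicate IRI, then everything else.
# Lines whose subject is a blank node do not match and are skipped.
LINE = re.compile(r"^<([^>]*)>\s+<([^>]*)>\s+(.*)$")
IRI_START = re.compile(r"^<([^>]*)>")


def parse(line):
    """Return (subject, predicate, object IRI or None), or None if unparsable."""
    m = LINE.match(line.strip())
    if m is None:
        return None
    subject, predicate, rest = m.groups()
    obj = IRI_START.match(rest)  # None for literals and blank nodes
    return subject, predicate, (obj.group(1) if obj else None)


def host(iri):
    return urlparse(iri).netloc


def skos_stats(lines):
    """Count SKOS triples per subject host and mapping links per host pair."""
    triples_by_host = Counter()
    links_by_hosts = Counter()
    for line in lines:
        parsed = parse(line)
        if parsed is None:
            continue
        s, p, o = parsed
        # A triple counts as SKOS if it uses a SKOS predicate or a SKOS class.
        uses_skos_class = p == RDF_TYPE and o is not None and o.startswith(SKOS)
        if not (p.startswith(SKOS) or uses_skos_class):
            continue
        triples_by_host[host(s)] += 1
        if p in MAPPING and o is not None and host(o) != host(s):
            links_by_hosts[(host(s), host(o))] += 1
    return triples_by_host, links_by_hosts


# Hypothetical sample data, for illustration only.
SAMPLE = [
    '<http://id.loc.gov/x/sh1> <' + SKOS + 'prefLabel> "Cats"@en <http://id.loc.gov/doc> .',
    '<http://id.loc.gov/x/sh1> <' + RDF_TYPE + '> <' + SKOS + 'Concept> <http://id.loc.gov/doc> .',
    '<http://id.loc.gov/x/sh1> <' + SKOS + 'closeMatch> <http://stitch.cs.vu.nl/x/r1> <http://id.loc.gov/doc> .',
    '<http://id.loc.gov/x/sh1> <http://purl.org/dc/terms/modified> "2010-01-01" <http://id.loc.gov/doc> .',
]

triples_by_host, links_by_hosts = skos_stats(SAMPLE)
print(triples_by_host.most_common())
print(links_by_hosts.most_common())
```

Attributing a triple to its subject's host is just one choice; counting by the quad's context (the fourth element) instead could give quite different figures, which may be part of what makes the numbers hard to compare.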

Dan's suggestion

> I also have the sense that these big general crawls can be a bit lumpy/quirky. However they can give us starting points for recrawling more specific targets more comprehensively.

made me investigate a bit. I've looked at the description of the Billion Triple Challenge data [3], which gives some details on the process:

> The major part of the dataset was crawled during March/April 2010 based on datasets provided by Falcon-S, Sindice, Swoogle, SWSE, and Watson

I followed up on their sources and checked on Sindice how many SKOS concepts it has indexed. With the query [4] I get 122.33K concepts. If I'm right, this is less than the number of concepts in LCSH (or RAMEAU) alone, and yet Sindice is supposed to harvest them. I guess something is wrong in the way we publish our datasets here...

Cheers,

Antoine

[1] http://gist.github.com/574700
[2] http://gist.github.com/578370
[3] http://km.aifb.kit.edu/projects/btc-2010/
[4] http://www.sindice.com/search?q=rdf%3Atype&qv=http%3A%2F%2Fwww.w3.org%2F2004%2F02%2Fskos%2Fcore%23Concept&qt=ifp

> On Sat, Sep 11, 2010 at 6:30 AM, Antoine Isaac <aisaac@few.vu.nl> wrote:
>> > That's really cool indeed! Yet it's quite puzzling: I don't know what kind
>> > of bias there is in this BTC dataset, but there seems to be a strange
>> > selection being made. To take a graph we know both quite well, it's just
>> > impossible that the full id.loc.gov contained so few as 27,392 SKOS triples.
>> > Or have they captured a state in which id.loc.gov did *not* contain LCSH?
>> > Do you have an idea?
> I re-downloaded the Billion Triple Challenge data files, since some of
> the gz files were corrupted (thanks Antoine), and re-ran the stats,
> which you can find again at:
>
>   http://gist.github.com/574700
>
> It's still not the ~2M triples from id.loc.gov, but it is up to 40,376
> now. It's important to note that this number only includes assertions
> using SKOS predicates and classes, not other assertions related to the
> concept in vocabularies such as DublinCore. Oddly, the total number of
> SKOS triples for 2010 is 2,888,523, whereas for 2009 it was
> 21,883,510. I haven't really investigated much to see what accounts
> for the drop off.
>
> I like Dan's idea of using the BTC skos data as a mechanism for doing
> a more targeted crawl of SKOS data on the web. I'm more than a bit
> curious how easy it would be in practice.
>
> //Ed
>

> I meant to add that I also ran some statistics for host names that
> link to each other using skos mapping properties:
>
>    http://gist.github.com/578370
>
> //Ed
>

Received on Wednesday, 15 September 2010 15:35:10 UTC