Re: skos in billion-triple-challenge data

Hi Ed, Dan,

[Continuing the thread at, and ccing the Library Linked Data list, as this may be of some relevance]

I agree that in principle this should be really useful, but I still have some doubts looking at the figures, even after your first fix.
As you say, we're still far from the full SKOS figures for the dataset. As another example, at [1] I see 1401 SKOS triples at, which seems really small.

In the stats you extracted for mappings [2], there are also some very surprising figures, like 1,143 links between and The number of mappings between these two sets has never been below 55K.

Dan's suggestion

> I also have the sense that these big general crawls can be a bit lumpy/quirky. However they can give us starting points for recrawling more specific targets more comprehensively.

made me investigate a bit. I've looked at the description of the Billion Triple Challenge data [3], which gives some details about the process:

> The major part of the dataset was crawled during March/April 2010 based on datasets provided by Falcon-S, Sindice, Swoogle, SWSE, and Watson

I've followed their source and checked on Sindice the number of SKOS concepts there. With the query [4] I get 122.33K concepts. If I'm right, this is less than the number of concepts in LCSH (or RAMEAU) alone, and yet Sindice is supposed to harvest them. I guess we have something wrong in the way we publish our datasets here...
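To cross-check such figures directly against the BTC files, here is a minimal sketch of how SKOS statements could be counted per host in an N-Quads dump. It is not the script behind Ed's stats. It assumes that a "SKOS triple" is any statement whose predicate is in the SKOS namespace, or an rdf:type statement whose class is in the SKOS namespace, and that statements are grouped by the host of the context (fourth element), falling back to the subject. The line parsing is deliberately simplified and is no substitute for a real N-Quads parser; the function names are only illustrative.

```python
import re
from collections import Counter
from urllib.parse import urlparse

SKOS_NS = "http://www.w3.org/2004/02/skos/core#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# Simplified statement pattern: subject, predicate, then "object [context]".
LINE_RE = re.compile(r'^\s*(<[^>]*>|_:\S+)\s+<([^>]*)>\s+(.*?)\s*\.\s*$')
# Optional trailing context IRI after the object.
CONTEXT_RE = re.compile(r'^(.*\S)\s+<([^<>"\s]*)>$')


def is_skos_statement(predicate, obj):
    """True if the predicate or the rdf:type class is in the SKOS namespace."""
    if predicate.startswith(SKOS_NS):
        return True
    return predicate == RDF_TYPE and obj.startswith("<" + SKOS_NS)


def count_skos_by_host(lines):
    """Count SKOS statements per host (assumption: host of context, else subject)."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.match(line)
        if not m:
            continue  # skip blank, comment, or malformed lines
        subject, predicate, rest = m.groups()
        ctx = CONTEXT_RE.match(rest)
        if ctx:
            obj, source = ctx.group(1), ctx.group(2)
        else:
            obj, source = rest, subject.strip("<>")
        if is_skos_statement(predicate, obj):
            counts[urlparse(source).netloc or "(unknown)"] += 1
    return counts
```

Since `count_skos_by_host` accepts any iterable of lines, it can be fed a file opened with `gzip.open(path, "rt", encoding="utf-8")` for each chunk of the dump, and the resulting counters can be summed. Statements that use other vocabularies (Dublin Core, for instance) on the same concepts are not counted.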




> On Sat, Sep 11, 2010 at 6:30 AM, Antoine Isaac wrote:
>> > That's really cool indeed! Yet it's quite puzzling: I don't know what kind
>> > of bias there is in this BTC dataset, but there seems to be a strange
>> > selection being made. To take a graph we know both quite well, it's just
>> > impossible that the full contained as few as 27,392 SKOS triples.
>> > Or have they captured a state in which did *not* contain LCSH?
>> > Do you have an idea?
> I re-downloaded the Billion Triple Challenge data files, since some of
> the gz files were corrupted (thanks Antoine), and re-ran the stats,
> which you can find again at:
> It's still not the ~2M triples from, but it is up to 40,376
> now. It's important to note that this number only includes assertions
> using SKOS predicates and classes, not other assertions related to the
> concept in vocabularies such as DublinCore. Oddly, the total number of
> SKOS triples for 2010 is 2,888,523, whereas for 2009 it was
> 21,883,510. I haven't really investigated much to see what accounts
> for the drop off.
> I like Dan's idea of using the BTC skos data as a mechanism for doing
> a more targeted crawl of SKOS data on the web. I'm more than a bit
> curious how easy it would be in practice.
> //Ed

> I meant to add that I also ran some statistics for host names that
> link to each other using skos mapping properties:
> //Ed

Received on Wednesday, 15 September 2010 15:35:10 UTC