
Re: skos in billion-triple-challenge data

From: Dan Brickley <danbri@danbri.org>
Date: Sat, 11 Sep 2010 12:44:33 +0200
Message-Id: <F1567859-63D1-4F8E-8435-03C9AB155C91@danbri.org>
Cc: Ed Summers <ehs@pobox.com>, "public-esw-thes@w3.org" <public-esw-thes@w3.org>
To: Antoine Isaac <aisaac@few.vu.nl>




On 11 Sep 2010, at 12:30, Antoine Isaac <aisaac@few.vu.nl> wrote:

> On 9/11/10 3:45 AM, Ed Summers wrote:
>> On a Friday whim (prompted by Dan Brickley) I downloaded the 2010
>> Billion Triple Challenge dataset to look and see how many SKOS
>> assertions there are in it, and from what domains. If you are
>> interested the results can be found at:
>> 
>>   http://gist.github.com/574700
>> 
>> //Ed
>> 
>> 
> 
> 
> Hi Ed,
> 
> That's really cool indeed! Yet it's quite puzzling: I don't know what kind of bias there is in this BTC dataset, but a strange selection seems to have been made. To take a graph we both know quite well, it's just impossible that the full id.loc.gov contained as few as 27,392 SKOS triples. Or have they captured a state in which id.loc.gov did *not* contain LCSH?
> Do you have an idea?
> 

I also have the sense that these big general crawls can be a bit lumpy/quirky. However, they can give us starting points for recrawling more specific targets more comprehensively.

Dan


> Cheers,
> 
> Antoine
> 
Received on Saturday, 11 September 2010 10:44:19 GMT
