Re: skos in billion-triple-challenge data

From: Ed Summers <ehs@pobox.com>
Date: Sat, 11 Sep 2010 07:26:24 -0400
Message-ID: <AANLkTinoO_D2w=dg_NdSNuSTMWCu=ELMM4woAyQ=6mb4@mail.gmail.com>
To: Antoine Isaac <aisaac@few.vu.nl>
Cc: public-esw-thes@w3.org
I just noticed that some of the BTC gzipped files I downloaded
yesterday using puf [1] were corrupted. So it's quite possible that
the initial stats I provided were truncated in places. I'm in the
process of re-downloading the .gz files (this time serially with wget)
and will rerun the stats once I've got them.

I guess it would be nice if the BTC folks included checksums for their
data files so people could make sure they traveled over the network
ok. But I guess having "unexpected end of file" errors when
decompressing is a good indicator too :-)
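
A minimal sketch of both checks in Python, assuming nothing about how
the BTC files are actually published: sha256_of computes a digest that
could be compared against a published checksum if one existed, and
gzip_is_intact decompresses a file to the end to catch the truncation
described above. Both function names and the file name in the usage
line are hypothetical.

```python
import gzip
import hashlib
import zlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of the file at `path`, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def gzip_is_intact(path, chunk_size=1 << 20):
    """Decompress the whole file; return False if it is truncated or corrupt."""
    try:
        with gzip.open(path, "rb") as f:
            # Discard the output; only reaching the end of the stream matters.
            while f.read(chunk_size):
                pass
    except (EOFError, OSError, zlib.error):
        # EOFError: stream ended early; OSError/zlib.error: bad header or data.
        return False
    return True
```

For example, gzip_is_intact("btc-2010-chunk-000.gz") (a made-up file
name) would return False for a download that was cut off partway. This
only detects damage that breaks the gzip stream; a checksum comparison
is the stronger test when the publisher provides one.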

//Ed

[1] http://puf.sourceforge.net/

On Sat, Sep 11, 2010 at 6:30 AM, Antoine Isaac <aisaac@few.vu.nl> wrote:
> On 9/11/10 3:45 AM, Ed Summers wrote:
>>
>> On a Friday whim (prompted by Dan Brickley) I downloaded the 2010
>> Billion Triple Challenge dataset to look and see how many SKOS
>> assertions there are in it, and from what domains. If you are
>> interested the results can be found at:
>>
>>   http://gist.github.com/574700
>>
>> //Ed
>>
>>
>
>
> Hi Ed,
>
> That's really cool indeed! Yet it's quite puzzling: I don't know what kind
> of bias there is in this BTC dataset, but there seems to be a strange
> selection being made. To take a graph we know both quite well, it's just
> impossible that the full id.loc.gov contained as few as 27,392 SKOS triples.
> Or have they captured a state in which id.loc.gov did *not* contain LCSH?
> Do you have an idea?
>
> Cheers,
>
> Antoine
>
Received on Saturday, 11 September 2010 11:26:53 GMT