- From: Ed Summers <ehs@pobox.com>
- Date: Sat, 11 Sep 2010 07:26:24 -0400
- To: Antoine Isaac <aisaac@few.vu.nl>
- Cc: public-esw-thes@w3.org
I just noticed that some of the BTC gzipped files I downloaded
yesterday using puf [1] had some corruption problems. So it's quite
possible that the initial stats I provided were truncated in places.
I'm in the process of re-downloading the .gz files (this time serially
with wget) and will rerun the stats once I've got them.

I guess it would be nice if the BTC folks included checksums for their
data files so people could make sure they traveled over the network
OK. But I guess having "unexpected end of file" errors when
decompressing is a good indicator too :-)

//Ed

[1] http://puf.sourceforge.net/

On Sat, Sep 11, 2010 at 6:30 AM, Antoine Isaac <aisaac@few.vu.nl> wrote:
> On 9/11/10 3:45 AM, Ed Summers wrote:
>>
>> On a Friday whim (prompted by Dan Brickley) I downloaded the 2010
>> Billion Triple Challenge dataset to look and see how many SKOS
>> assertions there are in it, and from what domains. If you are
>> interested the results can be found at:
>>
>> http://gist.github.com/574700
>>
>> //Ed
>>
>
> Hi Ed,
>
> That's really cool indeed! Yet it's quite puzzling: I don't know what kind
> of bias there is in this BTC dataset, but there seems to be a strange
> selection being made. To take a graph we know both quite well, it's just
> impossible that the full id.loc.gov contained so few as 27,392 SKOS triples.
> Or have they captured a state in which id.loc.gov did *not* contain LCSH?
> Do you have an idea?
>
> Cheers,
>
> Antoine
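
For anyone wanting to check their own downloads for the kind of truncation described above, here is a minimal sketch in Python. It is not what was used for the stats in this thread. The function names and file paths are hypothetical, and since the BTC site is not known to publish checksums, the digest is only useful for comparing two copies of the same file.

```python
import gzip
import hashlib
import zlib


def gzip_is_intact(path, chunk_size=1 << 20):
    """Return True if the whole gzip stream decompresses without error.

    A truncated download typically fails with EOFError (the "unexpected
    end of file" case); corrupted bytes show up as OSError or zlib.error.
    """
    try:
        with gzip.open(path, "rb") as f:
            # Read and discard the data; we only care whether it decodes.
            while f.read(chunk_size):
                pass
        return True
    except (EOFError, OSError, zlib.error):
        return False


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of the raw (still compressed) file."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()
```

Calling `gzip_is_intact("some-btc-file.gz")` (hypothetical file name) on each downloaded file before running any counts would flag the truncated ones, and comparing `sha256_of` across two independent downloads would catch silent differences that still decompress cleanly.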
Received on Saturday, 11 September 2010 11:26:53 UTC