RE: How many instances of foaf:Person are there in the LOD Cloud?

From: Hogan, Aidan <aidan.hogan@deri.org>
Date: Wed, 13 Apr 2011 11:54:19 +0100
Message-ID: <316ADBDBFE4F4D4AA4FEEF7496ECAEF905AB2C64@EVS1.ac.nuigalway.ie>
To: "Michael Brunnbauer" <brunni@netestate.de>, "Bernard Vatant" <bernard.vatant@mondeca.com>
Cc: "Linking Open Data" <public-lod@w3.org>
> re
> 
> BTW: The note on http://wiki.foaf-project.org/w/DataSources that the
> Billion Triples Challenge 2009 contains "40 million FOAFs" is a bit
> misleading. If you follow the link you can see that there are 39 mio
> "X a foaf:Person" assertions in the dataset which boils down to much
> less distinct foaf:Persons. We have ca. 40 mio "X a foaf:Person"
> assertions and ca. 3.5 mio distinct foaf:Persons.
> 

Just to throw an additional source into the ring: some stats on the top
25 classes and properties for a more recent SWSE crawl (May 2010) are
available at the end of this tech report:

http://www.deri.ie/fileadmin/documents/DERI-TR-2010-07-23.pdf -- p51

From a crawl of 1.1 billion quads (4 million RDF/XML docs), we found 163
million *quadruples* with rdf:type as predicate and foaf:Person as
value. As Bernard has already said, this does not correspond directly
to the number of unique members. Also, 1.1 billion quads is only a
sample... we try to sample an "evenish" number of documents from the
different domains to keep things "fair". (Details of the crawl are also
in the doc.)
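
To make the gap between assertion counts and distinct members concrete,
here is a minimal sketch in Python. It assumes the quads are already
parsed into (subject, predicate, object, context) tuples; the function
name count_persons and the example.org data are made up for
illustration and are not taken from the crawl or the tech report.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
FOAF_PERSON = "http://xmlns.com/foaf/0.1/Person"


def count_persons(quads):
    """Return (number of 'X a foaf:Person' quads, number of distinct X)."""
    assertions = 0
    subjects = set()
    for s, p, o, c in quads:
        if p == RDF_TYPE and o == FOAF_PERSON:
            assertions += 1   # every quad counts, even if repeated across documents
            subjects.add(s)   # each subject term is only counted once
    return assertions, len(subjects)


# Hypothetical sample: alice is typed in two documents, bob in one.
sample = [
    ("http://example.org/alice", RDF_TYPE, FOAF_PERSON, "http://example.org/doc1"),
    ("http://example.org/alice", RDF_TYPE, FOAF_PERSON, "http://example.org/doc2"),
    ("http://example.org/bob", RDF_TYPE, FOAF_PERSON, "http://example.org/doc1"),
    ("http://example.org/alice", "http://xmlns.com/foaf/0.1/knows",
     "http://example.org/bob", "http://example.org/doc1"),
]

assertions, distinct = count_persons(sample)
print(assertions, distinct)
```

Even the distinct-subject figure is only a rough bound: the same person
can appear under several URIs or blank nodes, and this sketch does no
owl:sameAs consolidation.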

A lot of data comes from hi5.com (which had much bigger than average
documents) and livejournal. See Table A.1 in the doc (p 50) for top 25
domains providing data. Again, the larger providers are only sampled.
Also, as Bernard alluded to, a lot of the FOAF data is of "low
quality"...

...oh, and last disclaimer: triple/quad counts mean very little when
taken out of context.

Cheers,
Aidan
Received on Wednesday, 13 April 2011 10:54:46 UTC
