- From: Hogan, Aidan <aidan.hogan@deri.org>
- Date: Wed, 13 Apr 2011 11:54:19 +0100
- To: "Michael Brunnbauer" <brunni@netestate.de>, "Bernard Vatant" <bernard.vatant@mondeca.com>
- Cc: "Linking Open Data" <public-lod@w3.org>
> re
>
> BTW: The note on http://wiki.foaf-project.org/w/DataSources that the
> Billion Triples Challenge 2009 contains "40 million FOAFs" is a bit
> misleading. If you follow the link you can see that there are 39 mio
> "X a foaf:Person" assertions in the dataset which boils down to much
> less distinct foaf:Persons. We have ca. 40 mio "X a foaf:Person"
> assertions and ca. 3.5 mio distinct foaf:Persons.
>

Just to throw an additional source into the ring: some stats on the top 25 classes and properties for a more recent SWSE crawl (May 2010) are available at the end of this tech report:

http://www.deri.ie/fileadmin/documents/DERI-TR-2010-07-23.pdf (p. 51)

From a crawl of 1.1 billion quads (4 million RDF/XML docs), we found 163 million *quadruples* with rdf:type as predicate and foaf:Person as value. As Bernard has already said, this does not directly correspond with the number of unique members.

Also, 1.1 billion quads is only a sample... we try to sample an "evenish" number of documents from the different domains to keep things "fair". (Details of the crawl are also in the doc.)

A lot of data comes from hi5.com (which had much bigger than average documents) and livejournal. See Table A.1 in the doc (p. 50) for the top 25 domains providing data. Again, the larger providers are only sampled.

Also, as Bernard alluded to, a lot of the FOAF data is of "low quality"...

...oh, and last disclaimer: triple/quad counts mean very little when taken out of context.

Cheers,
Aidan
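
To illustrate the distinction discussed in this message between the number of "X a foaf:Person" assertions and the number of distinct foaf:Persons, here is a minimal sketch in Python. The function name `count_person_assertions` and the example.org sample quads are made up for illustration and are not taken from the SWSE crawl or the tech report. It also treats each subject string as one individual, which is a simplification.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
FOAF_PERSON = "http://xmlns.com/foaf/0.1/Person"


def count_person_assertions(quads):
    """Return (number of 'X a foaf:Person' quads, number of distinct X).

    Each quad is a (subject, predicate, object, context) tuple of strings.
    """
    assertions = 0
    subjects = set()
    for subject, predicate, obj, _context in quads:
        if predicate == RDF_TYPE and obj == FOAF_PERSON:
            assertions += 1        # every matching quad counts
            subjects.add(subject)  # a repeated subject counts only once
    return assertions, len(subjects)


# Hypothetical sample: the same person is typed in two different documents.
sample_quads = [
    ("http://example.org/alice", RDF_TYPE, FOAF_PERSON, "http://example.org/doc1"),
    ("http://example.org/alice", RDF_TYPE, FOAF_PERSON, "http://example.org/doc2"),
    ("http://example.org/bob", RDF_TYPE, FOAF_PERSON, "http://example.org/doc2"),
    ("http://example.org/alice", "http://xmlns.com/foaf/0.1/knows",
     "http://example.org/bob", "http://example.org/doc1"),
]

assertions, distinct = count_person_assertions(sample_quads)
print(assertions, distinct)  # 3 assertions, 2 distinct subjects
```

The gap between the two numbers grows with how often the same subject is typed across documents, which is why an assertion count on its own says little about the number of people described.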
Received on Wednesday, 13 April 2011 10:54:46 UTC