- From: Kingsley Idehen <kidehen@openlinksw.com>
- Date: Wed, 06 Apr 2011 09:07:49 -0400
- To: Hugh Glaser <hg@ecs.soton.ac.uk>
- CC: "<nathan@webr3.org>" <nathan@webr3.org>, "public-lod@w3.org" <public-lod@w3.org>, "semantic-web@w3.org" <semantic-web@w3.org>
On 4/6/11 8:30 AM, Hugh Glaser wrote:
> On 4 Apr 2011, at 15:16, Kingsley Idehen wrote:
>
>> On 4/4/11 10:06 AM, Nathan wrote:
>>> Kingsley Idehen wrote:
>>>> On 4/3/11 11:41 PM, Nathan wrote:
>>>>> Hi Kingsley, All,
>>>>>
>>>>> Incoming open request: could anybody provide similar statistics for the usage of each datatype in the wild (e.g. the xsd types, XMLLiteral and rdf plain literal)?
>>>>>
>>>>> Ideally, Kingsley, could you provide a breakdown from the LOD cloud cache? It would be very, very useful to know.
>>>>>
>>>>> Best & TIA,
>>>>>
>>>>> Nathan
>>>>>
>>>>> Kingsley Idehen wrote:
>>>>>> I've knocked up a Google spreadsheet that contains stats about our 21 Billion Triples+ LOD cloud cache.
>>>>> ...
>>>>>> https://spreadsheets.google.com/ccc?key=0AihbIyhlsQSxdHViMFdIYWZxWE85enNkRHJwZXV4cXc&hl=en -- LOD Cloud Cache SPARQL stats queries and results
>>>> Nathan,
>>>>
>>>> The typed literals used in > 10k triples:
>>>>
>>>> count      datatype IRI
>>>> 11308      xsd:anyURI
>>>> 12553      http://dbpedia.org/datatype/day
>>>> 12788      http://dbpedia.org/ontology/day
>>>> 15875      http://dbpedia.org/ontology/usDollar
>>>> 18228      http://dbpedia.org/datatype/usDollar
>>>> 20828      http://europeanaconnect.eu/voc/fondazione/sgti#fondazioneNot
>>>> 22934      http://statistics.data.gov.uk/def/administrative-geography/StandardCode
>>>> 23368      http://www.w3.org/2001/XMLSchema#date
>>>> 30695      http://dbpedia.org/datatype/inhabitantsPerSquareKilometre
>>>> 31662      http://dbpedia.org/datatype/second
>>>> 35506      http://dbpedia.org/datatype/kilometre
>>>> 57409      http://www.w3.org/2001/XMLSchema#int
>>>> 160117     http://stitch.cs.vu.nl/vocabularies/rameau/RecordNumber
>>>> 632256     http://www.w3.org/2001/XMLSchema#anyURI
>>>> 1175435    xsd:string
>>>> 1696035    http://data.ordnancesurvey.co.uk/ontology/postcode/Postcode
>>>> 70194534   http://www.openlinksw.com/schemas/virtrdf#Geometry
>>>> 120147725  http://www.w3.org/2001/XMLSchema#string
>>>>
>>>> Spreadsheet will be updated too.
>>>>
>>> Thanks Kingsley, very much appreciated!
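[A breakdown like the one above can be produced with a query of roughly this shape. This is only a sketch -- the exact queries used are in the linked spreadsheet; the 10k threshold and the SPARQL 1.1 aggregate/BIND syntax are assumptions about the setup, not taken from the thread:]

```sparql
# Sketch: count typed literals per datatype IRI, keeping those
# appearing in more than 10,000 triples (threshold assumed from
# the thread; the actual queries live in the linked spreadsheet).
SELECT ?dt (COUNT(*) AS ?cnt)
WHERE {
  ?s ?p ?o .
  FILTER (isLiteral(?o))
  BIND (DATATYPE(?o) AS ?dt)
}
GROUP BY ?dt
HAVING (COUNT(*) > 10000)
ORDER BY ASC(?cnt)
```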
:)

>>> I have to admit I'm surprised by the lack of xsd:double and xsd:decimal in the two stats sets, and also by the inclusion of some datatypes I'd never even heard of!
>>>
>>> Are there any Virtuoso-specific nuances which do some conversion, or are all of these as found in the serialized RDF?
>>>
>>> Also, is xsd:string automatically set for all plain literals (with / without langs)?
>>>
>>> Cheers,
>>>
>>> Nathan
>>>
>> Data comes from an internal table in Virtuoso. Note, a threshold has been set, so what you are seeing is a picture relative to the total amount of data (21 Billion+ triples).

> Hi Kingsley.
> Thanks.
> So these numbers are absolute numbers of some fraction of the dataset?

At a point in time, bearing in mind we continue to load datasets as we discover them.

> It would be good if that could be made clear - I certainly read your first message as being over the whole set, as I think did Dave and Nathan.

I truly believe the SPARQL queries and column text make these numbers crystal clear. We've gone for values within a range. We'll ultimately make a VoID graph for this instance.

> Perhaps it would be clearer to present as a percentage?

Yes, that's an idea.

> Also, if that is the case, is it a random sample, or might there be some artefacts in the system that skew towards some graphs or datasets?

Yes, large datasets like RPI's (6.4 Billion) do skew the dataset somewhat.

We are considering exposing a WebID-protected SPARQL endpoint so that specific Agents are given wider access to the data space. Said Agents could then be allowed to do things like:

1. Unrestricted counts
2. Sponging -- i.e., adding datasets to the corpus that might not have been discovered
3. SPARUL -- which goes beyond Sponging, i.e., letting specific Agents massage the data via SPARUL, etc.

More than anything else, we are trying to establish a starting point for these matters. I published the spreadsheets with the following in mind:
1. provide information about the magnitude of the data space;
2. provide context for many of the demonstrations I publish using the blue faceted-browser-based description pages.

As you know, performing 'Precision Find' against a data space of this magnitude -- where the starting point is a text pattern and the beholder is allowed to subjectively disambiguate across Type and other Attribute dimensions while filtering -- is a significant capability in the Linked Data realm. Ditto the broader Big Data realm.

I am hoping we get to 26 Billion+ once we get the Linked Life Dataset (approx. 5 Billion triples). The only hold-up right now is the actual release of the dataset. We also have a significant number of triples coming in from the YAGO2 data set.

Kingsley

> Best
> Hugh
>>
>> --
>>
>> Regards,
>>
>> Kingsley Idehen
>> President & CEO
>> OpenLink Software
>> Web: http://www.openlinksw.com
>> Weblog: http://www.openlinksw.com/blog/~kidehen
>> Twitter/Identi.ca: kidehen
>>

--

Regards,

Kingsley Idehen
President & CEO
OpenLink Software
Web: http://www.openlinksw.com
Weblog: http://www.openlinksw.com/blog/~kidehen
Twitter/Identi.ca: kidehen
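[As a footnote to the percentage presentation Hugh suggests upthread: counts could be normalised against the total number of literals with a nested-subquery pattern. A sketch only, assuming an endpoint with SPARQL 1.1 subquery and aggregate support; this query does not appear in the thread or the spreadsheet:]

```sparql
# Sketch: express each datatype's triple count as a percentage of
# all literals, via two subqueries (SPARQL 1.1 support assumed).
SELECT ?dt ((?cnt * 100.0 / ?total) AS ?pct)
WHERE {
  {
    SELECT ?dt (COUNT(*) AS ?cnt)
    WHERE { ?s ?p ?o . FILTER (isLiteral(?o)) BIND (DATATYPE(?o) AS ?dt) }
    GROUP BY ?dt
  }
  {
    SELECT (COUNT(*) AS ?total)
    WHERE { ?s ?p ?o . FILTER (isLiteral(?o)) }
  }
}
ORDER BY DESC(?pct)
```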
Received on Wednesday, 6 April 2011 13:08:15 UTC