Re: LOD Cloud Cache Stats from Kingsley Idehen on 2011-04-06 (public-lod@w3.org from April 2011)

From: Kingsley Idehen <kidehen@openlinksw.com>
Date: Wed, 06 Apr 2011 09:07:49 -0400
To: Hugh Glaser <hg@ecs.soton.ac.uk>
CC: "<nathan@webr3.org>" <nathan@webr3.org>, "public-lod@w3.org" <public-lod@w3.org>, "semantic-web@w3.org" <semantic-web@w3.org>
Message-ID: <4D9C65A5.5030202@openlinksw.com>
On 4/6/11 8:30 AM, Hugh Glaser wrote:
> On 4 Apr 2011, at 15:16, Kingsley Idehen wrote:
>
>> On 4/4/11 10:06 AM, Nathan wrote:
>>> Kingsley Idehen wrote:
>>>> On 4/3/11 11:41 PM, Nathan wrote:
>>>>> Hi Kingsley, All,
>>>>>
>>>>> Incoming open request, could anybody provide similar statistics for the usage of each datatype in the wild (e.g. the xsd types, xmlliteral and rdf plain literal)?
>>>>>
>>>>> Ideally Kingsley, could you provide a breakdown from the lod cloud cache? would be very very useful to know.
>>>>>
>>>>> Best & TIA,
>>>>>
>>>>> Nathan
>>>>>
>>>>> Kingsley Idehen wrote:
>>>>>> I've knocked up a Google spreadsheet that contains stats about our 21 Billion Triples+ LOD cloud cache.
>>>>> ...
>>>>>> https://spreadsheets.google.com/ccc?key=0AihbIyhlsQSxdHViMFdIYWZxWE85enNkRHJwZXV4cXc&hl=en -- LOD Cloud Cache SPARQL stats queries and results
>>>> Nathan,
>>>>
>>>> The typed literals used in > 10k triples:
>>>>
>>>> count      datatype IRI
>>>> 11308      xsd:anyURI
>>>> 12553      http://dbpedia.org/datatype/day
>>>> 12788      http://dbpedia.org/ontology/day
>>>> 15875      http://dbpedia.org/ontology/usDollar
>>>> 18228      http://dbpedia.org/datatype/usDollar
>>>> 20828      http://europeanaconnect.eu/voc/fondazione/sgti#fondazioneNot
>>>> 22934      http://statistics.data.gov.uk/def/administrative-geography/StandardCode
>>>> 23368      http://www.w3.org/2001/XMLSchema#date
>>>> 30695      http://dbpedia.org/datatype/inhabitantsPerSquareKilometre
>>>> 31662      http://dbpedia.org/datatype/second
>>>> 35506      http://dbpedia.org/datatype/kilometre
>>>> 57409      http://www.w3.org/2001/XMLSchema#int
>>>> 160117     http://stitch.cs.vu.nl/vocabularies/rameau/RecordNumber
>>>> 632256     http://www.w3.org/2001/XMLSchema#anyURI
>>>> 1175435    xsd:string
>>>> 1696035    http://data.ordnancesurvey.co.uk/ontology/postcode/Postcode
>>>> 70194534   http://www.openlinksw.com/schemas/virtrdf#Geometry
>>>> 120147725  http://www.w3.org/2001/XMLSchema#string
>>>>
>>>> Spreadsheet will be updated too.
>>>>
>>> Thanks Kingsley, very much appreciated! :)
>>>
>>> I have to admit I'm surprised by the lack of xsd:double and xsd:decimal in the two stats sets, and also the inclusion of some datatypes I'd never even heard of!
>>>
>>> Are there any Virtuoso-specific nuances which do some conversion, or are all of these as found in the serialized RDF?
>>>
>>> Also, is xsd:string automatically set for all plain literals (with / without langs)?
>>>
>>> Cheers,
>>>
>>> Nathan
>>>
>>>
>> Data comes from an internal table in Virtuoso. Note, a threshold has been set, so what you are seeing is a picture relative to the total amount of data (21 Billion+ triples).
> Hi Kingsley.
> Thanks.
> So these numbers are absolute numbers of some fraction of the dataset?

At a point in time, bearing in mind that we continue to load datasets 
as we discover them.
> It would be good if that could be made clear - I certainly read your first message as being over the whole set, as I think did Dave and Nathan.

I truly believe the SPARQL queries and column text make these numbers 
crystal clear. We've gone for values within a range. We'll ultimately 
make a VoiD graph for this instance.

> Perhaps it would be clearer to present as a percentage?

Yes, that's an idea.
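
As a rough sketch of what a percentage view could look like, the Python 
below converts a handful of the counts from the datatype table quoted 
above into shares. The subset of rows, the function name 
as_percentages, and the choice of denominator (the sum of the listed 
rows rather than the full 21 Billion+ triples) are assumptions made for 
illustration; this is not how the spreadsheet itself is computed.

```python
# Counts copied from the datatype table quoted earlier in this message
# (an illustrative subset, not the full list).
counts = {
    "http://www.w3.org/2001/XMLSchema#string": 120147725,
    "http://www.openlinksw.com/schemas/virtrdf#Geometry": 70194534,
    "http://data.ordnancesurvey.co.uk/ontology/postcode/Postcode": 1696035,
    "http://www.w3.org/2001/XMLSchema#anyURI": 632256,
    "http://www.w3.org/2001/XMLSchema#int": 57409,
}


def as_percentages(counts):
    """Return each count as a percentage of the sum of all given counts."""
    total = sum(counts.values())
    return {iri: 100.0 * n / total for iri, n in counts.items()}


shares = as_percentages(counts)

# Print the largest share first.
for iri, pct in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{pct:8.4f}%  {iri}")
```

A percentage only means something once the denominator is stated, so 
any published version would need to say whether it is relative to the 
listed rows, to all typed literals, or to all triples.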

> Also, if that is the case, is it a random sample, or might there be some artefacts in the system that skew towards some graphs or datasets?

Yes, large datasets like RPI's (6.4 Billion) do skew the dataset somewhat.
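
To put a rough number on that skew, here is a back-of-the-envelope 
sketch in Python. It assumes 21 Billion as the total; since the real 
total is "21 Billion+", the true share is somewhat lower. The variable 
names are illustrative only.

```python
rpi_triples = 6.4e9     # RPI's datasets, as stated above
total_triples = 21e9    # "21 Billion+": treated here as a lower bound

# Upper bound on the fraction of the cache contributed by RPI's datasets.
rpi_share = rpi_triples / total_triples
print(f"RPI share of the cache: at most about {rpi_share:.1%}")
```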

We are considering exposing a WebID-protected SPARQL endpoint so that 
specific Agents are given wider access to the data space. Said Agents 
could then be allowed to do things like:

1. Unrestricted counts
2. Sponging -- i.e., adding datasets to the corpus that might not have 
been discovered
3. SPARUL -- which goes beyond Sponging, i.e., letting specific Agents 
massage the data via SPARUL, etc.

More than anything else, we are trying to establish a starting point for 
these matters.

I published the spreadsheets with the following in mind:

1. provide information about the magnitude of the data space
2. provide context for many of the demonstrations I publish using the 
blue faceted-browser-based description pages.

As you know, performing 'Precision Find' against a data space of this 
magnitude, where the starting point is a Text Pattern and the beholder 
is allowed to subjectively disambiguate across Type and other Attribute 
dimensions while filtering, is a significant capability in the Linked 
Data realm. Ditto for the broader Big Data realm.

I am hoping we get to 26 Billion+ once we get the Linked Life Dataset 
(approx 5 Billion triples). The only holdup right now is the actual 
release of the Dataset. We also have a significant number of triples 
coming in from the YAGO2 data set.


Kingsley
> Best
> Hugh
>>
>> -- 
>>
>> Regards,
>>
>> Kingsley Idehen
>> President & CEO
>> OpenLink Software
>> Web: http://www.openlinksw.com
>> Weblog: http://www.openlinksw.com/blog/~kidehen
>> Twitter/Identi.ca: kidehen
>>


-- 

Regards,

Kingsley Idehen
President & CEO
OpenLink Software
Web: http://www.openlinksw.com
Weblog: http://www.openlinksw.com/blog/~kidehen
Twitter/Identi.ca: kidehen
Received on Wednesday, 6 April 2011 13:10:16 UTC