Re: [void-discussion] Representing dataset statistics

On Wednesday 29. January 2014 15.05.14 Richard Cyganiak wrote:
> Less is probably more there. Unless you have a very concrete need for the
> more complex constructs there (e.g., you have a federation framework that
> requires exactly those statistics), then I'd recommend sticking to the
> simplest constructs. If there is a particular number you want to include
> that cannot be expressed with a simple VoID property, it may be better to
> introduce a new property.
> 
> I say this because the more complex constructs (e.g., clever stuff with
> class and property partitions) tend to go unused and can be misleading.

So, just a quick note from me too, as I'm doing some clever data profiling
stuff for my PhD ;-) Most of the statistics proposed here are useful for
federation, as shown by Olaf Görlitz et al. in their SPLENDID paper. However,
as I'm computing them in my code, I can only note that they are pretty heavy
to compute, and indeed, it is quite unlikely that data providers will do it
unless they have a very compelling reason.
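
To give a feel for where the cost comes from, here is a minimal Python sketch
of computing a few VoID-style counts (overall and per property partition) in
one pass over a dataset. The function name, the tuple representation of
triples and the example data are made up for illustration; this is not the
method from the SPLENDID paper, nor my actual profiling code.

```python
from collections import defaultdict


def void_statistics(triples):
    """Compute a few VoID-style counts from an iterable of (s, p, o) tuples.

    Hypothetical helper: the result keys mirror VoID property names
    (void:triples, void:distinctSubjects, ...) but this is only a sketch.
    """
    subjects, objects = set(), set()
    # One bucket per predicate, roughly corresponding to a void:propertyPartition.
    per_property = defaultdict(
        lambda: {"triples": 0, "subjects": set(), "objects": set()}
    )
    total = 0
    for s, p, o in triples:
        total += 1
        subjects.add(s)
        objects.add(o)
        part = per_property[p]
        part["triples"] += 1
        part["subjects"].add(s)
        part["objects"].add(o)
    return {
        "triples": total,
        "distinctSubjects": len(subjects),
        "distinctObjects": len(objects),
        "propertyPartitions": {
            p: {
                "triples": v["triples"],
                "distinctSubjects": len(v["subjects"]),
                "distinctObjects": len(v["objects"]),
            }
            for p, v in per_property.items()
        },
    }


# Made-up example data, with terms written as plain strings.
EXAMPLE_TRIPLES = [
    ("ex:a", "rdf:type", "ex:Person"),
    ("ex:b", "rdf:type", "ex:Person"),
    ("ex:a", "foaf:knows", "ex:b"),
    ("ex:a", "foaf:name", '"Alice"'),
]
stats = void_statistics(EXAMPLE_TRIPLES)
```

Even in this toy version, the distinct-subject and distinct-object counts
require keeping a set of terms per property in memory, which is the part that
gets expensive on a large dataset.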

I've seen that in the last few days, Philip Stutz has been implementing
cardinality caching in the TripleRush triple store. That's one case where it
is likely that such statistics can be provided, since they become much more
affordable to compute. See https://github.com/uzh/triplerush

Another case where they are likely to exist is when the statistics are used
for internal optimizations.

For all others, I think the key is to argue for *why* a certain piece of 
information is important to expose, keeping in mind that it is possibly 
demanding to produce. Just an IG recommendation is unlikely to suffice, I
suspect; it would have to be of the form "to enable $foo, expose $bar".

Cheers,

Kjetil

Received on Wednesday, 29 January 2014 19:31:07 UTC