Re: [void-discussion] Representing dataset statistics

From: Michel Dumontier <michel.dumontier@gmail.com>
Date: Wed, 29 Jan 2014 11:48:15 -0800
Message-ID: <CALcEXf6kW=p2YZ+nTfNh2DWueD6dSKaRP80b8zLQ-a=H8NSJyQ@mail.gmail.com>
To: void-discussion@googlegroups.com
Cc: HCLS IG <public-semweb-lifesci@w3.org>
One way to understand the contents of a dataset is to determine the
connectivity between the different elements of the data. One approach is to
indicate which object properties or datatype properties are linked to
the types. Another is to show how different types are connected to one
another (and which relation is used to connect them). From this you can
list them in tables or develop graphical overviews.
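
As a rough illustration of both kinds of summary (a sketch only, not taken
from any existing tool): the Python below counts which properties are used
with each type, and which types are linked to which other types and by what
relation. The function name connectivity_summary, the "ex:" identifiers, and
the toy triples are all made up for the example. Triples are assumed to be
plain (subject, predicate, object) string tuples, with literals written as
ordinary strings.

```python
from collections import Counter

RDF_TYPE = "rdf:type"  # assumed shorthand for the rdf:type predicate


def connectivity_summary(triples):
    """Count (type, property) usage and (type, property, type) links."""
    triples = list(triples)

    # Map each subject to the set of types asserted for it.
    types = {}
    for s, p, o in triples:
        if p == RDF_TYPE:
            types.setdefault(s, set()).add(o)

    type_property = Counter()  # (subject type, property) -> triple count
    type_links = Counter()     # (subject type, property, object type) -> count
    for s, p, o in triples:
        if p == RDF_TYPE:
            continue
        for s_type in types.get(s, ()):
            type_property[(s_type, p)] += 1
            # Objects with no known type (e.g. literals) contribute no link.
            for o_type in types.get(o, ()):
                type_links[(s_type, p, o_type)] += 1
    return type_property, type_links


# Hypothetical toy data, for illustration only.
triples = [
    ("ex:aspirin", "rdf:type", "ex:Drug"),
    ("ex:ibuprofen", "rdf:type", "ex:Drug"),
    ("ex:cox1", "rdf:type", "ex:Protein"),
    ("ex:aspirin", "ex:targets", "ex:cox1"),
    ("ex:ibuprofen", "ex:targets", "ex:cox1"),
    ("ex:aspirin", "ex:label", "Aspirin"),
]

type_property, type_links = connectivity_summary(triples)
for (s_type, prop, o_type), n in sorted(type_links.items()):
    print(s_type, prop, o_type, n)
```

The two counters map directly onto the two views: type_property gives the
rows for a table of properties per type, and type_links gives the edges for
a graphical overview of how types connect.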


On Wed, Jan 29, 2014 at 11:30 AM, Kjetil Kjernsmo <kjetil@kjernsmo.net> wrote:

> On Wednesday 29. January 2014 15.05.14 Richard Cyganiak wrote:
> > Less is probably more there. Unless you have a very concrete need for the
> > more complex constructs there (e.g., you have a federation framework that
> > requires exactly those statistics), then I'd recommend sticking to the
> > simplest constructs. If there is a particular number you want to include
> > that cannot be expressed with a simple VoID property, it may be better to
> > introduce a new property.
> >
> > I say this because the more complex constructs (e.g., clever stuff with
> > class and property partitions) tend to go unused and can be misleading.
>
> So, just a quick note from me too, as I'm doing some clever data profiling
> stuff for my Ph.D. ;-) Most of the statistics proposed here are useful for
> federation, as shown by Olaf Görlitz et al. in their SPLENDID paper.
> However, as I'm computing them in my code, I can only note that they are
> pretty heavy to compute, and indeed, it is quite unlikely that people will
> do it unless the data providers have a very compelling reason to.
>
> I've seen that in the last few days, Philip Stutz has been implementing
> cardinality caching in their Triplerush triple store. That's one case where
> it is likely that such statistics can be provided, since it becomes much
> more affordable to do. See https://github.com/uzh/triplerush
>
> Another case where they are likely to exist is when the statistics are used
> for internal optimizations.
>
> For all others, I think the key is to argue for *why* a certain piece of
> information is important to expose, keeping in mind that it is possibly
> demanding to produce. Just an IG recommendation is unlikely to suffice, I
> suspect; it would have to be of the form "to enable $foo, expose $bar".
>
> Cheers,
> Kjetil

Michel Dumontier
Associate Professor of Medicine (Biomedical Informatics), Stanford
Chair, W3C Semantic Web for Health Care and the Life Sciences Interest Group
Received on Wednesday, 29 January 2014 19:49:03 UTC
