
Re: Subsetting data

From: Rob Atkinson <rob@metalinkage.com.au>
Date: Fri, 01 Jan 2016 23:12:36 +0000
Message-ID: <CACfF9LzfGwMosZ1xOd7ZMaAuMq1JxYs8EDg0DUYAyg_VOOy5UQ@mail.gmail.com>
To: Peter Baumann <p.baumann@jacobs-university.de>, Phil Archer <phila@w3.org>, Manolis Koubarakis <koubarak@di.uoa.gr>, "public-sdw-comments@w3.org" <public-sdw-comments@w3.org>, Annette Greiner <amgreiner@lbl.gov>, Eric Stephan <ericphb@gmail.com>, "Tandy, Jeremy" <jeremy.tandy@metoffice.gov.uk>, public-dwbp-comments@w3.org
From my reading of this conversation - as someone on the fringes who has
played with a fair bit of the implementation practicalities, but is
primarily interested in identifying and promulgating best practices to a
wider audience - there is a lot of discussion about the actual meanings of
terms. You may not want to develop a formal ontology, but you will
need to define these terms before anyone can make sense of the discussion.

In particular, I still feel there is no fully developed consensus or
consistent terminology around the distinction between:
1) A conceptual query that embodies specific semantics, such as "the
latest reported temperature of the air using methodology X at location Y".
Obviously such things need identifiers so we can repeat them and attach
useful metadata to them.
2) The results of such a query (in this case a subset of the air
temperature reading record set, starting from T1 and updated every T
minutes).

I think there is a consensus that the actual query mechanism should be
decoupled from the identifier by URI dereferencing, and should not be part
of the URI.
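
A minimal sketch of that decoupling, under stated assumptions: the URI,
endpoints and query strings are made up for illustration, and
`SubsetResolver` and `QueryBinding` are hypothetical names, not terms the
WG has defined. The subset URI stays opaque; only the binding behind it
changes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QueryBinding:
    """How a subset is currently fetched; may change without changing its URI."""
    endpoint: str
    query: str


class SubsetResolver:
    """Hypothetical intermediate layer: opaque subset URI -> current binding."""

    def __init__(self):
        self._bindings = {}

    def bind(self, subset_uri, binding):
        # Re-binding swaps the query mechanism; the identifier is untouched.
        self._bindings[subset_uri] = binding

    def dereference(self, subset_uri):
        return self._bindings[subset_uri]


# Illustrative values only (assumptions, not real services).
SUBSET = "http://example.org/id/subset/latest-air-temp-Y"

resolver = SubsetResolver()
resolver.bind(SUBSET, QueryBinding(
    "http://example.org/sparql",
    "SELECT ?t WHERE { ?obs ex:location ex:Y ; ex:value ?t }"))
first = resolver.dereference(SUBSET)

# Swap the mechanism for a REST-style API; the same URI still resolves.
resolver.bind(SUBSET, QueryBinding(
    "http://example.org/api", "obs?location=Y&latest=true"))
second = resolver.dereference(SUBSET)
```

Nothing about the query language leaks into `SUBSET`, so the intermediate
layer can be replaced as long as it returns the same subset.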

I am not sure I can see a consensus regarding the endpoint of the query.
If this is part of the dereferencing, so that dereferencing results in a
composite entity (the endpoint, the actual query used at that endpoint,
and the result returned), then perhaps those things are all properties of
the "subset"? Of course we need to separate the query and the result,
because the result may be huge, and we may invoke the query at a
different time to when we retrieved it. At this point I could model a
system, but I would be struggling to know exactly what terms the WG is
using for the different parts of the puzzle, and would want to cite those.
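
One way such a composite entity could be modelled, purely as a sketch:
every class, field name and URI below is an assumption made for
illustration, precisely because the WG terms are not yet settled. The
result is held by reference with its own timestamp, so it stays separate
from the query that produced it.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class ConceptualQuery:
    uri: str          # identifier, so the query can be repeated and cited
    description: str


@dataclass(frozen=True)
class QueryResult:
    uri: str                # referenced, not embedded: the result may be huge
    retrieved_at: datetime  # may differ from when the query was defined


@dataclass
class Subset:
    uri: str
    query: ConceptualQuery
    endpoint: str
    concrete_query: str
    latest_result: Optional[QueryResult] = None

    def record_result(self, result_uri, when):
        """Attach a reference to a result obtained at time `when`."""
        self.latest_result = QueryResult(result_uri, when)
        return self.latest_result


# Illustrative values only.
subset = Subset(
    uri="http://example.org/id/subset/latest-air-temp-Y",
    query=ConceptualQuery(
        "http://example.org/id/query/latest-air-temp",
        "latest reported air temperature, methodology X, location Y"),
    endpoint="http://example.org/api",
    concrete_query="obs?location=Y&latest=true",
)
result = subset.record_result(
    "http://example.org/results/42",
    datetime(2016, 1, 1, 23, 0, tzinfo=timezone.utc))
```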

Hmm... during implementation, being able to cite the elements of this
information model via a URI would be important.

So if the WG isn't able to define such an ontology, but one is needed to
build a system that implements the reasonably complex semantics
involved, who would develop such an ontology? How do you stop N ad-hoc
ontologies emerging from N implementations of these best practices?

For the record, I have played with integrating VoID, RDF Data Cube, the
Linked Data API and IETF URL templating, and have been able to handle the
dereferencing aspects and make all the metadata accessible. One thing
missing, though, is a lightweight ontology to define whether an endpoint
returns a subset of a resource, and what type of subset. VoID supports
type-based partitioning as well as overlapping subsets, but this isn't
quite powerful enough to handle the sort of use cases here. You could
perhaps interpret the existence of an RDF-QB dimension description
attached to an endpoint as an implicit statement that the endpoint provides
subsetting on dimensions, but would that scale to handle cases where
subsets are well known and QB is overkill and too high an entry bar?
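
To make the gap concrete, here is a rough sketch of what a lightweight
endpoint description plus URL templating might look like. The keys
`subsetOf`, `subsetType` and `template` are invented for illustration and
belong to no existing vocabulary; `expand` handles only the simplest
`{var}` form of URI templates, not the full IETF specification.

```python
import re
from urllib.parse import quote

# Hypothetical description of an endpoint that returns subsets.
endpoint_description = {
    "subsetOf": "http://example.org/dataset/air-temperature",
    "subsetType": "dimension-slice",   # made-up term for illustration
    "template": "http://example.org/obs/{location}/{method}",
}


def expand(template, values):
    """Expand simple {var} expressions, percent-encoding each value.

    Undefined variables expand to the empty string.
    """
    def substitute(match):
        return quote(str(values.get(match.group(1), "")), safe="")
    return re.sub(r"\{(\w+)\}", substitute, template)


subset_url = expand(endpoint_description["template"],
                    {"location": "Site Y", "method": "X"})
# subset_url == "http://example.org/obs/Site%20Y/X"
```

A client that reads only these three keys can tell that the endpoint
yields a subset, of what, and how to address it, without needing a full
QB dimension description.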

Rob Atkinson

On Sat, 2 Jan 2016 at 01:53 Peter Baumann <p.baumann@jacobs-university.de>
wrote:

> have added comments and filled placeholders. As I do not have write
> permissions this has created a fork:
> https://github.com/w3c/sdw/compare/gh-pages...pebau:patch-1
> -Peter
> On 2015-12-30 19:31, Phil Archer wrote:
> > At various times in recent months I have promised to look into the
> > topic of persistent identifiers for subsets of data. This came up at
> > the SDW F2F in Sapporo but has also been raised by Annette in DWBP. In
> > between festive activities I've been giving this some thought and have
> > tried to begin to commit some ideas to a page [1].
> >
> > During the CEO-LD meeting, Jeremy pointed to OpenSearch as a possible
> > way forward, including its geo-temporal extensions defined by the OGC.
> > There is also the Linked Data API as a means of doing this, and what
> > they both have in common is that they offer an intermediate layer that
> > turns a URL into a query.
> >
> > How do you define a persistent identifier for a subset of a dataset?
> > IMO you mint a URI and say "this identifies a subset of a dataset" -
> > and then provide a means of programmatically going from the URI to a
> > query that returns the subset. As long as you can replace the
> > intermediate layer with another one that also returns the same subset,
> > we're done.
> >
> > The UK Government Linked Data examples tend to be along the lines of:
> >
> > http://transport.data.gov.uk/id/stations
> > returns a list of all stations in Britain.
> >
> > http://transport.data.gov.uk/id/stations/Manchester
> > returns a list of stations in Manchester
> >
> > http://transport.data.gov.uk/id/stations/Manchester/Piccadilly
> > identifies Manchester Piccadilly station.
> >
> > All of that data of course comes from a single dataset.
> >
> > Does this work in the real worlds of meteorology and UBL/PNNL?
> >
> > Phil.
> >
> >
> >
> >
> > [1] https://github.com/w3c/sdw/blob/gh-pages/subsetting/index.md
> >
> >
> >
> >
> --
> Dr. Peter Baumann
>  - Professor of Computer Science, Jacobs University Bremen
>    www.faculty.jacobs-university.de/pbaumann
>    mail: p.baumann@jacobs-university.de
>    tel: +49-421-200-3178, fax: +49-421-200-493178
>  - Executive Director, rasdaman GmbH Bremen (HRB 26793)
>    www.rasdaman.com, mail: baumann@rasdaman.com
>    tel: 0800-rasdaman, fax: 0800-rasdafax, mobile: +49-173-5837882
> "Si forte in alienas manus oberraverit hec peregrina epistola incertis
> ventis dimissa, sed Deo commendata, precamur ut ei reddatur cui soli
> destinata, nec preripiat quisquam non sibi parata." (mail disclaimer, AD
> 1083)
Received on Friday, 1 January 2016 23:13:29 UTC
