Re: Subsetting data

From: Peter Baumann <p.baumann@jacobs-university.de>
Date: Sun, 3 Jan 2016 10:21:28 +0100
To: Rob Atkinson <rob@metalinkage.com.au>, <Simon.Cox@csiro.au>, <phila@w3.org>, <public-sdw-comments@w3.org>, <public-dwbp-comments@w3.org>
Message-ID: <5688E818.4030800@jacobs-university.de>
+1
First of all, a model is needed for what is to be subset: metadata records,
vector bundles, pixel arrays, etc. Sometimes such a model already exists, as
for coverages (which can also represent continuous fields, BTW) together with
the corresponding subsetting service, WCS. Note that the general coverage query
language, WCPS, is clearly separate, so a decision can be made on the complexity
to be supported. I guess, though, that after subsetting, aggregation, fusion,
etc. will get on the agenda.

On 2016-01-03 00:46, Rob Atkinson wrote:
> I share this unease of equating query result with subsets. Maybe there is a
> theory which makes this work - but you can generate a query over a continuous
> field - maybe that's just a set with an infinite number of members? Queries can
> also operate on a set (or subset) and perform arbitrary functions - you could
> generate a continuous field from a set of discrete point observations, for
> example. If these are "subsets", then a statement of the basic theory of
> what a subset means is needed before the debate can move on (and before any
> useful explanation of the conclusions can be shared).
>  
> If not - then there is some set of relationships that need to be named between
> separate types of things identified so far in this thread:

I'll give it a try below for the specific case of coverages, just as an example:

> 1) query (logical)

a WCS (or WCPS) operation

> 2) query result (logical - what it means)

a coverage or a scalar; in case of WCPS, can be a set thereof

> 3) query endpoint (discovered after URI resolution)

the server which offers the coverage under consideration. It has a Capabilities
document describing its holdings and service capabilities.

> 4) query invocation (used to get result)

the concrete request instance, which can be encoded using GET/KVP, XML/POST,
SOAP, or REST (in future: JSON)
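
As a rough illustration of such a request instance, here is a minimal Python
sketch that assembles a KVP-encoded WCS 2.0 GetCoverage request. The endpoint,
the coverage id, and the axis names are made up for the example; a real server
advertises its own in its Capabilities document.

```python
from urllib.parse import urlencode

# Hypothetical endpoint, used only for illustration.
ENDPOINT = "https://example.org/wcs"


def get_coverage_url(coverage_id, subsets, fmt="image/tiff"):
    """Build a WCS 2.0 GetCoverage request in KVP (HTTP GET) encoding.

    subsets maps an axis name to a (low, high) pair; axes left out are
    not constrained by the request.
    """
    params = [
        ("service", "WCS"),
        ("version", "2.0.1"),
        ("request", "GetCoverage"),
        ("coverageId", coverage_id),
    ]
    # One "subset" parameter per constrained axis, e.g. subset=Lat(40,50).
    for axis, (low, high) in subsets.items():
        params.append(("subset", f"{axis}({low},{high})"))
    params.append(("format", fmt))
    # urlencode percent-escapes the parentheses and commas.
    return ENDPOINT + "?" + urlencode(params)


# Hypothetical coverage id and axis names.
url = get_coverage_url("AvgLandTemp", {"Lat": (40, 50), "Long": (10, 20)})
print(url)
```

The same logical query could equally be sent as XML via POST or SOAP; only the
encoding of the invocation changes, not the query itself.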

> 5) encoded artefact containing query result

a coverage encoded in a suitable format, such as XML, HDF, TIFF

> 6) the set of data being queried

generally: the server's offerings. Note that this offering is concisely defined
in WCS through a single UML diagram.
here (since subsetting addresses a single coverage, as in WCS): the coverage
sitting on a server (i.e., the resource), which has a unique id

> 7) logical subsets of data that may be named

given by coordinates within the coverage, along each of its axes (optional)
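
To make "coordinates along each axis" concrete, a small Python sketch over a
toy 2-D grid follows. The data layout, axis names, and the trim() helper are
assumptions made for illustration and are not taken from the WCS model. Axes
without bounds keep their full extent, and the result only contains cells that
already existed in the original.

```python
from bisect import bisect_left, bisect_right

# A toy 2-D "coverage": sorted coordinates per axis plus a grid of values.
# Axis names and values are made up for illustration.
coverage = {
    "axes": {"Lat": [40, 45, 50, 55], "Long": [0, 10, 20]},
    "order": ["Lat", "Long"],
    "values": [
        [1, 2, 3],
        [4, 5, 6],
        [7, 8, 9],
        [10, 11, 12],
    ],
}


def index_range(coords, bounds):
    """Return the index slice covering low <= coordinate <= high."""
    if bounds is None:  # axis not constrained: keep the full extent
        return slice(0, len(coords))
    low, high = bounds
    return slice(bisect_left(coords, low), bisect_right(coords, high))


def trim(cov, **bounds):
    """Subset a 2-D coverage by coordinate bounds given per axis."""
    row_axis, col_axis = cov["order"]
    rows = index_range(cov["axes"][row_axis], bounds.get(row_axis))
    cols = index_range(cov["axes"][col_axis], bounds.get(col_axis))
    return {
        "axes": {
            row_axis: cov["axes"][row_axis][rows],
            col_axis: cov["axes"][col_axis][cols],
        },
        "order": cov["order"],
        "values": [row[cols] for row in cov["values"][rows]],
    }


# Constrain only the Lat axis; Long keeps its full extent.
subset = trim(coverage, Lat=(45, 50))
print(subset["axes"]["Lat"], subset["values"])
```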

> 8) actual subset retrieved by a query (identifier of a query result)

note these are 2 distinct things: the subset and its identifier.
No result identifier is provided by WCS, as there is no global model ->
playground for innovation in W3C.
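
Just to illustrate keeping the subset and its identifier apart, here is a
Python sketch of a registry that maps a minted identifier to a stored subset
description, with a replaceable layer that turns it into a concrete request
(the indirection Phil describes further down). All of it is hypothetical: the
URI, the record layout, and both request encodings are invented for the
example, and nothing like this registry is defined by WCS.

```python
# Hypothetical registry: maps a minted identifier to a stored subset
# description. Neither the URIs nor the record layout come from WCS.
registry = {}


def mint(identifier, coverage_id, bounds):
    """Record which subset a minted identifier denotes."""
    registry[identifier] = {"coverageId": coverage_id, "bounds": dict(bounds)}


def resolve(identifier, make_request):
    """Turn an identifier into a concrete request via a pluggable encoder.

    make_request is the replaceable layer: swapping it changes how the
    subset is fetched, not which subset the identifier denotes.
    """
    record = registry[identifier]
    return make_request(record["coverageId"], record["bounds"])


def kvp_request(coverage_id, bounds):
    """Encode the subset as a KVP-style query string (unescaped)."""
    subsets = "&".join(
        f"subset={axis}({lo},{hi})" for axis, (lo, hi) in bounds.items()
    )
    return f"request=GetCoverage&coverageId={coverage_id}&{subsets}"


def path_request(coverage_id, bounds):
    """Encode the same subset as a path-like string (made-up layout)."""
    parts = "/".join(f"{axis}/{lo}/{hi}" for axis, (lo, hi) in bounds.items())
    return f"/coverages/{coverage_id}/{parts}"


mint("http://example.org/id/subset/42", "AvgLandTemp", {"Lat": (40, 50)})
print(resolve("http://example.org/id/subset/42", kvp_request))
print(resolve("http://example.org/id/subset/42", path_request))
```

The identifier stays the same in both calls; only the way the request is
encoded differs.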

-Peter

>
> AFAICT these are all different things - but may be tightly coupled in some
> cases: such as a simple query on dimensions of a set - in which case the
> logical query can be used as a proxy for most of the other things - but in
> other cases these need separate metadata and citation. 
>
> How do we handle the simple cases without making the general case an exercise
> in ad-hoc overloading or extension of the model?
>
> It doesn't feel to me that there is a neat web-developer-friendly story yet.
> The distinctions between different types of things and how they map onto the
> Web architecture _and_ the requirements would at least need a worked example
> to show how things relate - and this really does mean choosing or developing
> an ontology for this. We may decide our expressiveness is limited to isPartOf
>  - but that then means that only certain types of queries make sense - which
> may be OK but needs explanation. We could point to set theory and say "at this
> point we need an ontology to handle these concepts - this is future work". I
> don't think it's useful to say "you just gotta embed some links" - if you look
> at implementations of Linked Data, the spatial and topological relationships
> are variable - abusing semantics of other ontologies (sameAs especially) -
> it's still the Wild West, after a heavy night's drinking. IMHO the binding of
> services that support queries onto sets of data is simply too fundamental a
> concern to leave interoperability effectively unsupported.
>
> Cheers
> Rob
>
>
> On Sat, 2 Jan 2016 at 20:17 Peter Baumann <p.baumann@jacobs-university.de
> <mailto:p.baumann@jacobs-university.de>> wrote:
>
>     looking at queries is a nicely general approach (which I like), it is just
>     that this transcends subsetting:
>     Subset = set of elements which already exist in the object (ex: vectors
>     from a vector bundle)
>     Query in addition includes
>     - fusion = combination of more than one object, such as image overlay
>     - aggregation = delivering scalars, something maybe not in the original
>     object (such as a feature bundle, which is not a scalar) -> type change
>     - any other type of processing (such as rasterizing vectors, or
>     vectorizing rasters) -> type change
>
>     Note that this narrow definition of subset includes an OGC WFS / Filter
>     Encoding right away, whereas the "extended view" does not.
>
>
>     -Peter
>
>
>
>     On 2016-01-02 01:10, Simon.Cox@csiro.au <mailto:Simon.Cox@csiro.au> wrote:
>>     > to be persistent, identifiers should not include queries against a
>>     > specific API or query endpoint.
>>
>>     For sure. I didn't say anything about the form of the query. It may not
>>     even look like a query. Opensearch is an obvious model for
>>     implementation-independent syntax (after all it's just key-value pairs).
>>
>>     However, I do think it is worth keeping the notion of subset=query result
>>     in view. Sure, some query results may be more persistent and therefore
>>     worthy of denotation with a special identifier. But the same subset will
>>     also be the result of some query anyway. That's just an example of
>>     non-unique identifiers.
>>
>>     Simon J D Cox
>>     Research Scientist
>>     Environmental Information Infrastructures
>>     Land and Water
>>     CSIRO
>>
>>     E simon.cox@csiro.au T +61 3 9545 2365 M +61 403 302 672
>>     Physical: Reception Central, Bayview Avenue, Clayton, Vic 3168
>>     Deliveries: Gate 3, Normanby Road, Clayton, Vic 3168
>>     Postal: Private Bag 10, Clayton South, Vic 3169
>>     http://people.csiro.au/Simon-Cox
>>     http://orcid.org/0000-0002-3884-3420
>>     http://researchgate.net/profile/Simon_Cox3
>>
>>     --------------------------------------------------------------------------------
>>     *From:* Phil Archer
>>     *Sent:* Friday, 1 January 2016 9:05:25 AM
>>     *To:* Cox, Simon (L&W, Clayton); public-sdw-comments@w3.org
>>     <mailto:public-sdw-comments@w3.org>; public-dwbp-comments@w3.org
>>     <mailto:public-dwbp-comments@w3.org>
>>     *Subject:* Re: Subsetting data
>>
>>
>>
>>     On 30/12/2015 21:26, Simon.Cox@csiro.au <mailto:Simon.Cox@csiro.au> wrote:
>>     > Another way of looking at it is that a query, encoded as a URI pattern,
>>     > defines an implicit set of potential URIs, each of which denotes a subset.
>>
>>     True, but to be persistent, identifiers should not include queries
>>     against a specific API or query endpoint. That, for me, is the key
>>     point. OpenSearch provides a model where a query is included in a URL
>>     that can be considered persistent because there is a layer of
>>     indirection that could be changed without the URL changing, but a URL
>>     that includes a SQL or SPARQL query directly must be considered
>>     ephemeral IMO.
>>
>>     Phil
>>
>>
>>     > ________________________________
>>     > From: Phil Archer
>>     > Sent: Wednesday, 30 December 2015 6:31:16 PM
>>     > To: Manolis Koubarakis; 'public-sdw-comments@w3.org'; Annette Greiner;
>>     > Eric Stephan; Tandy, Jeremy; public-dwbp-comments@w3.org
>>     > Subject: Subsetting data
>>     >
>>     > At various times in recent months I have promised to look into the topic
>>     > of persistent identifiers for subsets of data. This came up at the SDW
>>     > F2F in Sapporo but has also been raised by Annette in DWBP. In between
>>     > festive activities I've been giving this some thought and have tried to
>>     > begin to commit some ideas to a page [1].
>>     >
>>     > During the CEO-LD meeting, Jeremy pointed to OpenSearch as a possible
>>     > way forward, including its geo-temporal extensions defined by the OGC.
>>     > There is also the Linked Data API as a means of doing this, and what
>>     > they both have in common is that they offer an intermediate layer that
>>     > turns a URL into a query.
>>     >
>>     > How do you define a persistent identifier for a subset of a dataset? IMO
>>     > you mint a URI and say "this identifies a subset of a dataset" - and
>>     > then provide a means of programmatically going from the URI to a query
>>     > that returns the subset. As long as you can replace the intermediate
>>     > layer with another one that also returns the same subset, we're done.
>>     >
>>     > The UK Government Linked Data examples tend to be along the lines of:
>>     >
>>     > http://transport.data.gov.uk/id/stations
>>     > returns a list of all stations in Britain.
>>     >
>>     > http://transport.data.gov.uk/id/stations/Manchester
>>     > returns a list of stations in Manchester
>>     >
>>     > http://transport.data.gov.uk/id/stations/Manchester/Piccadilly
>>     > identifies Manchester Piccadilly station.
>>     >
>>     > All of that data of course comes from a single dataset.
>>     >
>>     > Does this work in the real worlds of meteorology and UBL/PNNL?
>>     >
>>     > Phil.
>>     >
>>     >
>>     >
>>     >
>>     > [1] https://github.com/w3c/sdw/blob/gh-pages/subsetting/index.md
>>     >
>>     >
>>     >
>>     >
>>     > --
>>     >
>>     >
>>     > Phil Archer
>>     > W3C Data Activity Lead
>>     > http://www.w3.org/2013/data/
>>     >
>>     > http://philarcher.org
>>     > +44 (0)7887 767755
>>     > @philarcher1
>>     >
>>     >

-- 
Dr. Peter Baumann
 - Professor of Computer Science, Jacobs University Bremen
   www.faculty.jacobs-university.de/pbaumann
   mail: p.baumann@jacobs-university.de
   tel: +49-421-200-3178, fax: +49-421-200-493178
 - Executive Director, rasdaman GmbH Bremen (HRB 26793)
   www.rasdaman.com, mail: baumann@rasdaman.com
   tel: 0800-rasdaman, fax: 0800-rasdafax, mobile: +49-173-5837882
"Si forte in alienas manus oberraverit hec peregrina epistola incertis ventis dimissa, sed Deo commendata, precamur ut ei reddatur cui soli destinata, nec preripiat quisquam non sibi parata." (mail disclaimer, AD 1083)
Received on Sunday, 3 January 2016 09:22:14 UTC
