
Re: Subsetting data

From: Phil Archer <phila@w3.org>
Date: Fri, 1 Jan 2016 09:26:37 +0000
To: Peter Baumann <p.baumann@jacobs-university.de>, Clemens Portele <portele@interactive-instruments.de>, Rob Atkinson <rob@metalinkage.com.au>
Cc: Simon Cox <Simon.Cox@csiro.au>, koubarak@di.uoa.gr, public-sdw-comments@w3.org, amgreiner@lbl.gov, ericphb@gmail.com, jeremy.tandy@metoffice.gov.uk, public-dwbp-comments@w3.org
Message-ID: <5686464D.9010406@w3.org>


On 31/12/2015 09:33, Peter Baumann wrote:
> Hi all,
>
> there is work already in this realm which might be useful.
>
> - Stephan Proell has been working on subset identifiers in the context of RDA.

That's interesting. Can you put us in touch, please? I'll be engaging 
more fully with the RDA as of now and hope to go to the Japan plenary.

>
> - In the context of data/metadata linking there is work on connecting arrays
> into tables (ie, relational -> ISO SQL/MDA [1]), into hierarchies (ie, XML ->
> OGC WCPS [2]) and into RDF (have to find the paper). This allows subsets to be
> determined via the respective query mechanism, which I consider the most general
> way. As is the case with URLs already, different queries can point to the same
> result = "subset". Path expressions, as Phil used in his example, are one way of
> expressing subsets of composite entities.

This feels like live queries. Nothing wrong with that of course, but I'm 
trying to focus on persistent IDs for 'typical subsets' like 'latest 
satellite image of location X.'
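
To make the distinction concrete, here is a rough sketch in Python of a persistent ID that sits in front of a live query. Everything in it is an assumption for illustration: the registry, the ID path, the example.org endpoint and the parameter names are all made up, and none of it comes from OpenSearch, the Linked Data API or any real service.

```python
from urllib.parse import urlencode

# Hypothetical registry: persistent subset ID -> the query that currently
# realises it. The ID, endpoint and parameter names are invented.
SUBSET_REGISTRY = {
    "/id/imagery/edinburgh/latest": {
        "endpoint": "http://example.org/search",
        "params": {"place": "Edinburgh", "sort": "-time", "count": 1},
    },
}


def resolve(subset_id: str) -> str:
    """Turn a persistent subset ID into the live query URL behind it."""
    entry = SUBSET_REGISTRY[subset_id]
    return entry["endpoint"] + "?" + urlencode(entry["params"])


print(resolve("/id/imagery/edinburgh/latest"))
```

The ID is the thing that persists. The registry entry behind it can be swapped for a different query layer later, provided the new one returns the same subset.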

>
> Generally, "subsetting" can mean many, many things. In the most basic case it
> denotes identifying a part of a coverage that is a coverage again: spatial and
> temporal subsetting in WCS Core, and also range subsetting, ie: extraction of
> bands/channels/variables from a coverage, resulting in a coverage again. With
> more general options, this can be transcended - such as retrieving _sets_ of
> pixels from an image _matrix_. You can replace "coverage" with anything where
> you wish to maintain some particular properties (array, set uniqueness,
> hierarchy, closure under a given ontology, ...).

So you'd bake some dimensions into the URI and they could persist even 
when your great grand child writes WCS Core 27.0 in 2216 ;-)
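
As a minimal sketch of what baking dimensions into the URI could look like, assuming a fixed and entirely hypothetical dimension order (variable, country, city) chosen when the URI is minted, and borrowing the temperature/uk/london shape from Clemens's message:

```python
# Hypothetical URI template: the dimension order is fixed at minting time.
DIMENSIONS = ("variable", "country", "city")


def parse_subset_path(path: str) -> dict:
    """Map a path such as '/id/temperature/uk/london' to named dimensions."""
    segments = [s for s in path.split("/") if s]
    if not segments or segments[0] != "id":
        raise ValueError("not a subset identifier: " + path)
    values = segments[1:]
    if len(values) > len(DIMENSIONS):
        raise ValueError("too many path segments: " + path)
    # A shorter path simply leaves the trailing dimensions unconstrained.
    return dict(zip(DIMENSIONS, values))


print(parse_subset_path("/id/temperature/uk/london"))
print(parse_subset_path("/id/temperature/uk"))
```

The resulting dictionary is independent of whatever service answers the query, so the same URI can be handed to a different back end later. It also shows the limitation of a single path: each level can carry only one kind of subset.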

>
> re change of a subset target over time: that is of course always the case, any
> resource to which a URL points can change so this does not add substantial new
> problems. A subset may even yield an empty result at some time (such as maybe
> /UK/Edinburgh or EU/UK at some time ;-) ).

Indeed, yes.

Cheers

Phil

>
> re clashes etc: what you are talking about below is not subsetting, but fusion
> (a "join" or "union"). This is a different mechanism with different rules (cf
> ontology matching when merging two ontologies).
>
> Happy 2016,
> Peter
>
> [1] D. Misev, P. Baumann: /Extending the SQL Array Concept to Support Scientific
> Analytics/. Proc. Intl. Conf. on Scientific and Statistical Database Management
> (SSDBM'2014), June 30 - July 2, 2014, Aalborg, Denmark, paper #10
> [2] P. Baumann: The OGC Web Coverage Processing Service (WCPS) Standard.
> Geoinformatica, 14(4)2010, pp 447-479
>
>
> On 2015-12-31 09:07, Clemens Portele wrote:
>> Rob,
>>
>> what you describe seems to apply to the dataset (resource) the same way it
>> would apply to any subset resource. I.e. are you discussing a more general
>> question, not the subsetting question?
>>
>> Phil,
>>
>> a (probably often unproblematic) restriction to the temperature/uk/london or
>> stations/manchester approach is that there is only one path, so you end up
>> with limitations on the subsets. If you want to support multiple subsets, e.g.
>> also stations where high speed trains stop, stations that have a ticket shop,
>> etc. then there are several issues with a
>> /{dataset}/{subset}/…/{subset}/{object} approach. These include an unclear URI
>> scheme ("manchester" and "eurostar" would be on the same path level),
>> potential name collisions of subset names of different subsetting categories,
>> and multiple URIs for the same feature/object.
>>
>> Best regards,
>> Clemens
>>
>>
>>> On 31 Dec 2015, at 03:07, Rob Atkinson <rob@metalinkage.com.au
>>> <mailto:rob@metalinkage.com.au>> wrote:
>>>
>>> I'm not a strong set-theoretician - but it strikes me there are some tensions
>>> here:
>>>
>>> Does the identifier of a set mean that the members of that set are constant,
>>> known in advance and always retrievable?   Is a query endpoint a resource
>>> (does either URI or URL have meaning against a query that delivers real time
>>> data - including the use case of "at this point in time we think these things
>>> are members of this set?" )
>>>
>>> If the subset is the result of a query - and you care that it is the same
>>> subset another time you look at it - are you actually assigning an identifier
>>> to the artefact - which is the query response, whose properties include the
>>> original query, where it was made, and the time it was made?
>>>
>>> Can you define an ontology for terms like subset, query, response that you
>>> all agree on?
>>>
>>> I share Phil's implicit concern that subsetting by type with URI patterns may
>>> not be universally applicable - IMHO that equates to a "sub-register"
>>> pattern, where a set has its members defined by some identifiable process
>>> (independent of any query functions available) - which may include explicit
>>> subsets - for example by object type, or delegated registration processes.
>>> That probably fits the UK implementation better than a query-defined subset.
>>>
>>> If subsets have some prior meaning - and a query is used to access them from
>>> a service endpoint - then the query is a URL that needs to be bound to the
>>> object URI. AFAICT that's a very different thing to saying an arbitrary query
>>> result defines a subset of data.
>>>
>>> I think you may, in general, assign an ID to the artefact which is the result
>>> of a query at a given time, and if you want to make that into something with
>>> more semantics then you need to make it into a new type of object which can be
>>> described in terms of what it means. I think currently the conversation is
>>> conflating these two perspectives of "subset".
>>>
>>> Cheers, and farewell to 2015.
>>> Rob Atkinson.
>>>
>>>
>>>
>>>
>>> On Thu, 31 Dec 2015 at 08:26 <Simon.Cox@csiro.au <mailto:Simon.Cox@csiro.au>>
>>> wrote:
>>>
>>>      Another way of looking at it is that a query, encoded as a URI pattern,
>>>      defines an implicit set of potential URIs, each of which denotes a subset.
>>>
>>>      Simon J D Cox
>>>      Environmental Informatics
>>>      CSIRO Land and Water
>>>
>>>      E simon.cox@csiro.au <mailto:simon.cox@csiro.au> T +61 3 9545 2365 M +61
>>>      403 302 672
>>>      Physical: Central Reception, Bayview Avenue, Clayton, Vic 3168
>>>      Deliveries: Gate 3, Normanby Road, Clayton, Vic 3168
>>>      Postal: Private Bag 10, Clayton South, Vic 3169
>>>      http://people.csiro.au/Simon-Cox
>>>      http://orcid.org/0000-0002-3884-3420
>>>      http://researchgate.net/profile/Simon_Cox3
>>>
>>>      --------------------------------------------------------------------------------
>>>      *From:* Phil Archer
>>>      *Sent:* Wednesday, 30 December 2015 6:31:16 PM
>>>      *To:* Manolis Koubarakis; 'public-sdw-comments@w3.org
>>>      <mailto:public-sdw-comments@w3.org>'; Annette Greiner; Eric Stephan;
>>>      Tandy, Jeremy; public-dwbp-comments@w3.org
>>>      <mailto:public-dwbp-comments@w3.org>
>>>      *Subject:* Subsetting data
>>>
>>>      At various times in recent months I have promised to look into the topic
>>>      of persistent identifiers for subsets of data. This came up at the SDW
>>>      F2F in Sapporo but has also been raised by Annette in DWBP. In between
>>>      festive activities I've been giving this some thought and have tried to
>>>      begin to commit some ideas to a page [1].
>>>
>>>      During the CEO-LD meeting, Jeremy pointed to OpenSearch as a possible
>>>      way forward, including its geo-temporal extensions defined by the OGC.
>>>      There is also the Linked Data API as a means of doing this, and what
>>>      they both have in common is that they offer an intermediate layer that
>>>      turns a URL into a query.
>>>
>>>      How do you define a persistent identifier for a subset of a dataset? IMO
>>>      you mint a URI and say "this identifies a subset of a dataset" - and
>>>      then provide a means of programmatically going from the URI to a query
>>>      that returns the subset. As long as you can replace the intermediate
>>>      layer with another one that also returns the same subset, we're done.
>>>
>>>      The UK Government Linked Data examples tend to be along the lines of:
>>>
>>>      http://transport.data.gov.uk/id/stations
>>>      returns a list of all stations in Britain.
>>>
>>>      http://transport.data.gov.uk/id/stations/Manchester
>>>      returns a list of stations in Manchester
>>>
>>>      http://transport.data.gov.uk/id/stations/Manchester/Piccadilly
>>>      identifies Manchester Piccadilly station.
>>>
>>>      All of that data of course comes from a single dataset.
>>>
>>>      Does this work in the real worlds of meteorology and LBL/PNNL?
>>>
>>>      Phil.
>>>
>>>
>>>
>>>
>>>      [1] https://github.com/w3c/sdw/blob/gh-pages/subsetting/index.md
>>>
>>>
>>>
>>>
>>>      --
>>>
>>>
>>>      Phil Archer
>>>      W3C Data Activity Lead
>>>      http://www.w3.org/2013/data/
>>>
>>>      http://philarcher.org <http://philarcher.org/>
>>>      +44 (0)7887 767755
>>>      @philarcher1
>>>
>>
>

-- 


Phil Archer
W3C Data Activity Lead
http://www.w3.org/2013/data/

http://philarcher.org
+44 (0)7887 767755
@philarcher1
Received on Friday, 1 January 2016 09:26:05 UTC
