Re: Subsetting data from Krzysztof Janowicz on 2016-01-05 (public-dwbp-comments@w3.org from January 2016)

From: Krzysztof Janowicz <janowicz@ucsb.edu>
Date: Tue, 5 Jan 2016 10:13:45 -0800
To: Dan Brickley <danbri@google.com>, Clemens Portele <portele@interactive-instruments.de>, Rob Atkinson <rob@metalinkage.com.au>
Cc: Phil Archer <phila@w3.org>, Simon Cox <Simon.Cox@csiro.au>, amgreiner@lbl.gov, ericphb@gmail.com, jeremy.tandy@metoffice.gov.uk, koubarak@di.uoa.gr, public-dwbp-comments@w3.org, public-sdw-comments@w3.org
Message-ID: <568C07D9.6040209@ucsb.edu>

Hi Dan,

> Isn't a "subset" just a query result, of which there are effectively 
> an unlimited number?

I would say so.

> Storing a query so it can be re-run against evolving data has value. 
> Having a URI for that, perhaps less so.

There are many cases, particularly when dealing with streams of data, 
e.g., from sensors, where having URIs for subsets is very useful, and 
there are in fact many ways to provide them. Some years ago we implemented 
one such solution at 52North, where we developed a transparent (i.e., 
invisible) proxy on top of an endpoint that translates a URI minted in a 
specific way into a query and returns the matching results. For instance, the URI 
http://my.authority.org/observations/samplingtimes/ont:time:relation:between,2008-01-10T14:00,2008-01-12T16:00/sensors/thermometer1/observedproperties/temperature 
points to the observation collection with all temperature observations 
from January 10th 2008 at 2pm until January 12th at 4pm made by 
thermometer1. Having such URIs (and proxies) is one way of mitigating 
the problem that the content referenced by URIs should be as stable as 
possible, which is often not the case for sensor data. In our case we 
worked on transparently mapping between SPARQL and OGC's Sensor 
Observation Services, but the idea can be (and has been) used in many 
other settings.
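
To make the URI pattern above concrete, here is a minimal sketch of the 
first step such a proxy might take: splitting the path into named filters 
that could then be turned into a SPARQL or SOS request. This is not the 
52North implementation. The function name parse_subset_uri, the returned 
dictionary keys, and the assumption that the path alternates filter name 
and filter value after /observations are all hypothetical and chosen only 
for illustration.

```python
from datetime import datetime
from urllib.parse import urlsplit


def parse_subset_uri(uri):
    """Split a subset URI into the filters a proxy could turn into a query."""
    segments = [s for s in urlsplit(uri).path.split("/") if s]
    if not segments or segments[0] != "observations":
        raise ValueError("expected a path starting with /observations")

    # Assumption: after the collection name, the path alternates
    # filter name and filter value.
    rest = segments[1:]
    if len(rest) % 2:
        raise ValueError("every filter name needs a value")
    filters = dict(zip(rest[0::2], rest[1::2]))

    subset = {
        "sensor": filters.get("sensors"),
        "observed_property": filters.get("observedproperties"),
    }
    if "samplingtimes" in filters:
        # Assumed layout: <temporal relation>,<start>,<end>
        relation, start, end = filters["samplingtimes"].split(",")
        subset["time_relation"] = relation
        subset["start"] = datetime.fromisoformat(start)
        subset["end"] = datetime.fromisoformat(end)
    return subset


uri = ("http://my.authority.org/observations/samplingtimes/"
       "ont:time:relation:between,2008-01-10T14:00,2008-01-12T16:00/"
       "sensors/thermometer1/observedproperties/temperature")
subset = parse_subset_uri(uri)
print(subset["sensor"], subset["observed_property"], subset["time_relation"])
print(subset["start"].isoformat(), subset["end"].isoformat())
```

Because the time window is part of the URI itself, the same URI keeps 
denoting the same interval even as new observations arrive in the stream. 
Whether the returned collection is actually stable still depends on the 
backing service not revising past observations.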

Happy new year.
Krzysztof

On 12/31/2015 03:09 AM, Dan Brickley wrote:
>
> Isn't a "subset" just a query result, of which there are effectively 
> an unlimited number?
>
> Storing a query so it can be re-run against evolving data has value. 
> Having a URI for that, perhaps less so.
>
> Dan
>
> On Thu, 31 Dec 2015, 08:14 Clemens Portele 
> <portele@interactive-instruments.de 
> <mailto:portele@interactive-instruments.de>> wrote:
>
>     Rob,
>
>     what you describe seems to apply to the dataset (resource) the
>     same way it would apply to any subset resource. I.e. are you
>     discussing a more general question, not the subsetting question?
>
>     Phil,
>
>     a (probably often unproblematic) restriction to the
>     temperature/uk/london or stations/manchester approach is that
>     there is only one path, so you end up with limitations on the
>     subsets. If you want to support multiple subsets, e.g. also
>     stations where high speed trains stop, stations that have a ticket
>     shop, etc. then there are several issues with a
>     /{dataset}/{subset}/…/{subset}/{object} approach. These include an
>     unclear URI scheme ("manchester" and "eurostar" would be on the
>     same path level), potential name collisions of subset names of
>     different subsetting categories, and multiple URIs for the same
>     feature/object.
>
>     Best regards,
>     Clemens
>
>
>>     On 31 Dec 2015, at 03:07, Rob Atkinson <rob@metalinkage.com.au
>>     <mailto:rob@metalinkage.com.au>> wrote:
>>
>>     I'm not a strong set-theoretician - but it strikes me there are
>>     some tensions here:
>>
>>     Does the identifier of a set mean that the members of that set
>>     are constant, known in advance and always retrievable? Is a
>>     query endpoint a resource? (Does either URI or URL have meaning
>>     against a query that delivers real-time data, including the use
>>     case of "at this point in time we think these things are members
>>     of this set"?)
>>
>>     If the subset is the result of a query - and you care that it is
>>     the same subset another time you look at it - are you actually
>>     assigning an identifier to the artefact - which is the query
>>     response, whose properties include the original query, where it
>>     was made, and the time it was made?
>>
>>     Can you define an ontology for terms like subset, query, response
>>     that you all agree on?
>>
>>     I share Phil's implicit concern that subsetting by type with URI
>>     patterns may not be universally applicable - IMHO that equates to
>>     a "sub-register" pattern, where a set has its members defined by
>>     some identifiable process (independent of any query functions
>>     available) - which may include explicit subsets - for example by
>>     object type, or delegated registration processes. That probably
>>     fits the UK implementation better than a query-defined subset.
>>
>>     If subsets have some prior meaning - and a query is used to
>>     access them from a service endpoint - then the query is a URL that
>>     needs to be bound to the object URI. AFAICT that's a very
>>     different thing to saying an arbitrary query result defines a
>>     subset of data.
>>
>>     I think you may, in general, assign an ID to the artefact which
>>     is the result of a query at a given time, and if you want to make
>>     that into something with more semantics then you need to make it
>>     into a new type of object which can be described in terms of what
>>     it means. I think currently the conversation is conflating these
>>     two perspectives of "subset".
>>
>>     Cheers, and farewell to 2015.
>>     Rob Atkinson.
>>
>>
>>
>>
>>     On Thu, 31 Dec 2015 at 08:26 <Simon.Cox@csiro.au
>>     <mailto:Simon.Cox@csiro.au>> wrote:
>>
>>         Another way of looking at it is that a query, encoded as a
>>         URI pattern, defines an implicit set of potential URIs, each
>>         of which denotes a subset.
>>
>>         Simon J D Cox
>>         Environmental Informatics
>>         CSIRO Land and Water
>>
>>         E simon.cox@csiro.au <mailto:simon.cox@csiro.au> T +61 3 9545
>>         2365 M +61 403 302 672
>>         Physical: Central Reception, Bayview Avenue, Clayton, Vic 3168
>>         Deliveries: Gate 3, Normanby Road, Clayton, Vic 3168
>>         Postal: Private Bag 10, Clayton South, Vic 3169
>>         http://people.csiro.au/Simon-Cox
>>         http://orcid.org/0000-0002-3884-3420
>>         http://researchgate.net/profile/Simon_Cox3
>>         ------------------------------------------------------------------------
>>         *From:* Phil Archer
>>         *Sent:* Wednesday, 30 December 2015 6:31:16 PM
>>         *To:* Manolis Koubarakis; 'public-sdw-comments@w3.org
>>         <mailto:public-sdw-comments@w3.org>'; Annette Greiner; Eric
>>         Stephan; Tandy, Jeremy; public-dwbp-comments@w3.org
>>         <mailto:public-dwbp-comments@w3.org>
>>         *Subject:* Subsetting data
>>
>>         At various times in recent months I have promised to look into
>>         the topic of persistent identifiers for subsets of data. This
>>         came up at the SDW F2F in Sapporo but has also been raised by
>>         Annette in DWBP. In between festive activities I've been giving
>>         this some thought and have tried to begin to commit some ideas
>>         to a page [1].
>>
>>         During the CEO-LD meeting, Jeremy pointed to OpenSearch as a
>>         possible way forward, including its geo-temporal extensions
>>         defined by the OGC. There is also the Linked Data API as a means
>>         of doing this, and what they both have in common is that they
>>         offer an intermediate layer that turns a URL into a query.
>>
>>         How do you define a persistent identifier for a subset of a
>>         dataset? IMO you mint a URI and say "this identifies a subset of
>>         a dataset" - and then provide a means of programmatically going
>>         from the URI to a query that returns the subset. As long as you
>>         can replace the intermediate layer with another one that also
>>         returns the same subset, we're done.
>>
>>         The UK Government Linked Data examples tend to be along the
>>         lines of:
>>
>>         http://transport.data.gov.uk/id/stations
>>         returns a list of all stations in Britain.
>>
>>         http://transport.data.gov.uk/id/stations/Manchester
>>         returns a list of stations in Manchester
>>
>>         http://transport.data.gov.uk/id/stations/Manchester/Piccadilly
>>         identifies Manchester Piccadilly station.
>>
>>         All of that data of course comes from a single dataset.
>>
>>         Does this work in the real worlds of meteorology and LBL/PNNL?
>>
>>         Phil.
>>
>>
>>
>>
>>         [1] https://github.com/w3c/sdw/blob/gh-pages/subsetting/index.md
>>
>>
>>
>>
>>         -- 
>>
>>
>>         Phil Archer
>>         W3C Data Activity Lead
>>         http://www.w3.org/2013/data/
>>
>>         http://philarcher.org <http://philarcher.org/>
>>         +44 (0)7887 767755
>>         @philarcher1
>>
>


-- 
Krzysztof Janowicz

Geography Department, University of California, Santa Barbara
4830 Ellison Hall, Santa Barbara, CA 93106-4060

Email: jano@geog.ucsb.edu
Webpage: http://geog.ucsb.edu/~jano/
Semantic Web Journal: http://www.semantic-web-journal.net
Received on Tuesday, 5 January 2016 18:14:19 UTC