RE: Subsetting data

Interesting discussion.

 

It seems to me that there are (at least) two different types of subset, which I think is the distinction Rob was addressing:

 

1. Subsets that are in some way stable: for example, a worksheet in a spreadsheet workbook has identity, a name and other more or less fixed characteristics.

2. Subsets that are ephemeral, as in the case of query results.

 

For the first type, URIs make sense; a subset of a dataset could be seen as just another dataset that is related to the bigger dataset by an isPartOf relationship.
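A minimal sketch of that first case, with the relationship stated explicitly rather than inferred from the shape of the URI. The example.org URIs are hypothetical, and using the Dublin Core Terms property for isPartOf is an assumption; the thread does not name a vocabulary.

```python
# Triples are plain (subject, predicate, object) tuples; no RDF library is needed.
# Assumption: isPartOf is taken from Dublin Core Terms.
DCTERMS_IS_PART_OF = "http://purl.org/dc/terms/isPartOf"

# Hypothetical URIs, for illustration only.
WORKBOOK = "http://example.org/dataset/workbook"
SHEET_1 = "http://example.org/dataset/workbook/sheet1"
SHEET_2 = "http://example.org/dataset/workbook/sheet2"

triples = [
    (SHEET_1, DCTERMS_IS_PART_OF, WORKBOOK),
    (SHEET_2, DCTERMS_IS_PART_OF, WORKBOOK),
]


def parts_of(dataset_uri, graph):
    """Return the URIs of all datasets declared to be part of dataset_uri."""
    return sorted(
        s for s, p, o in graph if p == DCTERMS_IS_PART_OF and o == dataset_uri
    )


print(parts_of(WORKBOOK, triples))
```

The lookup relies only on the declared triples, so it would behave the same whatever URI pattern a publisher happens to use.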

By the way, I think that URI patterns may help a publisher to generate the URIs, but they won’t help the user to understand how a subset relates to a bigger entity: you cannot expect standard behaviour across publishers.

 

For the second type of subset, I agree with Dan that a URI for something that is not stable is of questionable value. In that case, wouldn’t the URI identify the query rather than its results?

 

Makx.

 

 

 

From: Dan Brickley [mailto:danbri@google.com] 
Sent: 31 December 2015 12:09
To: Clemens Portele <portele@interactive-instruments.de>; Rob Atkinson <rob@metalinkage.com.au>
Cc: Phil Archer <phila@w3.org>; Simon Cox <Simon.Cox@csiro.au>; amgreiner@lbl.gov; ericphb@gmail.com; jeremy.tandy@metoffice.gov.uk; koubarak@di.uoa.gr; public-dwbp-comments@w3.org; public-sdw-comments@w3.org
Subject: Re: Subsetting data

 

 

Isn't a "subset" just a query result, of which there are effectively an unlimited number?

 

Storing a query so it can be re-run against evolving data has value. Having a URI for that, perhaps less so.

 

Dan

On Thu, 31 Dec 2015, 08:14 Clemens Portele <portele@interactive-instruments.de> wrote:

Rob, 

 

what you describe seems to apply to the dataset (resource) the same way it would apply to any subset resource. I.e. are you discussing a more general question, not the subsetting question?

 

Phil,

 

a (probably often unproblematic) restriction of the temperature/uk/london or stations/manchester approach is that there is only one path, which limits the subsets you can offer. If you want to support multiple subsets, e.g. also stations where high speed trains stop, stations that have a ticket shop, etc., then there are several issues with a /{dataset}/{subset}/…/{subset}/{object} approach. These include an unclear URI scheme ("manchester" and "eurostar" would be on the same path level), potential name collisions between subset names from different subsetting categories, and multiple URIs for the same feature/object.
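To make the last two issues concrete, here is a rough sketch that enumerates the paths one object could end up with and checks two subsetting categories for shared names. The subset names, the categories and the helper `path_uris` are all hypothetical; nothing here is taken from an actual publisher's scheme.

```python
from itertools import permutations


def path_uris(dataset, subsets, obj):
    """Return every /{dataset}/{subset}/.../{object} path for one object."""
    uris = set()
    for size in range(len(subsets) + 1):
        for combo in permutations(subsets, size):
            uris.add("/" + "/".join([dataset, *combo, obj]))
    return sorted(uris)


# Hypothetical subset names grouped by subsetting category.
categories = {
    "city": {"manchester", "victoria"},
    "line": {"victoria", "jubilee"},
}

# One station that belongs to a city subset and a ticket-shop subset.
uris = path_uris("stations", ["manchester", "ticket-shop"], "piccadilly")

# A name used in two categories is ambiguous as a single path segment.
collisions = categories["city"] & categories["line"]

for uri in uris:
    print(uri)
print("ambiguous subset names:", sorted(collisions))
```

With only two subsets the same station already has five candidate paths, and the count grows quickly as categories are added.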

 

Best regards,

Clemens

 

 

On 31 Dec 2015, at 03:07, Rob Atkinson <rob@metalinkage.com.au> wrote:

 

I'm not a strong set-theoretician - but it strikes me there are some tensions here:

 

Does the identifier of a set mean that the members of that set are constant, known in advance and always retrievable? Is a query endpoint a resource (does either URI or URL have meaning against a query that delivers real time data - including the use case of "at this point in time we think these things are members of this set")?

 

If the subset is the result of a query - and you care that it is the same subset another time you look at it - are you actually assigning an identifier to the artefact - which is the query response, whose properties include the original query, where it was made, and the time it was made?
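As a sketch of what identifying the artefact could look like, the record below carries the three properties mentioned (the original query, where it was made, when it was made) plus the returned bytes. Deriving the identifier from a hash, the `urn:example:` prefix, the endpoint URL and the field names are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
import hashlib


@dataclass(frozen=True)
class QueryResponse:
    """A query response treated as an artefact in its own right."""

    query: str             # the original query
    endpoint: str          # where it was made
    executed_at: datetime  # when it was made
    body: bytes            # what came back

    def identifier(self) -> str:
        # Assumption: the identifier is a hash over all four properties.
        digest = hashlib.sha256()
        for part in (self.query, self.endpoint, self.executed_at.isoformat()):
            digest.update(part.encode("utf-8"))
            digest.update(b"\x00")  # separator so fields cannot run together
        digest.update(self.body)
        return "urn:example:response:" + digest.hexdigest()[:16]


# Hypothetical endpoint and payload.
ENDPOINT = "http://example.org/query"
first = QueryResponse(
    "city=Manchester", ENDPOINT,
    datetime(2015, 12, 31, 3, 7, tzinfo=timezone.utc), b"[...]",
)
later = QueryResponse(
    "city=Manchester", ENDPOINT,
    datetime(2016, 1, 1, 3, 7, tzinfo=timezone.utc), b"[...]",
)

print(first.identifier())
print(later.identifier())
```

The same query run at a later time gets a different identifier even if the body is unchanged, which matches the reading that the identifier names the response and not the query.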

 

Can you define an ontology for terms like subset, query, response that you all agree on?

 

I share Phil's implicit concern that subsetting by type with URI patterns may not be universally applicable - IMHO that equates to a "sub-register" pattern, where a set has its members defined by some identifiable process (independent of any query functions available) - which may include explicit subsets - for example by object type, or delegated registration processes. That probably fits the UK implementation better than a query-defined subset.

 

If subsets have some prior meaning - and a query is used to access them from a service endpoint - then the query is a URL that needs to be bound to the object URI. AFAICT that's a very different thing to saying an arbitrary query result defines a subset of data.

 

I think you may, in general, assign an ID to the artefact which is the result of a query at a given time, and if you want to make that into something with more semantics then you need to make it into a new type of object which can be described in terms of what it means. I think currently the conversation is conflating these two perspectives of "subset".

 

Cheers, and farewell to 2015.

Rob Atkinson.

 

 

 

 

On Thu, 31 Dec 2015 at 08:26 <Simon.Cox@csiro.au> wrote:

Another way of looking at it is that a query, encoded as a URI pattern, defines an implicit set of potential URIs, each of which denotes a subset. 
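A small sketch of that idea, assuming a pattern written with Python's `str.format` placeholders and a finite list of allowed values per placeholder. The pattern, the values and the helper `expand` are hypothetical; with open-ended values (for example arbitrary time ranges) the set could not be enumerated like this.

```python
from itertools import product


def expand(pattern, values):
    """List every URI the pattern can produce from the allowed values."""
    names = sorted(values)
    return [
        pattern.format(**dict(zip(names, combo)))
        for combo in product(*(values[name] for name in names))
    ]


# Hypothetical pattern and value lists.
PATTERN = "http://example.org/temperature/{country}/{city}"
subset_uris = expand(PATTERN, {"country": ["uk"], "city": ["london", "manchester"]})

for uri in subset_uris:
    print(uri)
```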

Simon J D Cox
Environmental Informatics
CSIRO Land and Water

E simon.cox@csiro.au  T +61 3 9545 2365 M +61 403 302 672
Physical: Central Reception, Bayview Avenue, Clayton, Vic 3168
Deliveries: Gate 3, Normanby Road, Clayton, Vic 3168
Postal: Private Bag 10, Clayton South, Vic 3169
http://people.csiro.au/Simon-Cox
http://orcid.org/0000-0002-3884-3420
http://researchgate.net/profile/Simon_Cox3 

 

  _____  

From: Phil Archer
Sent: Wednesday, 30 December 2015 6:31:16 PM
To: Manolis Koubarakis; 'public-sdw-comments@w3.org'; Annette Greiner; Eric Stephan; Tandy, Jeremy; public-dwbp-comments@w3.org
Subject: Subsetting data

At various times in recent months I have promised to look into the topic 
of persistent identifiers for subsets of data. This came up at the SDW 
F2F in Sapporo but has also been raised by Annette in DWBP. In between 
festive activities I've been giving this some thought and have tried to 
begin to commit some ideas to a page [1].

During the CEO-LD meeting, Jeremy pointed to OpenSearch as a possible 
way forward, including its geo-temporal extensions defined by the OGC. 
There is also the Linked Data API as a means of doing this, and what 
they both have in common is that they offer an intermediate layer that 
turns a URL into a query.

How do you define a persistent identifier for a subset of a dataset? IMO 
you mint a URI and say "this identifies a subset of a dataset" - and 
then provide a means of programmatically going from the URI to a query 
that returns the subset. As long as you can replace the intermediate 
layer with another one that also returns the same subset, we're done.

The UK Government Linked Data examples tend to be along the lines of:

http://transport.data.gov.uk/id/stations
returns a list of all stations in Britain.

http://transport.data.gov.uk/id/stations/Manchester
returns a list of stations in Manchester

http://transport.data.gov.uk/id/stations/Manchester/Piccadilly
identifies Manchester Piccadilly station.

All of that data of course comes from a single dataset.
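The intermediate-layer idea can be sketched roughly as follows: one function turns the URL path into a query, and two interchangeable layers answer that query from the same single dataset. The station records other than Manchester Piccadilly, the field names and both layer implementations are assumptions for illustration, not how transport.data.gov.uk works. The sketch also returns a one-element list for the third path, whereas the URI above identifies the station itself.

```python
# Illustrative sample records standing in for the single underlying dataset.
STATIONS = [
    {"city": "Manchester", "name": "Piccadilly"},
    {"city": "Manchester", "name": "Victoria"},
    {"city": "London", "name": "Euston"},
]


def path_to_query(path):
    """Translate /id/stations[/city[/name]] into field constraints."""
    segments = [s for s in path.strip("/").split("/") if s]
    if segments[:2] != ["id", "stations"] or len(segments) > 4:
        raise ValueError(f"unsupported path: {path}")
    return dict(zip(("city", "name"), segments[2:]))


def scan_layer(path):
    """First intermediate layer: filter the whole dataset on every request."""
    query = path_to_query(path)
    return [s for s in STATIONS if all(s[k] == v for k, v in query.items())]


def indexed_layer(path):
    """Replacement layer: look up by city first, then filter by name."""
    query = path_to_query(path)
    by_city = {}
    for station in STATIONS:
        by_city.setdefault(station["city"], []).append(station)
    candidates = by_city.get(query["city"], []) if "city" in query else list(STATIONS)
    return [s for s in candidates if "name" not in query or s["name"] == query["name"]]


for path in ("/id/stations", "/id/stations/Manchester", "/id/stations/Manchester/Piccadilly"):
    print(path, len(scan_layer(path)), scan_layer(path) == indexed_layer(path))
```

Because both layers return the same subset for the same path, either could sit behind the URI without the identifier changing.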

Does this work in the real worlds of meteorology and LBL/PNNL?

Phil.




[1] https://github.com/w3c/sdw/blob/gh-pages/subsetting/index.md




-- 


Phil Archer
W3C Data Activity Lead
http://www.w3.org/2013/data/

http://philarcher.org
+44 (0)7887 767755
@philarcher1

 

Received on Thursday, 31 December 2015 13:35:50 UTC