Re: Subsetting data

All,
To clarify my own concern about subsetting, I was suggesting a best practice (BP) that says it is a good idea to make subsets of data available. My suggestion was simply that data should be made available to its users in useful chunks smaller than a download of the entire dataset. Whether that is implemented by querying a database or not is immaterial to the best practice itself. How the chunks are composed and what hierarchy is used are determined by the kind of data, the schema, the anticipated use cases, and how you choose to release the data.

I think it’s also a good idea to make the subsets available via unique URIs, though here again I think we go too far if we do more than point to the best practice. It’s fine to offer an example (or two) in a given context, but I think it is beyond our scope to tell people how to build their API or classify data into subsets. Those are data management and architecture issues that we can’t hope to address fully in the implementation section of one best practice.
-Annette
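
As a rough illustration only of the idea discussed in this thread (a unique URI per subset, with an intermediate layer that turns the URI into a query), here is a minimal Python sketch. It borrows the station URI pattern from the quoted message below. The sample records, the `PREFIX` constant, and the `resolve_subset` function are hypothetical names made up for this sketch and do not describe how transport.data.gov.uk is actually implemented.

```python
from urllib.parse import urlparse

# Illustrative sample records only; not taken from the real dataset.
STATIONS = [
    {"city": "Manchester", "name": "Piccadilly"},
    {"city": "Manchester", "name": "Victoria"},
    {"city": "Leeds", "name": "Leeds"},
]

# Assumed path prefix, mirroring the URI pattern in the quoted examples.
PREFIX = "/id/stations"


def resolve_subset(uri, records=STATIONS):
    """Map a subset URI to a filter over the records (the 'intermediate layer')."""
    path = urlparse(uri).path.rstrip("/")
    rest = path[len(PREFIX):]
    if not path.startswith(PREFIX) or (rest and not rest.startswith("/")):
        raise ValueError(f"not a station subset URI: {uri}")

    # Each extra path segment narrows the subset: first city, then station name.
    segments = [s for s in rest.split("/") if s]
    if len(segments) > 2:
        raise ValueError(f"too many path segments: {uri}")
    criteria = dict(zip(("city", "name"), segments))

    return [r for r in records
            if all(r[key] == value for key, value in criteria.items())]


if __name__ == "__main__":
    base = "http://transport.data.gov.uk/id/stations"
    for uri in (base, base + "/Manchester", base + "/Manchester/Piccadilly"):
        print(uri, "->", resolve_subset(uri))
```

In this sketch the most specific URI yields a one-element list, whereas in the quoted example that URI identifies the station itself. The filter behind `resolve_subset` could be swapped for a database query or an API call without changing the URIs, provided it returns the same subset.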


> On Dec 30, 2015, at 10:31 AM, Phil Archer <phila@w3.org> wrote:
> 
> Another way of looking at it is that a query, encoded as a URI pattern, defines an implicit set of potential URIs, each of which denotes a subset. 
> 
> Simon J D Cox
> Environmental Informatics
> CSIRO Land and Water
> 
> E simon.cox@csiro.au T +61 3 9545 2365 M +61 403 302 672
> Physical: Central Reception, Bayview Avenue, Clayton, Vic 3168
> Deliveries: Gate 3, Normanby Road, Clayton, Vic 3168
> Postal: Private Bag 10, Clayton South, Vic 3169
> http://people.csiro.au/Simon-Cox
> http://orcid.org/0000-0002-3884-3420
> http://researchgate.net/profile/Simon_Cox3
>  
> From: Phil Archer
> Sent: Wednesday, 30 December 2015 6:31:16 PM
> To: Manolis Koubarakis; 'public-sdw-comments@w3.org'; Annette Greiner; Eric Stephan; Tandy, Jeremy; public-dwbp-comments@w3.org
> Subject: Subsetting data
> 
> At various times in recent months I have promised to look into the topic 
> of persistent identifiers for subsets of data. This came up at the SDW 
> F2F in Sapporo but has also been raised by Annette in DWBP. In between 
> festive activities I've been giving this some thought and have tried to 
> begin to commit some ideas to a page [1].
> 
> During the CEO-LD meeting, Jeremy pointed to OpenSearch as a possible 
> way forward, including its geo-temporal extensions defined by the OGC. 
> There is also the Linked Data API as a means of doing this, and what 
> they both have in common is that they offer an intermediate layer that 
> turns a URL into a query.
> 
> How do you define a persistent identifier for a subset of a dataset? IMO 
> you mint a URI and say "this identifies a subset of a dataset" - and 
> then provide a means of programmatically going from the URI to a query 
> that returns the subset. As long as you can replace the intermediate 
> layer with another one that also returns the same subset, we're done.
> 
> The UK Government Linked Data examples tend to be along the lines of:
> 
> http://transport.data.gov.uk/id/stations
> returns a list of all stations in Britain.
> 
> http://transport.data.gov.uk/id/stations/Manchester
> returns a list of stations in Manchester
> 
> http://transport.data.gov.uk/id/stations/Manchester/Piccadilly
> identifies Manchester Piccadilly station.
> 
> All of that data of course comes from a single dataset.
> 
> Does this work in the real worlds of meteorology and LBL/PNNL?
> 
> Phil.
> 
> [1] https://github.com/w3c/sdw/blob/gh-pages/subsetting/index.md
> 
> -- 
> 
> 
> Phil Archer
> W3C Data Activity Lead
> http://www.w3.org/2013/data/
> 
> http://philarcher.org
> +44 (0)7887 767755
> @philarcher1
> 

Received on Sunday, 3 January 2016 21:30:50 UTC