Re: Subsetting data

On 31/12/2015 10:54, Frans Knibbe wrote:
> Phil,
>
> Thank you for bringing up an interesting subject at a time when not much
> seems to be going on.
>
> I think a key question is: Which data should be returned when a dataset URI
> is dereferenced?
>
> And I think the answer should be: at least the metadata describing the
> dataset or the subset, and optionally the actual data.

I'd say: If I ask for the current temperature in Amsterdam, that's what 
I want. Good practice would be to include metadata, or links to it 
(dcterms:isPartOf <allTemperaturesInNL>).
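
A minimal sketch of what I mean, assuming a JSON-ish response. The
example.org URIs and the "temperatureCelsius" key are made up for
illustration; only dcterms:isPartOf is a real term.

```python
import json

DCTERMS = "http://purl.org/dc/terms/"


def temperature_response(subset_uri, parent_uri, celsius):
    """Return the requested value plus a link to the parent dataset."""
    return {
        "@id": subset_uri,
        # The data that was actually asked for (hypothetical key name).
        "temperatureCelsius": celsius,
        # Link to the wider dataset this value is part of.
        DCTERMS + "isPartOf": {"@id": parent_uri},
    }


doc = temperature_response(
    "http://example.org/temperature/NL/Amsterdam/current",  # hypothetical
    "http://example.org/temperature/NL",                     # hypothetical
    4.5,
)
print(json.dumps(doc, indent=2))
```

The value comes first; the link is what lets a client find the rest.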

I don't disagree that those things are important. Metadata clearly is,
and goodness knows I like links :-) It's a question of scoping and
capacity to deliver.

Phil

>
> When discussing datasets and subsets it is good to look at the Vocabulary
> of Interlinked Datasets (VoID) <http://www.w3.org/TR/void/>, although its
> scope could be too narrow because it is intended to be used for RDF data.
> It can be used to make clear that a chunk of data describes a dataset (
> void:Dataset <http://rdfs.org/ns/void#Dataset>) and has subsets (void:subset
> <http://www.w3.org/TR/void/#subset>). The Data Catalog Vocabulary
> <http://www.w3.org/TR/vocab-dcat/> has a broader scope (it can be used for
> any dataset) and has its own definition of a dataset (dcat:Dataset
> <http://www.w3.org/ns/dcat#Dataset>). DCAT does not seem to have a way of
> identifying subsets, but I guess dcterms:hasPart
> <http://purl.org/dc/terms/hasPart> and dcterms:isPartOf
> <http://purl.org/dc/terms/isPartOf> can be used to express parent-child
> relationships between data collections (dataset mereology).
>
> So let's assume it is possible to indicate that a set of data describe a
> dataset and that it is possible to express in a general way that the
> dataset is a subset of a parent dataset and itself is the parent of a
> collection of subsets. The data that are returned when a dataset URI is
> dereferenced could then include:
>
>
>     - A link to the parent dataset (if there is one)
>     - Links to child datasets (if they exist)
>     - Descriptions of how to get the actual data (if they are not included
>     in the response), for example the URI of a SPARQL endpoint or the URIs of
>     other standard web APIs
>     - Other general metadata, like spatial extent, temporal extent, human
>     readable labels, subject(s), etc.
>     - The actual data from the dataset
>
> A recommendation or good practice could be to include the actual data OR
> point to subsets. That way there is never a dead end when links are
> followed. A data provider could decide at which subset level the actual
> data are returned, for example the level at which the amount of data
> becomes manageable.
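
A rough sketch of the response contents listed above, together with the
"data OR subsets" rule. The class and field names are hypothetical and
the example.org URIs are placeholders; this is one possible shape, not
an agreed model.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class DatasetDescription:
    uri: str
    label: str                                          # human-readable label
    parent: Optional[str] = None                        # cf. dcterms:isPartOf
    subsets: List[str] = field(default_factory=list)    # cf. dcterms:hasPart
    access: List[str] = field(default_factory=list)     # e.g. a SPARQL endpoint
    data: Optional[list] = None                         # inline data, if manageable


def is_dead_end(d: DatasetDescription) -> bool:
    """True when the response has neither inline data nor links to subsets."""
    return d.data is None and not d.subsets


europe = DatasetDescription(
    uri="http://example.org/temp/eu",
    label="Europe",
    subsets=["http://example.org/temp/eu/nl"],
)
netherlands = DatasetDescription(
    uri="http://example.org/temp/eu/nl",
    label="Netherlands",
    parent=europe.uri,
    data=[{"station": "Amsterdam", "celsius": 4.5}],
)
placeholder = DatasetDescription(
    uri="http://example.org/temp/eu/xx",
    label="No data, no subsets",
    parent=europe.uri,
)

for d in (europe, netherlands, placeholder):
    print(d.label, "dead end" if is_dead_end(d) else "ok")
```

A publisher could run a check like this before deploying, so that every
dataset URI either carries data or points further down.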
>
> What I particularly like about this approach is that if the data server
> supports HTML (or another format that is supported by web crawlers), we
> will have satisfied the crawlability requirement
> <http://www.w3.org/TR/sdw-ucr/#Crawlability> and the discoverability
> requirement <http://www.w3.org/TR/sdw-ucr/#Discoverability>.  A web crawler
> could use any dataset URI as a starting point and by recursively visiting
> all links always have access to the complete dataset, in a way that does
> not require any fancy querying. I hope the search engine people (Ed,
> Charles) can confirm this...
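
To illustrate the crawling argument, here is a small sketch that walks
parent and subset links from any starting URI and collects all the data
it finds. The WEB dictionary stands in for HTTP dereferencing, and its
keys ("parent", "subsets", "data") and URIs are assumptions made for
this example only.

```python
# Stand-in for the Web: URI -> the document returned when it is dereferenced.
WEB = {
    "http://example.org/eu": {
        "subsets": ["http://example.org/eu/nl", "http://example.org/eu/be"],
    },
    "http://example.org/eu/nl": {
        "parent": "http://example.org/eu",
        "data": ["Amsterdam", "Utrecht"],
    },
    "http://example.org/eu/be": {
        "parent": "http://example.org/eu",
        "data": ["Brussels"],
    },
}


def crawl(uri, fetch, seen=None):
    """Follow parent and subset links recursively, returning all data found."""
    if seen is None:
        seen = set()
    if uri in seen:          # already visited, avoids looping on parent links
        return []
    seen.add(uri)
    doc = fetch(uri)
    records = list(doc.get("data", []))
    links = ([doc["parent"]] if "parent" in doc else []) + doc.get("subsets", [])
    for link in links:
        records.extend(crawl(link, fetch, seen))
    return records


# Starting from a leaf still reaches the whole dataset via the parent link.
everything = crawl("http://example.org/eu/be", WEB.__getitem__)
print(sorted(everything))
```

No query language is involved; plain link following is enough, which is
the point being made about crawlability.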
>
> Another thing I like about this approach is that the spatial properties of
> a dataset can be helpful in partitioning a dataset into manageable subsets.
> An obvious method would be to use administrative (mereological)
> relationships: a European dataset has a subset for each country, a country
> dataset has subsets for each province, and so on. If that possibility is
> absent, it should always be possible to use a tiling mechanism to partition
> the dataset into subsets. I like to think of this as a nice example of how
> geospatial practice can be beneficial to the Web as a whole.
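
For the tiling fallback, a minimal sketch assuming a simple regular grid
over a bounding box. The function names, the grid size and the
coordinates are all invented for the example; real tiling schemes are
more involved.

```python
def tile_bounds(bbox, rows, cols):
    """Split bbox = (min_x, min_y, max_x, max_y) into rows x cols tiles."""
    min_x, min_y, max_x, max_y = bbox
    w = (max_x - min_x) / cols
    h = (max_y - min_y) / rows
    return {
        (r, c): (min_x + c * w, min_y + r * h,
                 min_x + (c + 1) * w, min_y + (r + 1) * h)
        for r in range(rows)
        for c in range(cols)
    }


def tile_of(point, bbox, rows, cols):
    """Return the (row, col) tile containing a point inside bbox."""
    x, y = point
    min_x, min_y, max_x, max_y = bbox
    # Points on the upper edges are clamped into the last row or column.
    c = min(int((x - min_x) / (max_x - min_x) * cols), cols - 1)
    r = min(int((y - min_y) / (max_y - min_y) * rows), rows - 1)
    return r, c


BBOX = (0, 0, 10, 10)   # made-up extent
tiles = tile_bounds(BBOX, 2, 2)
print(tiles[(1, 1)])
print(tile_of((7, 1), BBOX, 2, 2))
```

Each (row, col) pair could then be minted as the URI of a subset, with
the tile bounds published as its spatial extent.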
>
> By the way, I would like to look at the transport.data.gov.uk examples, but
> I get 404s.
>
> Regards,
> Frans
>
> 2015-12-30 19:31 GMT+01:00 Phil Archer <phila@w3.org>:
>
>> At various times in recent months I have promised to look into the topic
>> of persistent identifiers for subsets of data. This came up at the SDW F2F
>> in Sapporo but has also been raised by Annette in DWBP. In between festive
>> activities I've been giving this some thought and have tried to begin to
>> commit some ideas to a page [1].
>>
>> During the CEO-LD meeting, Jeremy pointed to OpenSearch as a possible way
>> forward, including its geo-temporal extensions defined by the OGC. There is
>> also the Linked Data API as a means of doing this, and what they both have
>> in common is that they offer an intermediate layer that turns a URL into a
>> query.
>>
>> How do you define a persistent identifier for a subset of a dataset? IMO
>> you mint a URI and say "this identifies a subset of a dataset" - and then
>> provide a means of programmatically going from the URI to a query that
>> returns the subset. As long as you can replace the intermediate layer with
>> another one that also returns the same subset, we're done.
>>
>> The UK Government Linked Data examples tend to be along the lines of:
>>
>> http://transport.data.gov.uk/id/stations
>> returns a list of all stations in Britain.
>>
>> http://transport.data.gov.uk/id/stations/Manchester
>> returns a list of stations in Manchester
>>
>> http://transport.data.gov.uk/id/stations/Manchester/Piccadilly
>> identifies Manchester Piccadilly station.
>>
>> All of that data of course comes from a single dataset.
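
A sketch of the "URI to query" idea behind those three examples. This is
not how the Linked Data API or the data.gov.uk service is implemented;
the function names and the three sample rows are assumptions, and the
"query" is just a filter over an in-memory list.

```python
from urllib.parse import urlparse

# Illustrative rows standing in for the single underlying dataset.
STATIONS = [
    {"city": "Manchester", "name": "Piccadilly"},
    {"city": "Manchester", "name": "Victoria"},
    {"city": "Leeds", "name": "Leeds"},
]


def uri_to_filter(uri):
    """Map /id/stations[/{city}[/{name}]] to a dict of field constraints."""
    parts = urlparse(uri).path.strip("/").split("/")
    if parts[:2] != ["id", "stations"]:
        raise ValueError("not a stations URI: " + uri)
    # Remaining path segments constrain city, then station name.
    return dict(zip(("city", "name"), parts[2:]))


def resolve(uri, rows=STATIONS):
    """Return the subset of rows identified by the URI."""
    wanted = uri_to_filter(uri)
    return [row for row in rows
            if all(row[key] == value for key, value in wanted.items())]


base = "http://transport.data.gov.uk/id/stations"
print(len(resolve(base)))
print(resolve(base + "/Manchester"))
print(resolve(base + "/Manchester/Piccadilly"))
```

The URI stays the same if resolve() is later swapped for a SPARQL or
OpenSearch back end, as long as the same subset comes back.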
>>
>> Does this work in the real worlds of meteorology and UBL/PNNL?
>>
>> Phil.
>>
>> [1] https://github.com/w3c/sdw/blob/gh-pages/subsetting/index.md
>>
>> --
>>
>>
>> Phil Archer
>> W3C Data Activity Lead
>> http://www.w3.org/2013/data/
>>
>> http://philarcher.org
>> +44 (0)7887 767755
>> @philarcher1
>>
>>
>

-- 


Phil Archer
W3C Data Activity Lead
http://www.w3.org/2013/data/

http://philarcher.org
+44 (0)7887 767755
@philarcher1

Received on Friday, 1 January 2016 09:32:51 UTC