- From: Frans Knibbe <frans.knibbe@geodan.nl>
- Date: Thu, 31 Dec 2015 11:54:30 +0100
- To: Phil Archer <phila@w3.org>
- Cc: Manolis Koubarakis <koubarak@di.uoa.gr>, "public-sdw-comments@w3.org" <public-sdw-comments@w3.org>, Annette Greiner <amgreiner@lbl.gov>, Eric Stephan <ericphb@gmail.com>, "Tandy, Jeremy" <jeremy.tandy@metoffice.gov.uk>, public-dwbp-comments@w3.org
- Message-ID: <CAFVDz40vYoZarxedJJ5ndO43zh5HUpkyPbuJofEuqds3VTsVsQ@mail.gmail.com>
Phil,

Thank you for bringing up an interesting subject at a time when not much seems to be going on.

I think a key question is: which data should be returned when a dataset URI is dereferenced? And I think the answer should be: at least the metadata describing the dataset or the subset, and optionally the actual data.

When discussing datasets and subsets it is good to look at the Vocabulary of Interlinked Datasets (VoID) <http://www.w3.org/TR/void/>, although its scope could be too narrow because it is intended to be used for RDF data. It can be used to make clear that a chunk of data describes a dataset (void:Dataset <http://rdfs.org/ns/void#Dataset>) and has subsets (void:subset <http://www.w3.org/TR/void/#subset>).

The Data Catalog Vocabulary <http://www.w3.org/TR/vocab-dcat/> has a broader scope (it can be used for any dataset) and has its own definition of a dataset (dcat:Dataset <http://www.w3.org/ns/dcat#Dataset>). DCAT does not seem to have a way of identifying subsets, but I guess dcterms:hasPart <http://purl.org/dc/terms/hasPart> and dcterms:isPartOf <http://purl.org/dc/terms/isPartOf> can be used to express parent-child relationships between data collections (dataset mereology).

So let's assume it is possible to indicate that a set of data describes a dataset, and that it is possible to express in a general way that the dataset is a subset of a parent dataset and is itself the parent of a collection of subsets. The data that are returned when a dataset URI is dereferenced could then include:

- A link to the parent dataset (if there is one)
- Links to child datasets (if they exist)
- Descriptions of how to get the actual data (if they are not included in the response), for example the URI of a SPARQL endpoint or the URIs of other standard web APIs
- Other general metadata, like spatial extent, temporal extent, human readable labels, subject(s), etc.
- The actual data from the dataset

A recommendation or good practice could be to include the actual data OR point to subsets. That way there is never a dead end when links are followed. A data provider could decide at which subset level it is best to return actual data, for example the level at which the amount of data is manageable.

What I particularly like about this approach is that if the data server supports HTML (or another format that is supported by web crawlers), we will have satisfied the crawlability requirement <http://www.w3.org/TR/sdw-ucr/#Crawlability> and the discoverability requirement <http://www.w3.org/TR/sdw-ucr/#Discoverability>. A web crawler could use any dataset URI as a starting point and, by recursively visiting all links, always have access to the complete dataset, in a way that does not require any fancy querying. I hope the search engine people (Ed, Charles) can confirm this...

Another thing I like about this approach is that the spatial properties of a dataset can be helpful in partitioning a dataset into manageable subsets. An obvious method would be to use administrative (mereological) relationships: a European dataset has a subset for each country, a country dataset has subsets for each province, and so on. If that possibility is absent it should always be possible to use a tiling mechanism to partition the dataset into subsets. I like to think of this as a nice example of how geospatial practice can be beneficial to the Web as a whole.

By the way, I would like to look at the transport.data.gov.uk examples, but I get 404s.

Regards,
Frans

2015-12-30 19:31 GMT+01:00 Phil Archer <phila@w3.org>:

> At various times in recent months I have promised to look into the topic
> of persistent identifiers for subsets of data. This came up at the SDW F2F
> in Sapporo but has also been raised by Annette in DWBP. In between festive
> activities I've been giving this some thought and have tried to begin to
> commit some ideas to a page [1].
>
> During the CEO-LD meeting, Jeremy pointed to OpenSearch as a possible way
> forward, including its geo-temporal extensions defined by the OGC. There is
> also the Linked Data API as a means of doing this, and what they both have
> in common is that they offer an intermediate layer that turns a URL into a
> query.
>
> How do you define a persistent identifier for a subset of a dataset? IMO
> you mint a URI and say "this identifies a subset of a dataset" - and then
> provide a means of programmatically going from the URI to a query that
> returns the subset. As long as you can replace the intermediate layer with
> another one that also returns the same subset, we're done.
>
> The UK Government Linked Data examples tend to be along the lines of:
>
> http://transport.data.gov.uk/id/stations
> returns a list of all stations in Britain.
>
> http://transport.data.gov.uk/id/stations/Manchester
> returns a list of stations in Manchester
>
> http://transport.data.gov.uk/id/stations/Manchester/Piccadilly
> identifies Manchester Piccadilly station.
>
> All of that data of course comes from a single dataset.
>
> Does this work in the real worlds of meteorology and UBL/PNNL?
>
> Phil.
>
> [1] https://github.com/w3c/sdw/blob/gh-pages/subsetting/index.md
>
> --
> Phil Archer
> W3C Data Activity Lead
> http://www.w3.org/2013/data/
>
> http://philarcher.org
> +44 (0)7887 767755
> @philarcher1
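
The "include the actual data OR point to subsets" practice proposed in the reply above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the `DatasetDescription` shape, the `crawl` helper, the in-memory `WEB` dictionary standing in for HTTP dereferencing, and the example.org URIs are all hypothetical and do not come from VoID, DCAT, or any existing service. It only shows that a crawler starting at any dataset URI can reach all the data by following subset links, without any querying.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class DatasetDescription:
    """Hypothetical shape of what dereferencing a dataset URI returns."""
    uri: str
    label: str                                        # human readable label
    parent: Optional[str] = None                      # cf. dcterms:isPartOf
    subsets: list[str] = field(default_factory=list)  # cf. dcterms:hasPart
    data: Optional[list[str]] = None                  # actual data, if included

    def is_dead_end(self) -> bool:
        # The proposed practice: a response carries data OR links to subsets.
        return self.data is None and not self.subsets


def crawl(start_uri: str,
          dereference: Callable[[str], DatasetDescription]) -> list[str]:
    """Collect all data reachable from start_uri by following subset links."""
    seen: set[str] = set()
    collected: list[str] = []
    stack = [start_uri]
    while stack:
        uri = stack.pop()
        if uri in seen:          # guard against cycles and repeated links
            continue
        seen.add(uri)
        description = dereference(uri)
        if description.data is not None:
            collected.extend(description.data)
        stack.extend(description.subsets)
    return collected


# Hypothetical URIs and contents; a dict lookup stands in for an HTTP GET.
EUROPE = "http://example.org/id/europe"
NL = "http://example.org/id/europe/nl"
BE = "http://example.org/id/europe/be"

WEB = {
    EUROPE: DatasetDescription(EUROPE, "Europe", subsets=[NL, BE]),
    NL: DatasetDescription(NL, "Netherlands", parent=EUROPE,
                           data=["Amsterdam", "Utrecht"]),
    BE: DatasetDescription(BE, "Belgium", parent=EUROPE, data=["Brussels"]),
}

all_data = crawl(EUROPE, WEB.__getitem__)
```

In this sketch the European dataset returns only links to its country subsets, while each country subset is small enough to return its data directly, so no response is a dead end.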
Received on Thursday, 31 December 2015 10:55:02 UTC