Re: Subsetting data from Frans Knibbe on 2016-01-04 (public-dwbp-comments@w3.org from January 2016)

From: Frans Knibbe <frans.knibbe@geodan.nl>
Date: Mon, 4 Jan 2016 18:36:52 +0100
To: Peter Baumann <p.baumann@jacobs-university.de>
Cc: Phil Archer <phila@w3.org>, Manolis Koubarakis <koubarak@di.uoa.gr>, "public-sdw-comments@w3.org" <public-sdw-comments@w3.org>, Annette Greiner <amgreiner@lbl.gov>, Eric Stephan <ericphb@gmail.com>, "Tandy, Jeremy" <jeremy.tandy@metoffice.gov.uk>, public-dwbp-comments@w3.org
Message-ID: <CAFVDz42+Cdt7GKG6eo-tdMx6W3RuRSpNNEK5nY9zYGn7ysbinA@mail.gmail.com>
2016-01-04 17:36 GMT+01:00 Peter Baumann <p.baumann@jacobs-university.de>:

> Frans-
>
> On 2016-01-04 13:48, Frans Knibbe wrote:
>
>
>
>
> 2016-01-04 13:20 GMT+01:00 Peter Baumann <p.baumann@jacobs-university.de>:
>
>> Hi Frans,
>>
>> data partitioning is an implementation detail which serves to quicker
>> determine the subset.
>>
>
> But that does not need to be the only reason for dataset partitioning. I
> can think of some others:
>
>    1. It could allow complete automatic retrieval of a complete dataset
>    without the need for a data dump (in a specific format) and without the
>    need for a specialized API. That could be very important for building
>    remote indexes (e.g. by a search engine);
>
>
> IMHO a search engine will not want to download full datasets - and if so,
> it will not be interested in how such a large dataset is split internally.
> Compare to an HTML file, who would want to know about the file system
> blocks it is split into?
>

Whether a search engine will want to download full datasets depends on the
purpose of the search engine and the datasets themselves. But I think there
is merit in giving search engines the opportunity to at least access all
data in a dataset and optionally download some or all of it for indexing.

Indeed a search engine perhaps is not interested in how the dataset is
subdivided (although it could be), but the fact that a dataset *is*
subdivided should be interesting *if* that very subdivision allows access
to all data in a straightforward manner (i.e. by following hyperlinks).
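
As a rough illustration of what "following hyperlinks" could mean for a
crawler, here is a minimal Python sketch. It assumes a made-up set of
dataset URIs (the /climate/... paths and the LINKS table are hypothetical,
and nothing is fetched over the network). The point is only that a
breadth-first walk from any node reaches every part of a well-linked
dataset.

```python
from collections import deque

# Hypothetical stand-in for a partitioned dataset: each URI maps to the
# links found when that URI is dereferenced (parent and child subsets).
LINKS = {
    "/climate": ["/climate/nl", "/climate/de"],
    "/climate/nl": ["/climate"],
    "/climate/de": ["/climate", "/climate/de/bremen"],
    "/climate/de/bremen": ["/climate/de"],
}


def crawl(start, links):
    """Return every URI reachable from `start` by following links."""
    seen = {start}
    queue = deque([start])
    while queue:
        uri = queue.popleft()
        for target in links.get(uri, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen


# Starting from a leaf still reaches the whole dataset, because every
# subset links back to its parent.
reached = crawl("/climate/de/bremen", LINKS)
print(sorted(reached))
```

The crawler needs no knowledge of how or why the dataset was subdivided,
only that the links exist in both directions.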


>
>
>
>    2. It allows automatic representation of the data in a human friendly
>    way (small enough HTML pages with annotation and navigation);
>
>
> while there might be special cases where you have a textual representation
> this is not the case with most data, such as sets of vectors or pixels. An
> image is best represented for a human as a visual pixel matrix, and
> delivering this again is independent from the storage organisation on a
> server.
>

Representation of data in a web browser does not have to be textual. It is
imaginable that sets of spatial data such as (collections of) vectors or
pixels will be recognized as such by a web browser and be displayed in an
appropriate manner (a map image, for example). Further standardization of
spatial data on the web can advance this cause.


>
>
>
>    3. It allows caching of meaningful and self-describing chunks of data.
>
>
> not sure what self-describing means in this context, but caching is
> independent from that. Again, looking at your HTML file do we want to know
> which of its file system blocks is in cache?
>

Perhaps I can try an example... Suppose there is a dataset with climate
data from all over the world. It has full metadata describing sources,
usage rights, spatial and temporal coverage, etc. It is partitioned into
subsets that have data for each country. Each subset has its own metadata,
particularly its unique metadata (it can refer to the metadata of the
parent dataset for general metadata). It is imaginable that some national
agency wants to provide quick access to just the subset of national data,
or that consumers in a particular country have a preference for only
national data. So the national subset is cached locally, but it is still
a meaningful and properly described entity by itself. And there is no need
to include data from other countries in the cache.
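
To make the example a bit more concrete, here is a minimal Python sketch
of a subset that carries only its own distinguishing metadata and refers
to its parent for the rest. The Dataset class, the URIs and the metadata
values (including the license) are all invented for illustration; this is
not an existing vocabulary or API.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Dataset:
    uri: str
    metadata: dict = field(default_factory=dict)  # this dataset's own metadata
    parent: Optional["Dataset"] = None            # superset, if there is one

    def describe(self):
        """Merge metadata inherited from the parent with our own; ours wins."""
        inherited = self.parent.describe() if self.parent else {}
        return {**inherited, **self.metadata}


# Hypothetical world-wide climate dataset with general metadata.
world = Dataset(
    "/climate",
    {"license": "CC-BY", "spatial": "World", "source": "example agency"},
)

# The national subset only states what is unique to it.
nl = Dataset("/climate/nl", {"spatial": "Netherlands"}, parent=world)

print(nl.describe())
```

A cache holding only the national subset can still answer questions about
usage rights or sources by resolving the reference to the parent dataset.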


>
>
>
>
>
>> As such, any subsetting or querying interface should remain agnostic of
>> it, otherwise it runs the risk of supporting a particular implementation
>> (and there are quite a few specialized implementations out there).
>>
>
> I think the concept of dataset partitioning could be handled in a very
> general way. Perhaps the only required elements are:
>
>    1. express that something (a web resource) is a dataset (and could
>    therefore be a subset or a superset);
>    2. express that a dataset is a superset or subset of another dataset.
>
> This could work with any type of data and with any query interface, I
> think.
>
>
> well, we don't know - I have raised the question a couple of times, but it
> does not find much sympathy: what type of data are we talking about, and
> what does subsetting mean for each of them?
>

I think the general scope of this discussion is best practices. Ideally best
practices help with all kinds of data (spatial data in the SDWWG case).
Subsetting comes into play as soon as datasets become unwieldy as a whole.
Of course that is a flexible definition, but the threshold could be
something like 'no longer fits comfortably on a web page' or 'is larger
than two megabytes'.
If the dataset is larger, partitioning can be considered and could be
beneficial for data consumption.
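
A minimal Python sketch of such a size rule, under some assumptions of my
own: records are plain dictionaries, size is measured as the length of
their JSON serialization, and the partitioning keys ("country", "year")
are chosen by the publisher. The function names and the sample records
are hypothetical.

```python
import json

SIZE_LIMIT = 2_000_000  # bytes; the 'two megabytes' above, an arbitrary choice


def serialized_size(records):
    """Size in bytes of the records when serialized as JSON."""
    return len(json.dumps(records).encode("utf-8"))


def partition(records, keys, limit=SIZE_LIMIT):
    """Return the records unchanged if they are small enough (or no keys are
    left); otherwise split on the first key and recurse into each group."""
    if not keys or serialized_size(records) <= limit:
        return records
    groups = {}
    for record in records:
        groups.setdefault(record[keys[0]], []).append(record)
    return {
        value: partition(group, keys[1:], limit)
        for value, group in groups.items()
    }


records = [
    {"country": "NL", "year": 2015, "t": 10.1},
    {"country": "NL", "year": 2016, "t": 10.7},
    {"country": "DE", "year": 2015, "t": 9.6},
]

# A tiny limit is used here only so that the toy data gets partitioned.
tree = partition(records, ["country", "year"], limit=60)
print(tree)
```

Only the parts that are still too large get subdivided further, so the
depth of the resulting tree follows the amount of data.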

It seems like a good idea to me to test any proposed best practices with
different types of data, particularly n-dimensional coverage data.

Greetings,
Frans


> -Peter
>
>
>
> Regards,
> Frans
>
>>
>>
>> -Peter
>>
>>
>>
>> On 2016-01-04 13:14, Frans Knibbe wrote:
>>
>>
>>
>> 2016-01-01 10:33 GMT+01:00 Phil Archer <phila@w3.org>:
>>
>>>
>>>
>>> On 31/12/2015 10:54, Frans Knibbe wrote:
>>>
>>>> Phil,
>>>>
>>>> Thank you for bringing up an interesting subject at a time where not much
>>>> seems to be going on.
>>>>
>>>> I think a key question is: Which data should be returned when a dataset URI
>>>> is dereferenced?
>>>>
>>>> And I think the answer should be: at least the metadata describing the
>>>> dataset or the subset, and optionally the actual data.
>>>>
>>>
>>> I'd say: If I ask for the current temperature in Amsterdam, that's what
>>> I want. Good practice would be to include metadata, or links to it
>>> (dcterms:isPartOf <allTemperaturesInNL>).
>>>
>>> I don't disagree that those things are important, metadata clearly is  -
>>> and goodness knows I like links :-) It's a scoping/capacity to deliver
>>> question.
>>>
>>
>> It probably also has to do with the kind of subsets we have in mind.
>> Perhaps two kinds can be distinguished:
>>
>> 1) subsets that are predetermined subdivisions of a dataset. The dataset
>> partitioning is done by the data publisher and likely to be documented (e.g.
>> by means of dataset metadata). The partitioning is likely to be obvious and
>> useful to many consumers, e.g. time slices in a temporal data set, tiles or
>> administrative subdivisions in a geographic data set.
>> 2) subsets that are the result of an *ad hoc* query, ephemeral subsets.
>>
>> The two do not need to be mutually exclusive.
>>
>> I particularly like the possibilities of agreed best practices for the
>> first type of subsets. A predetermined partitioning of a dataset could result
>> in at least one logical tree of data, in which branches and leaves are well
>> linked and thus allow for easy navigation within the dataset, for humans
>> and machines (crawlers) alike. I think there could be interesting
>> possibilities for common practices in structuring and describing a
>> partitioned dataset, a main advantage being improved discoverability because
>> a crawler could easily access all the data in a dataset from the root node,
>> or from any node in a dataset. For humans the predetermined structure
>> should be useful too because it could help serving HTML pages that are not
>> too big, are well documented and are easy to navigate.
>>
>> How do you see the relationship between the topic of persistent
>> identifiers for subsets and the SDWWG requirements for linkability,
>> discoverability and crawlability of data?
>>
>> Regards,
>> Frans
>>
>>
>>
>>
>>>
>>> Phil
>>>
>>>
>>>> When discussing datasets and subsets it is good to look at the Vocabulary
>>>> of Interlinked Datasets (VoID) <http://www.w3.org/TR/void/>, although its
>>>> scope could be too narrow because it is intended to be used for RDF data.
>>>> It can be used to make clear that a chunk of data describes a dataset
>>>> (void:Dataset <http://rdfs.org/ns/void#Dataset>) and has subsets
>>>> (void:subset <http://www.w3.org/TR/void/#subset>). The Data Catalog
>>>> Vocabulary <http://www.w3.org/TR/vocab-dcat/> has a broader scope (it can
>>>> be used for any dataset) and has its own definition of a dataset
>>>> (dcat:Dataset <http://www.w3.org/ns/dcat#Dataset>). DCAT does not seem to
>>>> have a way of identifying subsets, but I guess dcterms:hasPart
>>>> <http://purl.org/dc/terms/hasPart> and dcterms:isPartOf
>>>> <http://purl.org/dc/terms/isPartOf> can be used to express parent-child
>>>> relationships between data collections (dataset mereology).
>>>>
>>>> So let's assume it is possible to indicate that a set of data describe a
>>>> dataset and that it is possible to express in a general way that the
>>>> dataset is a subset of a parent dataset and itself is the parent of a
>>>> collection of subsets. The data that are returned when a dataset URI is
>>>> dereferenced could then include:
>>>>
>>>>
>>>>     - A link to the parent dataset (if there is one)
>>>>     - Links to child datasets (if they exist)
>>>>     - Descriptions of how to get the actual data (if they are not included
>>>>     in the response), for example the URI of a SPARQL endpoint or the URIs
>>>>     of other standard web APIs
>>>>     - Other general metadata, like spatial extent, temporal extent, human
>>>>     readable labels, subject(s), etc.
>>>>     - The actual data from the dataset
>>>>
>>>> A recommendation or good practice could be to include the actual data OR
>>>> point to subsets. That way there is never a dead end when links are
>>>> followed. A data provider could decide the best level of a subset
>>>> returning
>>>> actual data, for example when the amount of data is manageable.
>>>>
>>>> What I particularly like about this approach is that if the data server
>>>> supports HTML (or another format that is supported by web crawlers), we
>>>> will have satisfied the crawlability requirement
>>>> <http://www.w3.org/TR/sdw-ucr/#Crawlability> and the discoverability
>>>> requirement <http://www.w3.org/TR/sdw-ucr/#Discoverability>. A web
>>>> crawler could use any dataset URI as a starting point and by recursively
>>>> visiting all links always have access to the complete dataset, in a way
>>>> that does not require any fancy querying. I hope the search engine people
>>>> (Ed, Charles) can confirm this...
>>>>
>>>> Another thing I like about this approach is that the spatial properties of
>>>> a dataset can be helpful in partitioning a dataset into manageable subsets.
>>>> An obvious method would be to use administrative (mereological)
>>>> relationships: a European dataset has subsets for each country, a country
>>>> dataset has subsets for each province, and so on. If that possibility is
>>>> absent it should always be possible to use a tiling mechanism to partition
>>>> the dataset into subsets. I like to think of this as a nice example of how
>>>> geospatial practice can be beneficial to the Web as a whole.
>>>>
>>>> By the way, I would like to look at the transport.data.gov.uk examples, but
>>>> I get 404s.
>>>>
>>>> Regards,
>>>> Frans
>>>>
>>>> 2015-12-30 19:31 GMT+01:00 Phil Archer <phila@w3.org>:
>>>>
>>>>> At various times in recent months I have promised to look into the topic
>>>>> of persistent identifiers for subsets of data. This came up at the SDW F2F
>>>>> in Sapporo but has also been raised by Annette in DWBP. In between festive
>>>>> activities I've been giving this some thought and have tried to begin to
>>>>> commit some ideas to a page [1].
>>>>>
>>>>> During the CEO-LD meeting, Jeremy pointed to OpenSearch as a possible
>>>>> way
>>>>> forward, including its geo-temporal extensions defined by the OGC.
>>>>> There is
>>>>> also the Linked Data API as a means of doing this, and what they both
>>>>> have
>>>>> in common is that they offer an intermediate layer that turns a URL
>>>>> into a
>>>>> query.
>>>>>
>>>>> How do you define a persistent identifier for a subset of a dataset?
>>>>> IMO
>>>>> you mint a URI and say "this identifies a subset of a dataset" - and
>>>>> then
>>>>> provide a means of programmatically going from the URI to a query that
>>>>> returns the subset. As long as you can replace the intermediate layer
>>>>> with
>>>>> another one that also returns the same subset, we're done.
>>>>>
>>>>> The UK Government Linked Data examples tend to be along the lines of:
>>>>>
>>>>> http://transport.data.gov.uk/id/stations
>>>>> returns a list of all stations in Britain.
>>>>>
>>>>> http://transport.data.gov.uk/id/stations/Manchester
>>>>> returns a list of stations in Manchester
>>>>>
>>>>> http://transport.data.gov.uk/id/stations/Manchester/Piccadilly
>>>>> identifies Manchester Piccadilly station.
>>>>>
>>>>> All of that data of course comes from a single dataset.
>>>>>
>>>>> Does this work in the real worlds of meteorology and UBL/PNNL?
>>>>>
>>>>> Phil.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> [1] https://github.com/w3c/sdw/blob/gh-pages/subsetting/index.md
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>>
>>>>>
>>>>> Phil Archer
>>>>> W3C Data Activity Lead
>>>>> http://www.w3.org/2013/data/
>>>>>
>>>>> http://philarcher.org
>>>>> +44 (0)7887 767755
>>>>> @philarcher1
>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>>
>> --
>> Dr. Peter Baumann
>>  - Professor of Computer Science, Jacobs University Bremen
>>    www.faculty.jacobs-university.de/pbaumann
>>    mail: p.baumann@jacobs-university.de
>>    tel: +49-421-200-3178, fax: +49-421-200-493178
>>  - Executive Director, rasdaman GmbH Bremen (HRB 26793)
>>    www.rasdaman.com, mail: baumann@rasdaman.com
>>    tel: 0800-rasdaman, fax: 0800-rasdafax, mobile: +49-173-5837882
>> "Si forte in alienas manus oberraverit hec peregrina epistola incertis ventis dimissa, sed Deo commendata, precamur ut ei reddatur cui soli destinata, nec preripiat quisquam non sibi parata." (mail disclaimer, AD 1083)
>>
>>
>>
>>
>
Received on Monday, 4 January 2016 17:37:29 UTC