Re: Subsetting data from Antoine Isaac on 2016-01-04 (public-dwbp-comments@w3.org from January 2016)

From: Antoine Isaac <aisaac@few.vu.nl>
Date: Mon, 4 Jan 2016 19:19:47 +0100
To: Frans Knibbe <frans.knibbe@geodan.nl>, Jon Blower <j.d.blower@reading.ac.uk>
CC: Peter Baumann <p.baumann@jacobs-university.de>, Phil Archer <phila@w3.org>, Manolis Koubarakis <koubarak@di.uoa.gr>, "public-sdw-comments@w3.org" <public-sdw-comments@w3.org>, Annette Greiner <amgreiner@lbl.gov>, Eric Stephan <ericphb@gmail.com>, "Tandy, Jeremy" <jeremy.tandy@metoffice.gov.uk>, "public-dwbp-comments@w3.org" <public-dwbp-comments@w3.org>
Message-ID: <568AB7C3.6000002@few.vu.nl>

Hi everyone,

For the record, VoID already has a pattern to represent the sort of metadata you seem to be after:
http://www.w3.org/TR/void/#subset

I must say I'm not entirely sure it matches your need, as I confess I don't fully get why there is such a long thread on this topic in the first place. In fact I agree with Annette's earlier two-paragraph summary as a BP and am not sure we should go much further (of course an example would be good, hence my pointing to the VoID pattern in the hope that it helps us not to re-invent the wheel).
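
As a rough illustration only, here is a minimal sketch of how such part-of links could be recorded and read back. It uses plain Python tuples rather than an RDF library, the example.org dataset URIs are made up, and the dataset_roles helper is hypothetical. The only real term is void:subset, which points from the parent dataset to its subset. The role rules follow the ones Frans describes below for dct:hasPart / dct:isPartOf.

```python
# Illustrative sketch only: the example.org URIs and dataset_roles() are
# hypothetical. void:subset links a parent dataset to one of its subsets.
VOID_SUBSET = "http://rdfs.org/ns/void#subset"

BASE = "http://example.org/temperature"  # assumed root dataset URI

# (subject, predicate, object) triples: subject has the object as a subset.
triples = [
    (BASE, VOID_SUBSET, BASE + "/2015"),            # split by year
    (BASE, VOID_SUBSET, BASE + "/NL"),              # split by country
    (BASE + "/2015", VOID_SUBSET, BASE + "/2015/01"),
]


def dataset_roles(triples):
    """Map each dataset URI to the roles implied by its void:subset links."""
    parents = {s for s, p, o in triples if p == VOID_SUBSET}
    children = {o for s, p, o in triples if p == VOID_SUBSET}
    roles = {}
    for uri in parents | children:
        found = set()
        if uri in children:
            found.add("subset")
        if uri in parents:
            found.add("superset")
            # A dataset that has parts but is nobody's part is a root.
            if uri not in children:
                found.add("root")
        roles[uri] = found
    return roles


roles = dataset_roles(triples)
for uri in sorted(roles):
    print(uri, sorted(roles[uri]))
```

Note that nothing here says which subsetting method (by year, by country) produced a given link; that would need extra description on top of the bare subset relation.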

Best,

Antoine

On 1/4/16 7:05 PM, Frans Knibbe wrote:
>
>
> 2016-01-04 17:57 GMT+01:00 Jon Blower <j.d.blower@reading.ac.uk>:
>
>     Hi all,
>
>     [snip]
>
>     For what it’s worth, I agree with Frans’ view of starting with a very simple approach to subsetting. The first problem to solve is simply how to record that Subset A is a part of Dataset X (and vice versa), assuming that both can be identified (with URIs). In MELODIES we’ve been using dct:isPartOf and dct:hasPart for this, but we’re just experimenting. Both the subset and the parent dataset can be typed as Datasets.
>
>
> Dct:Dataset, I presume? It is nice to see that a single vocabulary already provides the main ingredients. Although perhaps some reasoning is required to handle datasets with subsets (if a dataset has dct:isPartOf properties it is a subset, if it has dct:hasPart properties it is a superset, if it only has dct:hasPart properties it is a root dataset), something that perhaps not all data consumers can be expected to perform?
>
> I do wonder if there already is a good way of describing the subsetting method, or the fact that multiple subsetting methods are used. Perhaps that is not vital information, but it could be useful. Suppose there is a dataset containing worldwide temperature values since 1901. Both temporal and spatial subdivisions would make sense, so a data provider could decide to make subsets available by year and by country. Each partitioning tree would encompass all data, so if it is not made clear that two partitioning schemes are used for the same dataset, a not-so-smart consumer might wind up downloading the data twice.
>
> Greetings,
> Frans
>
>
>
>
>
>     Cheers,
>     Jon
>
>
>
>>     On 4 Jan 2016, at 16:36, Peter Baumann <p.baumann@jacobs-university.de> wrote:
>>
>>     Frans-
>>
>>     On 2016-01-04 13:48, Frans Knibbe wrote:
>>>
>>>
>>>
>>>     2016-01-04 13:20 GMT+01:00 Peter Baumann <p.baumann@jacobs-university.de>:
>>>
>>>         Hi Frans,
>>>
>>>         data partitioning is an implementation detail which serves to determine the subset more quickly.
>>>
>>>
>>>     But that does not need to be the only reason for dataset partitioning. I can think of some others:
>>>
>>>      1. It could allow fully automatic retrieval of a complete dataset without the need for a data dump (in a specific format) and without the need for a specialized API. That could be very important for building remote indexes (e.g. by a search engine);
>>>
>>
>>     IMHO a search engine will not want to download full datasets - and if so, it will not be interested in how such a large dataset is split internally. Compare this to an HTML file: who would want to know about the file system blocks it is split into?
>>
>>>      2. It allows automatic representation of the data in a human friendly way (small enough HTML pages with annotation and navigation);
>>>
>>
>>     while there might be special cases where you have a textual representation, this is not the case with most data, such as sets of vectors or pixels. An image is best represented for a human as a visual pixel matrix, and delivering this again is independent from the storage organisation on a server.
>>
>>>      3. It allows caching of meaningful and self-describing chunks of data.
>>>
>>
>>     not sure what self-describing means in this context, but caching is independent from that. Again, looking at your HTML file, do we want to know which of its file system blocks is in cache?
>>
>>
>>>         As such, any subsetting or querying interface should remain agnostic of it, otherwise it runs the risk of supporting a particular implementation (and there are quite a few specialized implementations out there).
>>>
>>>
>>>     I think the concept of dataset partitioning could be handled in a very general way. Perhaps the only required elements are:
>>>
>>>      1. express that something (a web resource) is a dataset (and could therefore be a subset or a superset);
>>>      2. express that a dataset is a superset or subset of another dataset.
>>>
>>>     This could work with any type of data and with any query interface, I think.
>>
>>     well, we don't know - I have raised the question a couple of times, but it does not find much sympathy: what type of data are we talking about, and what does subsetting mean for each of them?
>>
>>     -Peter
>>
>>>
>>>     Re
>
>

Received on Monday, 4 January 2016 18:20:20 UTC