Re: Subsetting data

Funny you should mention worldwide temperature data since 1901. That 
exists at NERSC, and it's even more complex.
http://portal.nersc.gov/project/20C_Reanalysis/
contains data with many more potentially organizing variables than 
lat/lon and time. Making every potential organization of the data 
available separately would be a waste of time. Instead, we allow the 
user to make a query via a web form that retrieves (via OPeNDAP) the 
data they want. They can choose a specific measure of interest and get 
data over a range combining lats, lons, ensemble members, and times.
This is what I mean by enabling retrieval of subsets.
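
As a rough illustration of the kind of request such a form might generate, here is a minimal sketch that builds an OPeNDAP (DAP2) constraint expression selecting an index range along each dimension. The base URL, the variable name `air`, and the dimension order (time, ensemble member, lat, lon) are assumptions made up for illustration, not the actual layout of the 20C Reanalysis files.

```python
def hyperslab(start, stop, stride=1):
    """Format one DAP2 index range as [start:stride:stop] (stop is inclusive)."""
    if start < 0 or stop < start or stride < 1:
        raise ValueError("need 0 <= start <= stop and stride >= 1")
    return f"[{start}:{stride}:{stop}]"


def subset_url(base_url, variable, ranges, response="ascii"):
    """Build a DAP2 request URL for one variable restricted to index ranges.

    ranges holds one (start, stop) or (start, stop, stride) tuple per
    dimension, in the order the variable declares its dimensions.
    """
    constraint = variable + "".join(hyperslab(*r) for r in ranges)
    return f"{base_url}.{response}?{constraint}"


# Hypothetical dataset URL and variable name, for illustration only.
BASE = "http://example.org/opendap/temperature.nc"
url = subset_url(
    BASE,
    "air",
    [(0, 11), (0, 4), (30, 40), (100, 120)],  # time, member, lat, lon indices
)
print(url)
```

The ranges here are array indices rather than coordinate values, so a real form would first have to map the requested lats, lons, and times onto indices. Some HTTP clients also require the square brackets to be percent-encoded.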

The organization by measurement type works for our users, but one might 
want to grab different subsets if one's users were, say, the residents of 
a certain locale who wanted to look at temperature changes in their home 
town. That would not be worth the effort for us, but someone else could 
come along and download the relevant larger subsets and reuse the data 
in that way. I guess what I'm saying is that one has to strike a balance 
between making the data easy to access for the expected uses and 
enabling others to take things further with a little more work.
-Annette

On 1/4/16 10:05 AM, Frans Knibbe wrote:
>
>
> 2016-01-04 17:57 GMT+01:00 Jon Blower <j.d.blower@reading.ac.uk>:
>
>     Hi all,
>
>     [snip]
>
>     For what it’s worth, I agree with Frans’ view of starting with a
>     very simple approach to subsetting. The first problem to solve is
>     simply how to record that Subset A is a part of Dataset X (and
>     vice versa), assuming that both can be identified (with URIs). In
>     MELODIES we’ve been using dct:isPartOf and dct:hasPart for this,
>     but we’re just experimenting. Both the subset and the parent
>     dataset can be typed as Datasets.
>
>
> Dct:Dataset, I presume? It is nice to see that a single vocabulary 
> already provides the main ingredients. Although perhaps some reasoning 
> is required to handle datasets with subsets (if a dataset has 
> dct:isPartOf properties it is a subset, if it has dct:hasPart 
> properties it is a superset, if it only has dct:hasPart properties it 
> is a root dataset), something that perhaps not all data consumers can 
> be expected to perform?
>
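
The reasoning described above is small enough to sketch. The following assumes the part-of links have already been extracted as (subject, predicate, object) tuples. The dataset URIs and the labels 'root', 'intermediate', and 'leaf' are hypothetical. It treats dct:hasPart and dct:isPartOf as inverses of each other, so the consumer does not depend on the publisher stating both directions.

```python
DCT = "http://purl.org/dc/terms/"
HAS_PART = DCT + "hasPart"
IS_PART_OF = DCT + "isPartOf"


def classify(triples):
    """Label each dataset as 'root', 'intermediate', or 'leaf'.

    A dataset with parts but no parent is a root; one with a parent
    but no parts is a leaf; one with both is intermediate.
    """
    parents, children = {}, {}
    for s, p, o in triples:
        if p == HAS_PART:
            whole, part = s, o
        elif p == IS_PART_OF:
            whole, part = o, s
        else:
            continue  # ignore predicates unrelated to subsetting
        children.setdefault(whole, set()).add(part)
        parents.setdefault(part, set()).add(whole)

    labels = {}
    for node in set(parents) | set(children):
        has_parent = node in parents
        has_child = node in children
        if has_child and not has_parent:
            labels[node] = "root"
        elif has_child:
            labels[node] = "intermediate"
        else:
            labels[node] = "leaf"
    return labels


# Hypothetical URIs, for illustration only.
X = "http://example.org/dataset/X"
A = "http://example.org/dataset/X/1901"
A1 = "http://example.org/dataset/X/1901/jan"
triples = [(X, HAS_PART, A), (A1, IS_PART_OF, A)]
labels = classify(triples)
```

A consumer that cannot run even this much logic would need the publisher to state the root explicitly.
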
> I do wonder if there already is a good way of describing the 
> subsetting method, or the fact that multiple subsetting methods are 
> used.  Perhaps that is not vital information, but it could be useful. 
> Suppose there is a dataset containing worldwide temperature values 
> since 1901. Both temporal and spatial subdivisions would make sense, 
> so a data provider could decide to make subsets available by year and 
> by country. Each partitioning tree would encompass all data, so if it 
> is not made clear that two partitioning schemes are used for the same 
> dataset a not-so-smart consumer might wind up downloading the data twice.
>
> Greetings,
> Frans
>
>
>
>
>
>     Cheers,
>     Jon
>
>
>
>>     On 4 Jan 2016, at 16:36, Peter Baumann
>>     <p.baumann@jacobs-university.de> wrote:
>>
>>     Frans-
>>
>>     On 2016-01-04 13:48, Frans Knibbe wrote:
>>>
>>>
>>>
>>>     2016-01-04 13:20 GMT+01:00 Peter Baumann
>>>     <p.baumann@jacobs-university.de>:
>>>
>>>         Hi Frans,
>>>
>>>         data partitioning is an implementation detail which serves
>>>         to determine the subset more quickly.
>>>
>>>
>>>     But that does not need to be the only reason for dataset
>>>     partitioning. I can think of some others:
>>>
>>>      1. It could allow fully automatic retrieval of a complete
>>>         dataset without the need for a data dump (in a specific
>>>         format) and without the need for a specialized API. That
>>>         could be very important for building remote indexes (e.g. by
>>>         a search engine);
>>>
>>
>>     IMHO a search engine will not want to download full datasets -
>>     and if so, it will not be interested in how such a large dataset
>>     is split internally. Compare to an HTML file: who would want to
>>     know about the file system blocks it is split into?
>>
>>>      2. It allows automatic representation of the data in a human
>>>         friendly way (small enough HTML pages with annotation and
>>>         navigation);
>>>
>>
>>     while there might be special cases where you have a textual
>>     representation, this is not the case with most data, such as sets
>>     of vectors or pixels. An image is best represented for a human as
>>     a visual pixel matrix, and delivering this again is independent
>>     from the storage organisation on a server.
>>
>>>      3. It allows caching of meaningful and self-describing chunks
>>>         of data.
>>>
>>
>>     not sure what self-describing means in this context, but caching
>>     is independent from that. Again, looking at your HTML file, do we
>>     want to know which of its file system blocks is in cache?
>>
>>
>>>         As such, any subsetting or querying interface should remain
>>>         agnostic of it, otherwise it runs the risk of supporting a
>>>         particular implementation (and there are quite a few
>>>         specialized implementations out there).
>>>
>>>
>>>     I think the concept of dataset partitioning could be handled in
>>>     a very general way. Perhaps the only required elements are:
>>>
>>>      1. express that something (a web resource) is a dataset (and
>>>         could therefore be a subset or a superset);
>>>      2. express that a dataset is a superset or subset of another
>>>         dataset.
>>>
>>>     This could work with any type of data and with any query
>>>     interface, I think.
>>
>>     well, we don't know - I have raised the question a couple of
>>     times, but it does not find much sympathy: what type of data are
>>     we talking about, and what does subsetting mean for each of them?
>>
>>     -Peter
>>
>
>

-- 
Annette Greiner
NERSC Data and Analytics Services
Lawrence Berkeley National Laboratory

Received on Monday, 4 January 2016 19:14:35 UTC