Re: Subsetting data

Hello Annette,

I think the two types of subsetting I mentioned earlier (a dataset
partition that is predetermined by the data provider, and subsets that are
based on *ad hoc* queries using an API) each have their advantages, so the
best situation would be for them to coexist.

Greetings,
Frans

2016-01-04 20:14 GMT+01:00 Annette Greiner <amgreiner@lbl.gov>:

> Funny you should mention worldwide temperature data since 1901. That
> exists at NERSC, and it's even more complex.
> http://portal.nersc.gov/project/20C_Reanalysis/
> contains data with many more potentially organizing variables than lat/lon
> and time. Making every potential organization of the data available
> separately would be a waste of time. Instead, we allow the user to make a
> query via a web form that retrieves (via Opendap) the data they want. They
> can choose a specific measure of interest and get data over a range
> combining lats, lons, ensemble members, and times.
> This is what I mean by enabling retrieval of subsets.
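
A minimal sketch of what such a subset request could look like from the client side, assuming the server accepts DAP2-style constraint expressions (one `[start:stride:stop]` index range per dimension). The dataset URL, the variable name `air` and the dimension order (time, ensemble member, lat, lon) are made up for illustration and are not taken from the NERSC portal.

```python
from urllib.parse import quote


def hyperslab(start, stop, stride=1):
    """Format one DAP2 index range as [start:stride:stop]; stop is inclusive."""
    if start < 0 or stop < start or stride < 1:
        raise ValueError("need 0 <= start <= stop and stride >= 1")
    return f"[{start}:{stride}:{stop}]"


def subset_url(base_url, variable, ranges, suffix=".ascii"):
    """Build a URL requesting one variable over the given index ranges.

    `ranges` holds one (start, stop) or (start, stop, stride) tuple per
    dimension, in the order the server declares the dimensions.
    """
    constraint = variable + "".join(hyperslab(*r) for r in ranges)
    # Keep brackets and colons readable; escape anything unexpected.
    return f"{base_url}{suffix}?{quote(constraint, safe='[]:')}"


# Hypothetical dataset and variable; assumed order: time, ensemble, lat, lon.
url = subset_url(
    "http://example.org/opendap/temps.nc",
    "air",
    [(0, 11), (0, 3), (40, 50), (100, 120)],
)
print(url)
```

The ranges are array indices, not coordinate values, so a real client would first have to look up which indices correspond to the lats, lons and times of interest.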
>
> The organization by measurement type works for our users, but one might
> want to grab different subsets if one's users were, say, the residents of a
> certain locale who wanted to look at temperature changes in their home
> town. That would not be worth the effort for us, but someone else could
> come along and download the relevant larger subsets and reuse the data in
> that way. I guess what I'm saying is that one has to strike a balance
> between making the data easy to access for the expected uses and enabling
> others to take things further with a little more work.
> -Annette
>
>
> On 1/4/16 10:05 AM, Frans Knibbe wrote:
>
>
>
> 2016-01-04 17:57 GMT+01:00 Jon Blower <j.d.blower@reading.ac.uk>:
>
>> Hi all,
>>
>> [snip]
>>
>> For what it’s worth, I agree with Frans’ view of starting with a very
>> simple approach to subsetting. The first problem to solve is simply how to
>> record that Subset A is a part of Dataset X (and vice versa), assuming that
>> both can be identified (with URIs). In MELODIES we’ve been using
>> dct:isPartOf and dct:hasPart for this, but we’re just experimenting. Both
>> the subset and the parent dataset can be typed as Datasets.
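
A minimal sketch of recording that relation in both directions, assuming both resources already have URIs. It emits plain N-Triples lines using the Dublin Core terms namespace; the example URIs are hypothetical, and typing the resources as datasets is left out because the class to use is still an open question here.

```python
DCT = "http://purl.org/dc/terms/"


def part_of_triples(subset_uri, dataset_uri):
    """Return N-Triples stating the subset relation in both directions."""
    return [
        f"<{subset_uri}> <{DCT}isPartOf> <{dataset_uri}> .",
        f"<{dataset_uri}> <{DCT}hasPart> <{subset_uri}> .",
    ]


# Hypothetical URIs for Dataset X and Subset A.
lines = part_of_triples(
    "http://example.org/dataset/x/subset/a",
    "http://example.org/dataset/x",
)
print("\n".join(lines))
```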
>>
>
> Dct:Dataset, I presume? It is nice to see that a single vocabulary
> already provides the main ingredients. Although perhaps some reasoning is
> required to handle datasets with subsets (if a dataset has dct:isPartOf
> properties it is a subset, if it has dct:hasPart properties it is a
> superset, if it only has dct:hasPart properties it is a root dataset),
> something that perhaps not all data consumers can be expected to perform?
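
The reasoning described above is small enough to sketch. The following assumes the consumer already has the statements as (subject, predicate, object) tuples with prefixed names; the role labels and the inference of the inverse statement (for publishers that state only one direction) are illustrative choices, not part of any vocabulary.

```python
from collections import defaultdict

IS_PART_OF = "dct:isPartOf"
HAS_PART = "dct:hasPart"


def classify(triples):
    """Map each dataset to the roles it plays: subset, superset, root."""
    parents = defaultdict(set)   # dataset -> datasets it is part of
    children = defaultdict(set)  # dataset -> datasets it has as parts
    for subject, predicate, obj in triples:
        if predicate == IS_PART_OF:
            parents[subject].add(obj)
            children[obj].add(subject)   # inferred inverse
        elif predicate == HAS_PART:
            children[subject].add(obj)
            parents[obj].add(subject)    # inferred inverse
    roles = {}
    for dataset in set(parents) | set(children):
        found = set()
        if parents.get(dataset):
            found.add("subset")
        if children.get(dataset):
            found.add("superset")
            if not parents.get(dataset):
                found.add("root")        # has parts, is part of nothing
        roles[dataset] = found
    return roles


# Hypothetical three-level tree: X has part A, A has part A1.
roles = classify([("X", HAS_PART, "A"), ("A", HAS_PART, "A1")])
print(roles)
```

Note that a dataset in the middle of the tree comes out as both a subset and a superset.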
>
> I do wonder if there already is a good way of describing the
> subsetting method, or the fact that multiple subsetting methods are used.
> Perhaps that is not vital information, but it could be useful. Suppose
> there is a dataset containing worldwide temperature values since 1901.
> Both temporal and spatial subdivisions would make sense, so a data provider
> could decide to make subsets available by year and by country. Each
> partitioning tree would encompass all data, so if it is not made clear that
> two partitioning schemes are used for the same dataset a not-so-smart
> consumer might wind up downloading the data twice.
>
> Greetings,
> Frans
>
>
>
>
>
>>
>>
>
>> Cheers,
>> Jon
>>
>>
>>
>> On 4 Jan 2016, at 16:36, Peter Baumann <p.baumann@jacobs-university.de> wrote:
>>
>> Frans-
>>
>> On 2016-01-04 13:48, Frans Knibbe wrote:
>>
>>
>>
>>
>> 2016-01-04 13:20 GMT+01:00 Peter Baumann <p.baumann@jacobs-university.de>:
>>
>>> Hi Frans,
>>>
>>> data partitioning is an implementation detail which serves to determine
>>> the subset more quickly.
>>>
>>
>> But that does not need to be the only reason for dataset partitioning.
>> I can think of some others:
>>
>>    1. It could allow fully automatic retrieval of a complete dataset
>>    without the need for a data dump (in a specific format) and without the
>>    need for a specialized API. That could be very important for building
>>    remote indexes (e.g. by a search engine);
>>
>>
>> IMHO a search engine will not want to download full datasets - and if so,
>> it will not be interested in how such a large dataset is split internally.
>> Compare to an HTML file, who would want to know about the file system
>> blocks it is split into?
>>
>>
>>    2. It allows automatic representation of the data in a human friendly
>>    way (small enough HTML pages with annotation and navigation);
>>
>>
>> while there might be special cases where you have a textual
>> representation this is not the case with most data, such as sets of vectors
>> or pixels. An image is best represented for a human as a visual pixel
>> matrix, and delivering this again is independent from the storage
>> organisation on a server.
>>
>>
>>    3. It allows caching of meaningful and self-describing chunks of
>>    data.
>>
>>
>> not sure what self-describing means in this context, but caching is
>> independent from that. Again, looking at your HTML file do we want to know
>> which of its file system blocks is in cache?
>>
>>
>>
>>
>>> As such, any subsetting or querying interface should remain agnostic of
>>> it, otherwise it runs the risk of supporting a particular implementation
>>> (and there are quite a few specialized implementations out there).
>>>
>>
>> I think the concept of dataset partitioning could be handled in a very
>> general way. Perhaps the only required elements are:
>>
>>    1. express that something (a web resource) is a dataset (and could
>>    therefore be a subset or a superset);
>>    2. express that a dataset is a superset or subset of another dataset.
>>
>>
>> This could work with any type of data and with any query interface, I
>> think.
>>
>>
>> well, we don't know - I have raised the question a couple of times, but
>> it does not find much sympathy: what type of data are we talking about, and
>> what does subsetting mean for each of them?
>>
>> -Peter
>>
>>
>>
>>
>
> --
> Annette Greiner
> NERSC Data and Analytics Services
> Lawrence Berkeley National Laboratory
>
>
>

Received on Tuesday, 5 January 2016 08:30:47 UTC