Re: Subsetting data

This is a great discussion - and also highlights why some form of BP
guidance is going to be vital.

I think there is an underlying architectural concern here, which is that
identified subsets should be linkable using appropriate vocabularies,
since this gives us the basic pattern: metadata about a subset can be
discovered and retrieved independently of the actual data, and query
endpoints can be discovered as well.
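
As a minimal sketch of that pattern (all URIs are invented, and the
choice of VoID terms here is mine, purely for illustration), the
metadata for a subset might look like:

  @prefix dct:  <http://purl.org/dc/terms/> .
  @prefix void: <http://rdfs.org/ns/void#> .

  # Hypothetical URIs - the parent links down, the subset links up.
  <http://example.org/dataset/global> a void:Dataset ;
      dct:hasPart <http://example.org/dataset/global/au> .

  # The subset's description carries its own access points.
  <http://example.org/dataset/global/au> a void:Dataset ;
      dct:isPartOf <http://example.org/dataset/global> ;
      void:sparqlEndpoint <http://example.org.au/sparql> ;
      void:dataDump <http://example.org.au/dumps/au.nt> .

A client that dereferences the subset URI gets the description and the
query endpoint without ever touching the data itself.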

I think traversal logic will be bound to the semantics of the linking
language - only things that are expressed can be acted on - so the
expressivity of the dataset linking vocabulary determines the use cases
supported by the deployed content (data + metadata). We don't want to
limit that functionality - but neither do we want the same functionality
implemented in many different ways, with the information space
fragmented into islands around each implementation choice.

The choice of vocabulary is an implementation detail, but perhaps a BP
should highlight available vocabularies and what they are good for.

For example, DCT+VoID+RDF-QB+PROV (plus a few other little bits to help
bind LDA to REST) is the set of vocabularies I have used to describe a
graph of heterogeneous data sources - but I have been scratching around
looking for an elegant way to handle the subsetting issue - hence
following the BP discussion. "Virtual global datasets" whose parts are
delegated to national agencies are a very common pattern out there (and
one repeated at different scales, potentially creating a huge network of
data that would dwarf the current LOD cloud :-)
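
To make that concrete, here is a hedged sketch (all URIs and agencies
are invented for illustration) of how those vocabularies might combine
to describe a virtual global dataset delegated to national agencies:

  @prefix dct:  <http://purl.org/dc/terms/> .
  @prefix void: <http://rdfs.org/ns/void#> .
  @prefix prov: <http://www.w3.org/ns/prov#> .
  @prefix qb:   <http://purl.org/linked-data/cube#> .

  # The virtual whole: no data of its own, only links to the parts.
  <http://example.org/temperature/global>
      a qb:DataSet, void:Dataset ;
      dct:hasPart <http://example.org/temperature/fr> ,
                  <http://example.org/temperature/de> .

  # One national part, with its publisher and provenance.
  <http://example.org/temperature/fr>
      a qb:DataSet, void:Dataset ;
      dct:isPartOf <http://example.org/temperature/global> ;
      dct:publisher <http://example.org/agency/fr> ;
      prov:wasAttributedTo <http://example.org/agency/fr> .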

AFAICT there is no "adequate practice" out there - let alone a BP. (I
also need to work out whether reasoning with OWL over heterogeneous
linking vocabularies is tenable at run-time, or whether these need to be
mapped to a single linking vocabulary in an implementation - I think
this is a question of whether the graph can be traversed at run-time or
needs to be built by crawling.)
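
For instance (a sketch only, assuming SPARQL 1.1 property paths and a
small known set of linking predicates), the run-time option might reduce
to a query like:

  PREFIX dct:  <http://purl.org/dc/terms/>
  PREFIX void: <http://rdfs.org/ns/void#>

  # Find every subset of a dataset, whichever of these linking
  # predicates was used, following them transitively.
  SELECT ?subset WHERE {
    <http://example.org/dataset/global>
        (dct:hasPart | ^dct:isPartOf | void:subset)+ ?subset .
  }

The crawling option would instead materialise those heterogeneous links
into a single predicate ahead of time (e.g. declaring each one a
rdfs:subPropertyOf of a common property and storing the inferences).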

So I don't think it's impossible to recognise the common nature of the
architectural patterns, the existing options, and the need for
flexibility in implementation. But IMHO it is complex enough that the
general community is going to need a BP - and probably examples and
tools - before much useful functionality emerges in a coherent enough
way for anything (human or machine) to traverse it.

Rob Atkinson

On Tue, 5 Jan 2016 at 06:14 Annette Greiner <amgreiner@lbl.gov> wrote:

> Funny you should mention worldwide temperature data since 1901. That
> exists at NERSC, and it's even more complex.
> http://portal.nersc.gov/project/20C_Reanalysis/
> contains data with many more potential organizing variables than lat/lon
> and time. Making every potential organization of the data available
> separately would be a waste of time. Instead, we allow the user to make a
> query via a web form that retrieves (via Opendap) the data they want. They
> can choose a specific measure of interest and get data over a range
> combining lats, lons, ensemble members, and times.
> This is what I mean by enabling retrieval of subsets.
>
> The organization by measurement type works for our users, but one might
> want to grab different subsets if one's users were, say, the residents of a
> certain locale who wanted to look at temperature changes in their home
> town. That would not be worth the effort for us, but someone else could
> come along and download the relevant larger subsets and reuse the data in
> that way. I guess what I'm saying is that one has to strike a balance
> between making the data easy to access for the expected uses and enabling
> others to take things further with a little more work.
> -Annette
>
>
> On 1/4/16 10:05 AM, Frans Knibbe wrote:
>
>
>
> 2016-01-04 17:57 GMT+01:00 Jon Blower <j.d.blower@reading.ac.uk>:
>
>> Hi all,
>>
>> [snip]
>>
>> For what it’s worth, I agree with Frans’ view of starting with a very
>> simple approach to subsetting. The first problem to solve is simply how to
>> record that Subset A is a part of Dataset X (and vice versa), assuming that
>> both can be identified (with URIs). In MELODIES we’ve been using
>> dct:isPartOf and dct:hasPart for this, but we’re just experimenting. Both
>> the subset and the parent dataset can be typed as Datasets.
>>
>
> dct:Dataset, I presume? It is nice to see that a single vocabulary
> already provides the main ingredients. Although perhaps some reasoning is
> required to handle datasets with subsets (if a dataset has dct:isPartOf
> properties it is a subset, if it has dct:hasPart properties it is a
> superset, if it only has dct:hasPart properties it is a root dataset),
> something that perhaps not all data consumers can be expected to perform?
>
> I do wonder if there already is a good way of describing the
> subsetting method, or the fact that multiple subsetting methods are used.
> Perhaps that is not vital information, but it could be useful. Suppose
> there is a dataset containing worldwide temperature values since 1901.
> Both temporal and spatial subdivisions would make sense, so a data provider
> could decide to make subsets available by year and by country. Each
> partitioning tree would encompass all data, so if it is not made clear that
> two partitioning schemes are used for the same dataset, a not-so-smart
> consumer might wind up downloading the data twice.
>
> Greetings,
> Frans
>
>
>
>
>
>>
>>
>
>> Cheers,
>> Jon
>>
>>
>>
>> On 4 Jan 2016, at 16:36, Peter Baumann <p.baumann@jacobs-university.de>
>> wrote:
>>
>> Frans-
>>
>> On 2016-01-04 13:48, Frans Knibbe wrote:
>>
>>
>>
>>
>> 2016-01-04 13:20 GMT+01:00 Peter Baumann <p.baumann@jacobs-university.de>
>> :
>>
>>> Hi Frans,
>>>
>>> data partitioning is an implementation detail which serves to determine
>>> the subset more quickly.
>>>
>>
>> But that need not be the only reason for dataset partitioning.
>> I can think of some others:
>>
>>    1. It could allow fully automatic retrieval of a complete dataset
>>    without the need for a data dump (in a specific format) and without the
>>    need for a specialized API. That could be very important for building
>>    remote indexes (e.g. by a search engine);
>>
>>
>> IMHO a search engine will not want to download full datasets - and even
>> if it did, it would not be interested in how such a large dataset is
>> split internally. Compare with an HTML file: who would want to know about
>> the file system blocks it is split into?
>>
>>
>>    2. It allows automatic representation of the data in a human friendly
>>    way (small enough HTML pages with annotation and navigation);
>>
>>
>> While there might be special cases where you have a textual
>> representation, this is not the case with most data, such as sets of
>> vectors or pixels. An image is best represented for a human as a visual
>> pixel matrix, and delivering it is, again, independent of the storage
>> organisation on a server.
>>
>>
>>    3. It allows caching of meaningful and self-describing chunks of
>>    data.
>>
>>
>> I'm not sure what self-describing means in this context, but caching is
>> independent of that. Again, looking at your HTML file, do we want to know
>> which of its file system blocks are in cache?
>>
>>
>>
>>
>>> As such, any subsetting or querying interface should remain agnostic of
>>> it, otherwise it runs the risk of supporting a particular implementation
>>> (and there are quite a few specialized implementations out there).
>>>
>>
>> I think the concept of dataset partitioning could be handled in a very
>> general way. Perhaps the only required elements are:
>>
>>    1. express that something (a web resource) is a dataset (and could
>>    therefore be a subset or a superset);
>>    2. express that a dataset is a superset or subset of another dataset.
>>
>>
>> This could work with any type of data and with any query interface, I
>> think.
>>
>>
>> Well, we don't know - I have raised the question a couple of times, but
>> it has not found much sympathy: what types of data are we talking about,
>> and what does subsetting mean for each of them?
>>
>> -Peter
>>
>>
>>
>
> --
> Annette Greiner
> NERSC Data and Analytics Services
> Lawrence Berkeley National Laboratory
>
>
>

Received on Monday, 4 January 2016 21:01:46 UTC