Re: Subsetting data

Hi Jon,

On 2016-01-04 17:57, Jon Blower wrote:
> Hi all,
>
> Interesting discussion - but I think Peter and Frans are talking at cross
> purposes here. I think Frans meant “data partitioning” in the most general
> sense of splitting up a dataset into subsets. I think Peter has taken this
> term to mean the low-level partitioning of data on a filesystem or in a
> database. (Apologies if I’ve misunderstood either of you!)

Hm, can we have several ways of subsetting simultaneously? Take a satellite
image map as a simple example. We should clarify whether we
1 - want to put pixels into different places (RAM vs. disk = caching, different
places on disk = tiling, different nodes = distribution, etc.), or
2 - want to impose a _logical_ structure on a physically contiguous object. This
means sets of links containing subsetting directives, as in our discussion about
"subsetting URIs". A partOf is fine for this, as you describe with MELODIES.

While (1) and (2) can technically be combined, mixing them complicates the
discussion even further.
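
To illustrate what (2) could look like without touching (1), here is a minimal
sketch in Python. It assumes the image is one contiguous in-memory array and
that a "subsetting link" is just a parent URI plus row and column ranges. The
class SubsetLink, the function resolve and the example.org URI are hypothetical
names invented for this illustration, not anything agreed in the group.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SubsetLink:
    """Hypothetical logical subset: a parent URI plus a subsetting directive."""
    parent_uri: str
    rows: range
    cols: range


def resolve(link, image):
    """Apply the directive to the image; storage layout plays no role here."""
    return [[image[r][c] for c in link.cols] for r in link.rows]


# A tiny 3x4 stand-in for a satellite image; pixel value = row * 10 + column.
image = [[r * 10 + c for c in range(4)] for r in range(3)]

# Assumed example URI; the link says "rows 0-1, columns 1-2 of that image".
link = SubsetLink("http://example.org/image/1", range(0, 2), range(1, 3))
patch = resolve(link, image)
```

Whether the server keeps the pixels tiled, cached or distributed does not
appear anywhere in the link, which is the separation argued for above.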

>
> For what it’s worth, I agree with Frans’ view of starting with a very simple
> approach to subsetting. The first problem to solve is simply how to record
> that Subset A is a part of Dataset X (and vice versa), assuming that both can
> be identified (with URIs). In MELODIES we’ve been using dct:isPartOf and
> dct:hasPart for this, but we’re just experimenting. Both the subset and the
> parent dataset can be typed as Datasets.
>
> I think search engines would want to know about this kind of partitioning. It
> would help them to group subsets underneath parent datasets and avoid the
> problem of search results being dominated by lots of “fragments” (which
> happens in some cases, e.g. some of the resources in the GEOSS).
>
> Query languages/interfaces are interesting, but I think they are a separate
> kind of discussion. Just having a BP for expressing whole-part relationships
> would be a step forward. A query is one way to arrive at a part, but there are
> lots of rabbit-holes and details we don’t need to get into for a “first cut”,
> in my opinion.

This whole discussion remains vague and never touches ground. It started with
(2) and now involves (1) as well, without having concluded on (2). I fully
agree with your view, and have tried to motivate it. I am looking forward to
the group establishing, as a basis, what subsetting should mean. Is it partOf?
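
To make the partOf reading concrete, here is a minimal sketch in Python that
records whole-part relationships as plain (subject, predicate, object) triples
using dct:isPartOf and dct:hasPart. The example.org URIs and the helper names
link_subset and group_by_parent are hypothetical, and typing both resources as
dcat:Dataset is an assumption, since the thread only says "typed as Datasets".

```python
DCT = "http://purl.org/dc/terms/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
# Assumption: the thread does not say which Dataset class is meant.
DCAT_DATASET = "http://www.w3.org/ns/dcat#Dataset"


def link_subset(parent, subset):
    """Return the triples stating that subset is part of parent, both ways."""
    return {
        (parent, RDF_TYPE, DCAT_DATASET),
        (subset, RDF_TYPE, DCAT_DATASET),
        (subset, DCT + "isPartOf", parent),
        (parent, DCT + "hasPart", subset),
    }


def group_by_parent(triples):
    """Group subsets under their parent, as a search engine might do."""
    groups = {}
    for s, p, o in triples:
        if p == DCT + "isPartOf":
            groups.setdefault(o, []).append(s)
    return {parent: sorted(parts) for parent, parts in groups.items()}


# Hypothetical identifiers for Dataset X and its Subsets A and B.
X = "http://example.org/dataset/X"
A = "http://example.org/dataset/X/subset/A"
B = "http://example.org/dataset/X/subset/B"

triples = link_subset(X, A) | link_subset(X, B)
grouped = group_by_parent(triples)
```

Note that nothing here says how a subset was obtained or where it is stored;
it only records that the relationship exists.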

-Peter

>
> Cheers,
> Jon
>
>
>
>> On 4 Jan 2016, at 16:36, Peter Baumann <p.baumann@jacobs-university.de> wrote:
>>
>> Frans-
>>
>> On 2016-01-04 13:48, Frans Knibbe wrote:
>>>
>>>
>>>
>>> 2016-01-04 13:20 GMT+01:00 Peter Baumann <p.baumann@jacobs-university.de>:
>>>
>>>     Hi Frans,
>>>
>>>     data partitioning is an implementation detail which serves to determine
>>>     the subset more quickly.
>>>
>>>
>>> But that does not need to be the only reason for dataset partitioning. I
>>> can think of some others:
>>>
>>>  1. It could allow complete automatic retrieval of a complete dataset
>>>     without the need for a data dump (in a specific format) and without the
>>>     need for a specialized API. That could be very important for building
>>>     remote indexes (e.g. by a search engine);
>>>
>>
>> IMHO a search engine will not want to download full datasets - and if so, it
>> will not be interested in how such a large dataset is split internally.
>> Compare to an HTML file: who would want to know about the file system blocks
>> it is split into?
>>
>>>  2. It allows automatic representation of the data in a human friendly way
>>>     (small enough HTML pages with annotation and navigation);
>>>
>>
>> While there might be special cases where you have a textual representation,
>> this is not the case with most data, such as sets of vectors or pixels. An
>> image is best represented for a human as a visual pixel matrix, and
>> delivering this is again independent of the storage organisation on a server.
>>
>>>  3. It allows caching of meaningful and self-describing chunks of data.
>>>
>>
>> Not sure what self-describing means in this context, but caching is
>> independent of that. Again, looking at your HTML file, do we want to know
>> which of its file system blocks is in cache?
>>
>>
>>>  
>>>
>>>     As such, any subsetting or querying interface should remain agnostic of
>>>     it, otherwise it runs the risk of supporting a particular implementation
>>>     (and there are quite a few specialized implementations out there).
>>>
>>>
>>> I think the concept of dataset partitioning could be handled in a very
>>> general way. Perhaps the only required elements are:
>>>
>>>  1. express that something (a web resource) is a dataset (and could
>>>     therefore be a subset or a superset);
>>>  2. express that a dataset is a superset or subset of another dataset. 
>>>
>>> This could work with any type of data and with any query interface, I think.
>>
>> well, we don't know - I have raised the question a couple of times, but it
>> does not find much sympathy: what type of data are we talking about, and what
>> does subsetting mean for each of them?
>>
>> -Peter
>>
>>>
>>> Re

-- 
Dr. Peter Baumann
 - Professor of Computer Science, Jacobs University Bremen
   www.faculty.jacobs-university.de/pbaumann
   mail: p.baumann@jacobs-university.de
   tel: +49-421-200-3178, fax: +49-421-200-493178
 - Executive Director, rasdaman GmbH Bremen (HRB 26793)
   www.rasdaman.com, mail: baumann@rasdaman.com
   tel: 0800-rasdaman, fax: 0800-rasdafax, mobile: +49-173-5837882
"Si forte in alienas manus oberraverit hec peregrina epistola incertis ventis dimissa, sed Deo commendata, precamur ut ei reddatur cui soli destinata, nec preripiat quisquam non sibi parata." (mail disclaimer, AD 1083)

Received on Tuesday, 5 January 2016 10:36:37 UTC