Re: Subsetting data

2016-01-04 17:57 GMT+01:00 Jon Blower <j.d.blower@reading.ac.uk>:

> Hi all,
>
> [snip]
>
> For what it’s worth, I agree with Frans’ view of starting with a very
> simple approach to subsetting. The first problem to solve is simply how to
> record that Subset A is a part of Dataset X (and vice versa), assuming that
> both can be identified (with URIs). In MELODIES we’ve been using
> dct:isPartOf and dct:hasPart for this, but we’re just experimenting. Both
> the subset and the parent dataset can be typed as Datasets.
>

Dct:Dataset, I presume? It is nice to see that a single vocabulary already
provides the main ingredients. Some reasoning may still be required to
handle datasets with subsets: a dataset that has dct:isPartOf properties is
a subset, one that has dct:hasPart properties is a superset, and one that
has only dct:hasPart properties is a root dataset. Perhaps not all data
consumers can be expected to perform that reasoning?
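
As a rough illustration of that reasoning, here is a minimal Python sketch.
It assumes the dct:isPartOf / dct:hasPart statements are already available
as plain (subject, predicate, object) tuples. The example.org URIs and the
classify() helper are made up for illustration and are not part of any
vocabulary or library.

```python
# Dublin Core terms namespace; the dataset URIs below are hypothetical.
DCT = "http://purl.org/dc/terms/"
IS_PART_OF = DCT + "isPartOf"
HAS_PART = DCT + "hasPart"

TRIPLES = [
    ("http://example.org/X", HAS_PART, "http://example.org/A"),
    ("http://example.org/A", IS_PART_OF, "http://example.org/X"),
    ("http://example.org/A", HAS_PART, "http://example.org/A1"),
    ("http://example.org/A1", IS_PART_OF, "http://example.org/A"),
]


def classify(dataset, triples):
    """Return the roles a dataset plays, judged only by its own properties."""
    predicates = {p for s, p, o in triples if s == dataset}
    roles = set()
    if IS_PART_OF in predicates:
        roles.add("subset")
    if HAS_PART in predicates:
        roles.add("superset")
    # A root has parts but is not itself part of anything.
    if HAS_PART in predicates and IS_PART_OF not in predicates:
        roles.add("root")
    return roles


for uri in ("http://example.org/X", "http://example.org/A", "http://example.org/A1"):
    print(uri, sorted(classify(uri, TRIPLES)))
```

Note that this only works if both directions (dct:hasPart and dct:isPartOf)
are actually stated; a consumer that sees only one direction would have to
infer the inverse first.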

I do wonder if there already is a good way of describing the
subsetting method, or the fact that multiple subsetting methods are used.
Perhaps that is not vital information, but it could be useful. Suppose
there is a dataset containing worldwide temperature values since 1901. Both
temporal and spatial subdivisions would make sense, so a data provider
could decide to make subsets available by year and by country. Each
partitioning tree would encompass all data, so if it is not made clear that
two partitioning schemes are used for the same dataset, a not-so-smart
consumer might wind up downloading the data twice.
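
To make the concern concrete, here is a small Python sketch of what a
consumer could do if each subset were labelled with the partitioning scheme
it belongs to. The scheme labels ("by-year", "by-country"), the example.org
URIs and both helper functions are assumptions for illustration only; I am
not aware of an agreed property for recording the scheme.

```python
from collections import defaultdict

PARENT = "http://example.org/temperature"

# (subset URI, parent URI, partitioning scheme); all values are hypothetical.
SUBSETS = [
    (PARENT + "/year/1901", PARENT, "by-year"),
    (PARENT + "/year/1902", PARENT, "by-year"),
    (PARENT + "/country/NL", PARENT, "by-country"),
    (PARENT + "/country/DE", PARENT, "by-country"),
]


def subsets_by_scheme(parent, subsets):
    """Group the direct subsets of a dataset by partitioning scheme."""
    grouped = defaultdict(list)
    for uri, parent_uri, scheme in subsets:
        if parent_uri == parent:
            grouped[scheme].append(uri)
    return dict(grouped)


def download_plan(parent, subsets, preferred=None):
    """Pick the subsets of exactly one scheme, so no data is fetched twice."""
    grouped = subsets_by_scheme(parent, subsets)
    if not grouped:
        return [parent]  # no known subsets: fetch the dataset itself
    # Fall back to the alphabetically first scheme if no preference matches.
    scheme = preferred if preferred in grouped else sorted(grouped)[0]
    return grouped[scheme]


print(download_plan(PARENT, SUBSETS, preferred="by-year"))
```

Without the scheme label, the consumer would see four subsets of the same
parent and have no way of telling that two of them already cover everything.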

Greetings,
Frans





> Cheers,
> Jon
>
>
>
> On 4 Jan 2016, at 16:36, Peter Baumann <p.baumann@jacobs-university.de>
> wrote:
>
> Frans-
>
> On 2016-01-04 13:48, Frans Knibbe wrote:
>
> 2016-01-04 13:20 GMT+01:00 Peter Baumann <p.baumann@jacobs-university.de>:
>
>> Hi Frans,
>>
>> data partitioning is an implementation detail which serves to determine
>> the subset more quickly.
>>
>
> But that does not need to be the only reason for dataset partitioning. I
> can think of some others:
>
>    1. It could allow fully automatic retrieval of a complete dataset
>    without the need for a data dump (in a specific format) and without the
>    need for a specialized API. That could be very important for building
>    remote indexes (e.g. by a search engine);
>
>
> IMHO a search engine will not want to download full datasets - and if so,
> it will not be interested in how such a large dataset is split internally.
> Compare to an HTML file: who would want to know about the file system
> blocks it is split into?
>
>
>    2. It allows automatic representation of the data in a human friendly
>    way (small enough HTML pages with annotation and navigation);
>
>
> while there might be special cases where you have a textual representation,
> this is not the case with most data, such as sets of vectors or pixels. An
> image is best represented for a human as a visual pixel matrix, and
> delivering this again is independent from the storage organisation on a
> server.
>
>
>    3. It allows caching of meaningful and self-describing chunks of data.
>
>
> not sure what self-describing means in this context, but caching is
> independent from that. Again, looking at your HTML file, do we want to know
> which of its file system blocks is in cache?
>
>> As such, any subsetting or querying interface should remain agnostic of
>> it, otherwise it runs the risk of supporting a particular implementation
>> (and there are quite a few specialized implementations out there).
>>
>
> I think the concept of dataset partitioning could be handled in a very
> general way. Perhaps the only required elements are:
>
>    1. express that something (a web resource) is a dataset (and could
>    therefore be a subset or a superset);
>    2. express that a dataset is a superset or subset of another dataset.
>
> This could work with any type of data and with any query interface, I
> think.
>
>
> well, we don't know - I have raised the question a couple of times, but it
> does not find much sympathy: what type of data are we talking about, and
> what does subsetting mean for each of them?
>
> -Peter
>

Received on Monday, 4 January 2016 18:05:35 UTC