Re: Exposing datasets with DCAT (partitioning, subsets..) from Maik Riechert on 2016-02-03 (public-sdw-comments@w3.org from February 2016)

From: Maik Riechert <m.riechert@reading.ac.uk>
Date: Wed, 3 Feb 2016 11:34:47 +0000
To: Frans Knibbe <frans.knibbe@geodan.nl>, Andrea Perego <andrea.perego@jrc.ec.europa.eu>
Cc: public-sdw-comments@w3.org
Message-ID: <56B1E5D7.8060105@reading.ac.uk>

Hi Frans,

In my opinion, it all depends on how the actual data is made available. 
If it's a nice (possibly standard) API, then just link that as a 
distribution and you're done I would say. Clients can explore subsets 
etc. through that API (which in itself should be self-describing and 
doesn't need any further metadata at the Distribution level, except 
media type if possible).
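
A rough sketch of what that API case could look like as a DCAT record, written as JSON-LD from Python. The dataset URI, the API URL, the title and the "application/json" media type are all made-up placeholders, not taken from any real catalogue; only the dcat: and dct: terms themselves are existing vocabulary.

```python
import json

# Hypothetical identifiers and URLs: nothing here comes from a real catalogue.
DATASET_URI = "http://example.org/dataset/temperature"
API_URL = "http://example.org/api/temperature"


def api_dataset() -> dict:
    """Describe one dcat:Dataset whose only distribution is an API endpoint."""
    return {
        "@context": {
            "dcat": "http://www.w3.org/ns/dcat#",
            "dct": "http://purl.org/dc/terms/",
        },
        "@id": DATASET_URI,
        "@type": "dcat:Dataset",
        "dct:title": "Temperature observations",
        "dcat:distribution": [
            {
                "@type": "dcat:Distribution",
                "dcat:accessURL": {"@id": API_URL},
                # The media type is the only extra hint; the value is assumed.
                "dcat:mediaType": "application/json",
            }
        ],
    }


if __name__ == "__main__":
    print(json.dumps(api_dataset(), indent=2))
```

Everything about subsets is left to the API itself here, so the record stays at a single dataset with a single distribution.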

However, if you really *just* have a bunch of files, as is quite common 
and which may be ok depending on data volume and intended users, then it 
gets more complicated if you want to allow efficient machine access 
without first being forced to download everything.

So, yes, partitioning is intrinsic to the dataset, and that detail is 
exposed to DCAT to allow more efficient access, both for humans and 
machines. It is an optimization in the end, but in my opinion a quite 
useful one.
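
To illustrate the optimization, here is a minimal sketch of a client that picks only the yearly files it needs once the partitioning is visible in the metadata. The Subset class, the file naming scheme and the example.org URLs are assumptions for illustration; in a real record each subset might be a child dataset (for instance linked with dct:hasPart) carrying its temporal coverage and a dcat:downloadURL.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Subset:
    """One partition of a file-based dataset (stands in for a child dataset)."""
    year: int            # temporal coverage, simplified to a single year
    download_url: str    # would be dcat:downloadURL on the distribution


def yearly_subsets(base_url: str, first: int, last: int) -> list[Subset]:
    """Build hypothetical per-year subsets; the file naming scheme is made up."""
    return [Subset(y, f"{base_url}/data_{y}.nc") for y in range(first, last + 1)]


def select(subsets: list[Subset], start: int, end: int) -> list[str]:
    """Return only the download URLs whose year falls in [start, end]."""
    return [s.download_url for s in subsets if start <= s.year <= end]


if __name__ == "__main__":
    catalogue = yearly_subsets("http://example.org/files", 2000, 2015)
    for url in select(catalogue, 2013, 2014):
        print(url)
```

Without the partition metadata, the same client would have to download all sixteen files and filter locally.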

I wonder how many different partition strategies are really used in the 
wild.

Cheers
Maik

On 03.02.2016 at 11:13, Frans Knibbe wrote:
> Hello Andrea, all,
>
> I like to think about dataset partitioning as something simple, needing 
> only three semantic ingredients: being able to say that a resource is 
> a dataset, and being able to point to subsets and supersets. DCAT does 
> not seem necessary for those three. Is there really a need to see 
> dataset partitioning as DCAT territory? DCAT is a vocabulary for data 
> catalogs, I see dataset partitioning as something intrinsic to the 
> dataset - its structure.
>
> That said, data about the structure of a dataset is metadata so it is 
> interesting to think about how data and metadata are coupled. For easy 
> navigation through the structure (by either man or machine) it is 
> probably best to keep the data volume small - metadata only. But it 
> would be nice to have the option to get the actual data from any 
> dataset (at any structural level). That means that additional elements 
> are needed: an indication of ways to get the actual data, 
> dcat:Distribution for instance. Also an indication of size of the 
> actual data would be very useful, to help decide to get the data or to 
> dig a bit deeper for smaller subsets. Only at the deepest level of the 
> structure, the leaves of the tree, could the actual data be returned 
> by default. A friendly data provider will take care that those subsets 
> contain manageable volumes of data.
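
The size-based decision described above can be sketched as a small tree walk. This is only an illustration under assumptions: the Node class, the names and the byte counts are hypothetical, byte_size stands in for something like dcat:byteSize, and the wanted callback stands in for whatever spatial or temporal filter a client applies.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Node:
    """A dataset at some level of the partition tree (hypothetical structure)."""
    name: str
    byte_size: int                      # would correspond to dcat:byteSize
    children: list["Node"] = field(default_factory=list)


def plan_downloads(node: Node, max_bytes: int,
                   wanted: Callable[[Node], bool]) -> list[str]:
    """Take a dataset whole if it is small enough or a leaf, else descend."""
    if not wanted(node):
        return []
    if node.byte_size <= max_bytes or not node.children:
        return [node.name]
    names: list[str] = []
    for child in node.children:
        names.extend(plan_downloads(child, max_bytes, wanted))
    return names


if __name__ == "__main__":
    europe = Node("europe", 1500, [Node("nl", 500), Node("de", 1000)])
    world = Node("world", 4000, [europe, Node("asia", 2500)])
    print(plan_downloads(world, 1000, lambda n: n.name != "asia"))
```

Leaves are always returned regardless of size, which matches the idea that the provider keeps the smallest subsets manageable.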
>
> My thoughts have little basis in practice, but I am trying to set up 
> an experiment with spatially partitioned data. I think there are many 
> interesting possibilities. I hope to be able to share something 
> practical with the group soon.
>
> Regards,
> Frans
>
> 2016-02-03 10:05 GMT+01:00 Andrea Perego <andrea.perego@jrc.ec.europa.eu>:
>
>     Many thanks for sharing this work, Maik!
>
>     Just a couple of notes from my side:
>
>     1. Besides temporal coverage, it may be worth adding in your
>     scenarios also spatial coverage as another criterion of dataset
>     partitioning. Actually, both criteria are frequently used
>     concurrently.
>
>     2. In many of the scenarios you describe, dataset subsets are
>     modelled as datasets. An alternative would be to model them just
>     as distributions. So, I wonder whether those scenarios have
>     requirements that cannot be met by the latter option.
>
>     Some more words on point (2):
>
>     As you probably know, there has been quite a long discussion in
>     the DCAT-AP WG concerning this issue. The main points are probably
>     summarised in the conversation recorded here:
>
>     https://joinup.ec.europa.eu/asset/dcat_application_profile/issue/mo12-grouping-datasets
>
>     Of course, in DCAT-AP the objective was to describe dataset
>     subsets, not to define criteria for dataset subsetting.
>
>     Notably, the discussion highlighted two different approaches: (a)
>     dataset subsets modelled as datasets or (b) dataset subsets
>     modelled simply as distributions.
>
>     I don't see the two scenarios above as mutually exclusive. You can
>     use one or the other depending on your use case and requirements.
>     And you can use both (e.g., referring to point (1): time-related
>     subsets modelled as child datasets, and their space-related
>     subsets as distributions). However, I personally favour the idea
>     of using distributions as the recommended option, and datasets
>     only if you cannot do otherwise. In particular, I see two main
>     issues with the dataset-based approach:
>
>     - It includes an additional step to get to the data (dataset ->
>     dataset -> distribution). Moreover, subsetting can be recursive -
>     which increases the number of steps needed to get to the data.
>
>     - I understand that your focus is on data discovery from a machine
>     perspective. However, looking at how this will be reflected in
>     catalogues used by people, the result is that you're going to have
>     a record for each child dataset, in addition to the parent one.
>     This scenario is quite typical nowadays (I know quite a few
>     examples of tens of records having the same title, description,
>     etc. - or just a slightly different one), and it doesn't help at
>     all people trying to find what they're looking for.
>
>     Thanks
>
>     Andrea
>
>
>
>     On 02/02/2016 12:02, Maik Riechert wrote:
>
>         Hi all,
>
>         There has been a lot of discussion about subsetting data. I'd like to
>         give a slightly different perspective which is purely motivated from the
>         point of view of someone who wants to publish data, and in parallel
>         someone who wants to discover and access that data without much hassle.
>
>         Of course it is hard to think about all scenarios, so I picked what I
>         think are common ones:
>         - a bunch of static data files without any API
>         - an API without static data files
>         - both
>
>         And then some specific variations on what structure the data has (yearly
>         data files, daily, or another dimension used as splitting point, such as
>         spatial).
>
>         It is in no way final or complete and may even be wrong, but here is
>         what I came up with:
>         https://github.com/ec-melodies/wp02-dcat/wiki/DCAT-partitioning-ideas
>
>         So it always starts by looking at what data exists and how it is
>         exposed, and based on those constraints I tried to model that as DCAT
>         datasets, sometimes with subdatasets. Again, it is purely motivated from
>         a machine-access point of view. There may be other things to consider.
>
>         The point of this wiki page is to have something concrete to discuss
>         and not just abstract ideas. It should uncover problems, possibly
>         solutions, perspectives... etc.
>
>         Happy to hear your thoughts,
>         Maik
>
>
>     -- 
>     Andrea Perego, Ph.D.
>     Scientific / Technical Project Officer
>     European Commission DG JRC
>     Institute for Environment & Sustainability
>     Unit H06 - Digital Earth & Reference Data
>     Via E. Fermi, 2749 - TP 262
>     21027 Ispra VA, Italy
>
>     https://ec.europa.eu/jrc/
>
>
Received on Wednesday, 3 February 2016 11:35:21 UTC