Re: Exposing datasets with DCAT (partitioning, subsets..) from Andrea Perego on 2016-02-03 (public-sdw-comments@w3.org from February 2016)

From: Andrea Perego <andrea.perego@jrc.ec.europa.eu>
Date: Wed, 03 Feb 2016 10:05:25 +0100
To: Maik Riechert <m.riechert@reading.ac.uk>
Cc: public-sdw-comments@w3.org
Message-id: <56B1C2D5.4010508@jrc.ec.europa.eu>

Many thanks for sharing this work, Maik!

Just a couple of notes from my side:

1. Besides temporal coverage, it may be worth adding spatial coverage 
to your scenarios as another criterion for dataset partitioning. 
In practice, the two criteria are frequently used together.

2. In many of the scenarios you describe, dataset subsets are modelled 
as datasets. An alternative would be to model them just as 
distributions. So, I wonder whether those scenarios have requirements 
that cannot be met by the latter option.

Some more words on point (2):

As you probably know, there has been quite a long discussion in the 
DCAT-AP WG concerning this issue. The main points are probably 
summarised in the conversation recorded here:

https://joinup.ec.europa.eu/asset/dcat_application_profile/issue/mo12-grouping-datasets

Of course, in DCAT-AP the objective was to decide how to describe 
dataset subsets, not to define criteria for dataset subsetting.

Notably, the discussion highlighted two different approaches: (a) 
dataset subsets modelled as datasets or (b) dataset subsets modelled 
simply as distributions.
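
As a rough illustration of approaches (a) and (b), the sketch below 
builds a Turtle description for each. It rests on assumptions that are 
not part of the discussion above: the example.org URIs, the yearly 
NetCDF file names, and the use of dct:hasPart to link a parent dataset 
to its child datasets are all hypothetical choices made for the example.

```python
# Sketch only: URIs, file names and the use of dct:hasPart are
# illustrative assumptions, not something prescribed by DCAT or DCAT-AP.
PREFIXES = (
    "@prefix dcat: <http://www.w3.org/ns/dcat#> .\n"
    "@prefix dct: <http://purl.org/dc/terms/> .\n"
)
BASE = "http://example.org/"  # hypothetical namespace


def subsets_as_datasets(parent, years):
    """Approach (a): every yearly subset is a dcat:Dataset of its own."""
    parts = " , ".join(f"<{BASE}{parent}/{y}>" for y in years)
    lines = [
        PREFIXES,
        f"<{BASE}{parent}> a dcat:Dataset ;",
        f"    dct:hasPart {parts} .",
    ]
    for y in years:
        lines += [
            f"<{BASE}{parent}/{y}> a dcat:Dataset ;",
            f"    dcat:distribution <{BASE}{parent}/{y}/file> .",
            f"<{BASE}{parent}/{y}/file> a dcat:Distribution ;",
            f"    dcat:downloadURL <{BASE}files/{parent}-{y}.nc> .",
        ]
    return "\n".join(lines)


def subsets_as_distributions(parent, years):
    """Approach (b): one dcat:Dataset, one dcat:Distribution per subset."""
    dists = " , ".join(f"<{BASE}{parent}/{y}/file>" for y in years)
    lines = [
        PREFIXES,
        f"<{BASE}{parent}> a dcat:Dataset ;",
        f"    dcat:distribution {dists} .",
    ]
    for y in years:
        lines += [
            f"<{BASE}{parent}/{y}/file> a dcat:Distribution ;",
            f"    dcat:downloadURL <{BASE}files/{parent}-{y}.nc> .",
        ]
    return "\n".join(lines)


turtle_a = subsets_as_datasets("rainfall", [2014, 2015])
turtle_b = subsets_as_distributions("rainfall", [2014, 2015])
print(turtle_a)
print(turtle_b)
```

Both outputs point to the same two files. The difference is only in how 
many dcat:Dataset resources sit between the catalogue entry and the data.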

I don't see the two approaches above as mutually exclusive. You can use 
one or the other depending on your use case and requirements, and you 
can combine them (e.g., referring to point (1): time-related subsets 
modelled as child datasets, and their space-related subsets as 
distributions). However, I personally favour recommending distributions 
as the default option, and using datasets only if you cannot do 
otherwise. In particular, I see two main issues with the 
dataset-based approach:

- It includes an additional step to get to the data (dataset -> dataset 
-> distribution). Moreover, subsetting can be recursive - which 
increases the number of steps needed to get to the data.

- I understand that your focus is on data discovery from a machine 
perspective. However, looking at how this will be reflected in 
catalogues used by people, the result is that you're going to have a 
record for each child dataset, in addition to the parent one. This 
scenario is quite typical nowadays (I know quite a few examples of tens 
of records having the same title, description, etc. - or only slightly 
different ones), and it is no help at all to people trying to find what 
they're looking for.
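
The two issues above can be made concrete with a small, self-contained 
sketch. The classes and helper functions here are hypothetical (they are 
not a DCAT API), and the years, regions and URLs are invented. The code 
simply counts how many catalogue records a modelling choice produces and 
how many links must be followed to get from the top-level record to a 
distribution.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Distribution:
    url: str


@dataclass
class Dataset:
    title: str
    distributions: List[Distribution] = field(default_factory=list)
    parts: List["Dataset"] = field(default_factory=list)  # child datasets


def catalogue_records(ds):
    """One record for this dataset plus one for every descendant dataset."""
    return 1 + sum(catalogue_records(p) for p in ds.parts)


def max_hops_to_data(ds):
    """Longest chain of links from this record to a distribution (0 if none)."""
    hops = [1] if ds.distributions else []
    for part in ds.parts:
        below = max_hops_to_data(part)
        if below:
            hops.append(1 + below)
    return max(hops, default=0)


YEARS = [2014, 2015]          # invented example values
REGIONS = ["north", "south"]  # invented example values


def url(year, region):
    return f"http://example.org/files/{year}-{region}.nc"  # hypothetical


# Everything as distributions of a single dataset.
flat = Dataset("Rainfall", [Distribution(url(y, r)) for y in YEARS for r in REGIONS])

# Mixed: years as child datasets, regions as their distributions.
nested = Dataset("Rainfall", parts=[
    Dataset(f"Rainfall {y}", [Distribution(url(y, r)) for r in REGIONS])
    for y in YEARS
])

# Recursive subsetting: years and regions both modelled as datasets.
recursive = Dataset("Rainfall", parts=[
    Dataset(f"Rainfall {y}", parts=[
        Dataset(f"Rainfall {y} {r}", [Distribution(url(y, r))]) for r in REGIONS
    ])
    for y in YEARS
])

for name, ds in [("flat", flat), ("nested", nested), ("recursive", recursive)]:
    print(name, catalogue_records(ds), max_hops_to_data(ds))
```

With these invented inputs, each extra level of dataset-based subsetting 
adds one hop and multiplies the number of near-identical records, while 
the set of files described stays the same.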

Thanks

Andrea


On 02/02/2016 12:02, Maik Riechert wrote:
> Hi all,
>
> There has been a lot of discussion about subsetting data. I'd like to
> give a slightly different perspective which is purely motivated from the
> point of view of someone who wants to publish data, and in parallel
> someone who wants to discover and access that data without much hassle.
>
> Of course it is hard to think about all scenarios, so I picked what I
> think are common ones:
> - a bunch of static data files without any API
> - an API without static data files
> - both
>
> And then some specific variations on what structure the data has (yearly
> data files, daily, or another dimension used as splitting point, such as
> spatial).
>
> It is in no way final or complete and may even be wrong, but here is
> what I came up with:
> https://github.com/ec-melodies/wp02-dcat/wiki/DCAT-partitioning-ideas
>
> So it always starts by looking at what data exists and how it is
> exposed, and based on those constraints I tried to model that as DCAT
> datasets, sometimes with subdatasets. Again, it is purely motivated from
> a machine-access point of view. There may be other things to consider.
>
> The point of this wiki page is to have something concrete to discuss
> about and not just abstract ideas. It should uncover problems, possibly
> solutions, perspectives... etc.
>
> Happy to hear your thoughts,
> Maik
>

-- 
Andrea Perego, Ph.D.
Scientific / Technical Project Officer
European Commission DG JRC
Institute for Environment & Sustainability
Unit H06 - Digital Earth & Reference Data
Via E. Fermi, 2749 - TP 262
21027 Ispra VA, Italy

https://ec.europa.eu/jrc/
Received on Wednesday, 3 February 2016 09:06:12 UTC