- From: Rob Atkinson <rob@metalinkage.com.au>
- Date: Wed, 10 Feb 2016 21:58:12 +0000
- To: Maik Riechert <m.riechert@reading.ac.uk>, Rob Atkinson <rob@metalinkage.com.au>, Lewis John Mcgibbney <lewis.mcgibbney@gmail.com>, Jon Blower <j.d.blower@reading.ac.uk>
- Cc: Frans Knibbe <frans.knibbe@geodan.nl>, Andrea Perego <andrea.perego@jrc.ec.europa.eu>, "public-sdw-comments@w3.org" <public-sdw-comments@w3.org>
- Message-ID: <CACfF9LxKGDV9vOejN8hmu4Gt0WtOveYvi7CPzMo4Fpoa3qit7w@mail.gmail.com>
I was really thinking about being parsimonious with the information management - having the smallest number of manual (meta)data curation tasks and the maximum consistency and usefulness in the derived information. This smacks of OLAP with data warehouses to meet user needs, but highly normalised transactional backends to manage data in the fastest, most reliable and cheapest way.

How the DCAT description is best realised is a separate issue - and this depends on what operations clients expect to be able to do using it (what are the Use Cases here?)

Rob

On Wed, 10 Feb 2016 at 23:07 Maik Riechert <m.riechert@reading.ac.uk> wrote:

> See below
>
> On 04/02/2016 03:39, Rob Atkinson wrote:
>
> As a straw man...
>
> Let's nicely describe a dimensional dataset (i.e. we can subset on ranges on any dimension) - it's kind of nice to use RDF-QB for this, as we can describe dimensions using SKOS, OWL etc. - all very powerful and a lot more useful than DCAT for machines to use the data.
>
> (If DCAT is for cataloguing and discovery, then we should not overload it with the description that RDF-QB can provide.)
>
> So let's say we generate a dataset on the fly via an API (pre-prepared subsets provided as files are just a case of doing this at a different point in the delivery chain).
>
> I would think it would be possible to take a DCAT and an RDF-QB description and generate a DCAT description for each subset - provided your description of the dimension is good enough to define the granularity of access. So the question of how to do it might boil down to: is there enough information to generate a new DCAT record on the fly?
>
> Just for clarification: with "DCAT record", do you mean a DCAT dataset or a DCAT distribution? I would say the latter.
>
> Cheers
> Maik
>
> This needs more thought than I am giving it here - but I would have thought there should be enough information in such a DCAT record to:
> a) distinguish it from other subsets, and allow a search using the dimensions of the original dataset to find the DCAT record in a large catalog;
> b) collate such subsets and rebuild the original data cube and its metadata (i.e. the domain of each dimension of the subset is retained, but its range is made explicit);
> c) define how it relates to the original dataset and the methods used to subset the data - to make it possible to re-create the dataset.
>
> If DCAT can be used safely in these modes, then how to use DCAT to describe data subsets should be clear. If you cannot support these approaches, then IMHO you are better off avoiding DCAT and treating subsets as datasets - and moving to a different information model designed explicitly for this.
>
> Rob Atkinson
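A minimal Turtle sketch of the idea above - a DCAT record generated on the fly for one subset of an RDF-QB cube. All URIs are hypothetical, and dct:isPartOf / dct:temporal are stand-ins for whatever predicates are eventually agreed, not an established pattern:

    @prefix dcat: <http://www.w3.org/ns/dcat#> .
    @prefix dct:  <http://purl.org/dc/terms/> .
    @prefix qb:   <http://purl.org/linked-data/cube#> .
    @prefix ex:   <http://example.org/> .

    # Hypothetical source cube; its structure definition declares a time dimension.
    ex:temperature-cube a qb:DataSet ;
        qb:structure ex:temperature-dsd .

    # DCAT record generated on the fly for one subset of that cube:
    ex:temperature-2015 a dcat:Dataset ;
        dct:title "Temperature observations - 2015 subset"@en ;
        dct:isPartOf ex:temperature-cube ;   # (c) relation to the original dataset
        dct:temporal ex:year2015 .           # (a)/(b) the subsetted dimension,
                                             #         its range made explicit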
> On Thu, 4 Feb 2016 at 04:55 Lewis John Mcgibbney <lewis.mcgibbney@gmail.com> wrote:
>
>> Hi Jon,
>> I agree completely here. Many times we are 'forced' to partition data due to availability of, or improvements in, query techniques... or simply requests from customers!
>> Our data(set) partitioning strategies are dependent on a multitude of data modeling assumptions and decisions. They can also be determined by the hardware and software we are using to persist and query the data.
>> Lewis
>>
>> On Wed, Feb 3, 2016 at 4:56 AM, Jon Blower <j.d.blower@reading.ac.uk> wrote:
>>
>>> Hi all,
>>>
>>> Just to chip in - I think that dataset partitioning is *not* (necessarily) intrinsic to the dataset [1], but is a property of data distribution (hence perhaps in scope for DCAT). A dataset might be partitioned differently depending on user preference. Some users may prefer a geographic partitioning, others may prefer a temporal partitioning. Still others might want to partition by variable. One can imagine different catalogues serving the “same” data to different users in different ways (and in fact this does happen with large-volume geographic data like satellite imagery or global models).
>>>
>>> "I like to think about dataset partitioning as something simple, needing only three semantic ingredients: being able to say that a resource is a dataset, and being able to point to subsets and supersets."
>>>
>>> I agree with this. I think this is the “level zero” requirement for partitioning.
>>>
>>> [1] Actually, it probably depends on what you mean by "the dataset". If you mean the logical entity, then the partitioning is not a property of the dataset. But if you regard the dataset as a set of physical files, then maybe the partitioning *is* a property of the dataset.
>>>
>>> Cheers,
>>> Jon
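Jon's "level zero" requirement is small enough to sketch in a handful of triples. Assuming dct:hasPart / dct:isPartOf as the subset and superset links (one plausible choice - the thread does not fix the predicates) and hypothetical URIs:

    @prefix dcat: <http://www.w3.org/ns/dcat#> .
    @prefix dct:  <http://purl.org/dc/terms/> .
    @prefix ex:   <http://example.org/> .

    # 1. say that a resource is a dataset
    ex:globalModel a dcat:Dataset ;
        # 2. point to a subset
        dct:hasPart ex:globalModel-2015 .

    ex:globalModel-2015 a dcat:Dataset ;
        # 3. point back to the superset
        dct:isPartOf ex:globalModel .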
>>>
>>> On 3 Feb 2016, at 11:34, Maik Riechert <m.riechert@reading.ac.uk> wrote:
>>>
>>> Hi Frans,
>>>
>>> In my opinion, it all depends on how the actual data is made available. If it's a nice (possibly standard) API, then just link that as a distribution and you're done, I would say. Clients can explore subsets etc. through that API (which in itself should be self-describing and doesn't need any further metadata at the Distribution level, except media type if possible).
>>>
>>> However, if you really *just* have a bunch of files, as is quite common and which may be OK depending on data volume and intended users, then it gets more complicated if you want to allow efficient machine access without first being forced to download everything.
>>>
>>> So, yes, partitioning is intrinsic to the dataset, and that detail is exposed to DCAT to allow more efficient access, both for humans and machines. It is an optimization in the end, but in my opinion a quite useful one.
>>>
>>> I wonder how many different partition strategies are really used in the wild.
>>>
>>> Cheers
>>> Maik
>>>
>>> On 03.02.2016 11:13, Frans Knibbe wrote:
>>>
>>> Hello Andrea, all,
>>>
>>> I like to think about dataset partitioning as something simple, needing only three semantic ingredients: being able to say that a resource is a dataset, and being able to point to subsets and supersets. DCAT does not seem necessary for those three. Is there really a need to see dataset partitioning as DCAT territory? DCAT is a vocabulary for data catalogs; I see dataset partitioning as something intrinsic to the dataset - its structure.
>>>
>>> That said, data about the structure of a dataset is metadata, so it is interesting to think about how data and metadata are coupled. For easy navigation through the structure (by either man or machine) it is probably best to keep the data volume small - metadata only. But it would be nice to have the option to get the actual data from any dataset (at any structural level). That means that additional elements are needed: an indication of ways to get the actual data, dcat:Distribution for instance. Also, an indication of the size of the actual data would be very useful, to help decide whether to get the data or to dig a bit deeper for smaller subsets. Only at the deepest level of the structure, the leaves of the tree, could the actual data be returned by default. A friendly data provider will take care that those subsets contain manageable volumes of data.
>>>
>>> My thoughts have little basis in practice, but I am trying to set up an experiment with spatially partitioned data. I think there are many interesting possibilities. I hope to be able to share something practical with the group soon.
>>>
>>> Regards,
>>> Frans
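Frans's "metadata-only navigation plus a size hint" could look roughly as follows. The URIs and the CSV file are hypothetical; dcat:byteSize is the size indication he asks for:

    @prefix dcat: <http://www.w3.org/ns/dcat#> .
    @prefix dct:  <http://purl.org/dc/terms/> .
    @prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .
    @prefix ex:   <http://example.org/> .

    # A leaf of the structure: small enough to offer the actual data.
    ex:sensors-utrecht a dcat:Dataset ;
        dct:isPartOf ex:sensors-nl ;
        dcat:distribution [
            a dcat:Distribution ;
            dcat:downloadURL <http://example.org/data/sensors-utrecht.csv> ;
            dcat:mediaType <https://www.iana.org/assignments/media-types/text/csv> ;
            dcat:byteSize "2048576"^^xsd:decimal  # helps decide: fetch, or dig deeper?
        ] .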
>>>
>>> 2016-02-03 10:05 GMT+01:00 Andrea Perego <andrea.perego@jrc.ec.europa.eu>:
>>>
>>>> Many thanks for sharing this work, Maik!
>>>>
>>>> Just a couple of notes from my side:
>>>>
>>>> 1. Besides temporal coverage, it may be worth adding spatial coverage to your scenarios as another criterion of dataset partitioning. Actually, both criteria are frequently used concurrently.
>>>>
>>>> 2. In many of the scenarios you describe, dataset subsets are modelled as datasets. An alternative would be to model them just as distributions. So, I wonder whether those scenarios have requirements that cannot be met by the latter option.
>>>>
>>>> Some more words on point (2):
>>>>
>>>> As you probably know, there has been quite a long discussion in the DCAT-AP WG concerning this issue. The main points are probably summarised in the conversation recorded here:
>>>>
>>>> https://joinup.ec.europa.eu/asset/dcat_application_profile/issue/mo12-grouping-datasets
>>>>
>>>> Of course, in DCAT-AP the objective was how to describe dataset subsets, not the criteria for dataset subsetting.
>>>>
>>>> Notably, the discussion highlighted two different approaches: (a) dataset subsets modelled as datasets, or (b) dataset subsets modelled simply as distributions.
>>>>
>>>> I don't see the two scenarios above as mutually exclusive. You can use one or the other depending on your use case and requirements. And you can use both (e.g., referring to point (1): time-related subsets modelled as child datasets, and their space-related subsets as distributions). However, I personally favour the idea of using distributions as the recommended option, and datasets only if you cannot do otherwise. In particular, I see two main issues with the dataset-based approach:
>>>>
>>>> - It includes an additional step to get to the data (dataset -> dataset -> distribution). Moreover, subsetting can be recursive - which increases the number of steps needed to get to the data.
>>>>
>>>> - I understand that your focus is on data discovery from a machine perspective. However, looking at how this will be reflected in catalogues used by people, the result is that you're going to have a record for each child dataset, in addition to the parent one. This scenario is quite typical nowadays (I know quite a few examples of tens of records having the same title, description, etc. - or just a slightly different one), and it doesn't help at all people trying to find what they're looking for.
>>>>
>>>> Thanks
>>>>
>>>> Andrea
>>>>
>>>> On 02/02/2016 12:02, Maik Riechert wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> There has been a lot of discussion about subsetting data. I'd like to give a slightly different perspective, which is purely motivated from the point of view of someone who wants to publish data, and in parallel someone who wants to discover and access that data without much hassle.
>>>>>
>>>>> Of course it is hard to think about all scenarios, so I picked what I think are common ones:
>>>>> - a bunch of static data files without any API
>>>>> - an API without static data files
>>>>> - both
>>>>>
>>>>> And then some specific variations on what structure the data has (yearly data files, daily files, or another dimension used as a splitting point, such as spatial).
>>>>>
>>>>> It is in no way final or complete, and may even be wrong, but here is what I came up with: https://github.com/ec-melodies/wp02-dcat/wiki/DCAT-partitioning-ideas
>>>>>
>>>>> So it always starts by looking at what data exists and how it is exposed, and based on those constraints I tried to model that as DCAT datasets, sometimes with subdatasets. Again, it is purely motivated from a machine-access point of view. There may be other things to consider.
>>>>>
>>>>> The point of the wiki page is to have something concrete to discuss, not just abstract ideas. It should uncover problems, possible solutions, perspectives, etc.
>>>>>
>>>>> Happy to hear your thoughts,
>>>>> Maik
>>>>
>>>> --
>>>> Andrea Perego, Ph.D.
>>>> Scientific / Technical Project Officer
>>>> European Commission DG JRC
>>>> Institute for Environment & Sustainability
>>>> Unit H06 - Digital Earth & Reference Data
>>>> Via E. Fermi, 2749 - TP 262
>>>> 21027 Ispra VA, Italy
>>>>
>>>> https://ec.europa.eu/jrc/
>>
>> --
>> *Lewis*
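To make Andrea's approaches (a) and (b) concrete, a hedged side-by-side sketch with hypothetical URIs. Note the caveat on (b): DCAT itself does not define subset criteria such as temporal coverage on a Distribution, so the dct:temporal triple there is an assumption, not standard usage:

    @prefix dcat: <http://www.w3.org/ns/dcat#> .
    @prefix dct:  <http://purl.org/dc/terms/> .
    @prefix ex:   <http://example.org/> .

    # (a) subsets as child datasets: an extra hop, one catalogue record per subset
    ex:imagery a dcat:Dataset ;
        dct:hasPart ex:imagery-2015 .
    ex:imagery-2015 a dcat:Dataset ;
        dct:temporal ex:year2015 ;
        dcat:distribution ex:imagery-2015-dist .

    # (b) subsets as distributions of the parent: one hop, a single record
    ex:imagery2 a dcat:Dataset ;
        dcat:distribution ex:imagery2-2015-dist .
    ex:imagery2-2015-dist a dcat:Distribution ;
        dct:temporal ex:year2015 ;   # assumption: not defined by DCAT for
                                     # distributions, shown for illustration
        dcat:downloadURL <http://example.org/imagery-2015.tiff> .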
Received on Wednesday, 10 February 2016 21:59:00 UTC