Re: [dxwg] Dataset series (#868) from Dave Reynolds via GitHub on 2019-08-05 (public-dxwg-wg@w3.org from August 2019)

From: Dave Reynolds via GitHub <sysbot+gh@w3.org>
Date: Mon, 05 Aug 2019 14:03:47 +0000
To: public-dxwg-wg@w3.org
Message-ID: <issue_comment.created-518247589-1565013826-sysbot+gh@w3.org>

Sorry to be late with this comment, haven't been following this work.

As an outsider to this WG, can I reinforce the importance of this issue? This has been, and continues to be, a substantial pain point in our attempts to use DCAT. Sad to hear that it won't be addressed for DCAT 2.

In our experience with public sector datasets it is relatively rare for a dataset to be a unitary thing which can be downloaded in its entirety. More typically the non-realtime datasets we see comprise a series of updates (annual, quarterly, monthly, etc., as determined by some release cycle). Where possible we provide data services and dumps for the whole series. However, both users and publishers want to see the updates explicitly as individual elements they can download separately, while still regarding the collection of those updates as a single dataset with common metadata. They want the data, and the presentation of it, to reflect that.

Possible approaches to this include:

1. Model each such dataset as a `dcat:Catalog` which then references each update as a separate `dcat:Dataset` with its own distribution, but put all the common metadata on the `dcat:Catalog`. This could work, but then it is hard for a generic client to tell the difference between this use of `dcat:Catalog` and the "normal" uses of `dcat:Catalog` as (possibly hierarchical) collections of _heterogeneous_ datasets. It's also hard to then point to a Distribution for the whole dataset. It would be possible to support this pattern through a marker subclass of catalog (`dcat:DatasetSeries` or some such).

2. Use `dcat:Dataset` for the series but allow a dataset to have multiple _partial_ distributions, each with a separate temporal/spatial/other extent. This could work, but the existing text and UML don't encourage per-Distribution extents and imply that a Distribution covers a whole dataset. Furthermore, if you have different formats available for each update then the relation between the different partial Distributions would be obscure.

3. Introduce a separate notion of a _partition_ or _element_ of a dataset which can have its own extent information and its own Distribution(s). This is the route we've used up to now. It works fine within our own systems, but it means that an external client expecting DCAT can't see the individual elements in the series. Sadly this is usually the grain size a harvester actually wants to see.
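
To make the trade-offs concrete, here is a minimal Python sketch of the three options as plain triples, together with a toy "generic client" that only understands `dcat:Dataset` and `dcat:distribution`. It is an illustration under stated assumptions, not a recommended model: the `http://example.org/` names, the release labels, the `ex:element` property and the `ex:DatasetElement` class are all hypothetical, and `dct:temporal` is given a plain string rather than a proper `dct:PeriodOfTime`.

```python
# Plain (subject, predicate, object) tuples; no RDF library needed.
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
DCAT = "http://www.w3.org/ns/dcat#"
DCT = "http://purl.org/dc/terms/"
EX = "http://example.org/"          # hypothetical namespace
SERIES = EX + "series"              # hypothetical series resource
UPDATES = ["2018-Q1", "2018-Q2"]    # hypothetical release labels


def common(series_class):
    """Metadata shared by the whole series."""
    return [(SERIES, RDF_TYPE, series_class),
            (SERIES, DCT + "title", "Example quarterly series")]


def option1():
    """Series is a dcat:Catalog; each update is a dcat:Dataset."""
    triples = common(DCAT + "Catalog")
    for u in UPDATES:
        ds, dist = EX + "update/" + u, EX + "dist/" + u
        triples += [(SERIES, DCAT + "dataset", ds),
                    (ds, RDF_TYPE, DCAT + "Dataset"),
                    (ds, DCT + "temporal", u),  # simplified: literal extent
                    (ds, DCAT + "distribution", dist),
                    (dist, RDF_TYPE, DCAT + "Distribution")]
    return triples


def option2():
    """Series is one dcat:Dataset with partial, per-extent distributions."""
    triples = common(DCAT + "Dataset")
    for u in UPDATES:
        dist = EX + "dist/" + u
        triples += [(SERIES, DCAT + "distribution", dist),
                    (dist, RDF_TYPE, DCAT + "Distribution"),
                    (dist, DCT + "temporal", u)]  # assumption: extent on the distribution
    return triples


def option3():
    """Series is one dcat:Dataset; updates are non-DCAT 'elements'."""
    triples = common(DCAT + "Dataset")
    for u in UPDATES:
        el, dist = EX + "element/" + u, EX + "dist/" + u
        triples += [(SERIES, EX + "element", el),           # hypothetical property
                    (el, RDF_TYPE, EX + "DatasetElement"),  # hypothetical class
                    (el, DCT + "temporal", u),
                    (el, DCAT + "distribution", dist),
                    (dist, RDF_TYPE, DCAT + "Distribution")]
    return triples


def instances(triples, cls):
    """Subjects explicitly typed as the given class."""
    return sorted(s for s, p, o in triples if p == RDF_TYPE and o == cls)


def harvest(triples):
    """What a DCAT-only client finds: each dataset and its linked distributions."""
    return {ds: sorted(o for s, p, o in triples
                       if s == ds and p == DCAT + "distribution")
            for ds in instances(triples, DCAT + "Dataset")}
```

Under these assumptions, `harvest(option1())` finds two datasets with one distribution each (but the series itself is only visible as a catalog), `harvest(option2())` finds one dataset with two distributions whose relation to each other is not stated, and `harvest(option3())` finds one dataset with no distributions reachable through DCAT terms, which is the harvester problem described in option 3.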

Even if you can't recommend a specific pattern for DCAT 2, would you be able to give some indication of the likely direction of travel (as a guide to those of us who need to work around the limitation in the meantime)?

--
GitHub Notification of comment by der
Please view or discuss this issue at https://github.com/w3c/dxwg/issues/868#issuecomment-518247589 using your GitHub account

Received on Monday, 5 August 2019 14:04:20 UTC