
Re: [dxwg] Dataset series (#868)

From: matthiaspalmer via GitHub <sysbot+gh@w3.org>
Date: Fri, 20 Sep 2019 22:32:24 +0000
To: public-dxwg-wg@w3.org
Message-ID: <issue_comment.created-533731730-1569018743-sysbot+gh@w3.org>
@makxdekkers I think you misunderstood me. I fully agree that distributions correspond to different representations of a dataset. Multiple distributions should **not** be used to point to individual files that together form the dataset. What I am proposing is that a **single** distribution is made up of **several files**, pointed to by a repeated dcat:downloadURL. I have said this higher up in the thread, but maybe it got lost in all the comments.
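A minimal Turtle sketch of what I mean (all identifiers and file URLs here are hypothetical examples, not from any real catalog):

```turtle
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dct:  <http://purl.org/dc/terms/> .

<https://example.org/dataset/budget> a dcat:Dataset ;
    dct:title "Municipal budget" ;
    dcat:distribution <https://example.org/dataset/budget/csv> .

# One distribution, several files: dcat:downloadURL is simply repeated.
<https://example.org/dataset/budget/csv> a dcat:Distribution ;
    dcat:mediaType <https://www.iana.org/assignments/media-types/text/csv> ;
    dcat:downloadURL <https://example.org/files/budget-2019-q1.csv> ,
                     <https://example.org/files/budget-2019-q2.csv> .
```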

Furthermore, my argument that developers are going to get this wrong is based on experience, not speculation. Implementations of harvesting software at the national level have shown severe problems in getting even the existing DCAT-AP right, ranging from failing to produce correct RDF to expressing more than half of the fields incorrectly. I have seen this in at least four existing vendors. I think the reason is a combination of lack of RDF knowledge, low priority given to standards compliance, and the effort needed to make pre-existing information models fit the specification. But maybe I have had bad luck and others have had better experiences harvesting from different vendors; I am sure people at EDP can tell you a lot more about this.

I represent a company that takes pride in building everything on top of RDF and linked data principles, so it is in our DNA to go the extra mile to get things right semantically. But still, at the end of the day we have to make our customers happy, which implies a good user experience. We cannot force them to create one new dataset per uploaded file; that would make no sense. We could potentially hide this from them by treating certain datasets as files, but that would require some careful thought and copy-pasting of metadata between these file-oriented datasets.

I think an important aspect of an information model like DCAT is that its semantics should feel natural. The current specification says, under [DCAT scope](https://w3c.github.io/dxwg/dcat/#dcat-scope):

> A dataset is a collection of data, published or curated by a single agent. Data comes in many forms including numbers, words, pixels, imagery, sound and other multi-media, and potentially other types, any of which might be collected into a dataset.

If for some reason the data provider (the agent) needs to divide the dataset into smaller parts, due to its size or to practical maintenance issues, that is a question of how to access its representation. It should **not** put bounds on the scope of the dataset. If a data owner thinks of their budget as a single dataset because it is described, published and curated in a unified manner, should we then tell them: no, that is not a dataset, because you have divided it into multiple files?

And what would happen if the data provider happened to provide the budget via an API in addition to downloadable files? Having one dataset per downloadable file now becomes really awkward, because each of these datasets would need two distributions: one for the downloadable file, and a second for the API with some restriction (a parameter) allowing you to access exactly the same information as is available in the downloadable file. It is not even certain that the API could support this, as it would depend on how the downloadable files have been divided.
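To make the awkwardness concrete, here is roughly what the one-dataset-per-file reading would force you to write (again, all identifiers and the query parameter are hypothetical), with the metadata duplicated across every file-shaped dataset:

```turtle
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dct:  <http://purl.org/dc/terms/> .

# One dataset per file, each needing its own pair of distributions.
<https://example.org/dataset/budget-q1> a dcat:Dataset ;
    dct:title "Municipal budget, Q1" ;
    dcat:distribution <https://example.org/dataset/budget-q1/file> ,
                      <https://example.org/dataset/budget-q1/api> .

<https://example.org/dataset/budget-q1/file> a dcat:Distribution ;
    dcat:downloadURL <https://example.org/files/budget-2019-q1.csv> .

# The API distribution must somehow be restricted to exactly this file's slice.
<https://example.org/dataset/budget-q1/api> a dcat:Distribution ;
    dcat:accessURL <https://example.org/api/budget?quarter=1> .

# ...and the same structure repeated for budget-q2, budget-q3, and so on.
```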

With the approach I outlined, it would simply be one dataset with two distributions: the first pointing to the downloadable files via repeated dcat:downloadURL, and the second pointing to the API (potentially using a DataService instance).
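In Turtle, that would look roughly like this (hypothetical identifiers; the DataService pattern follows the current DCAT revision draft):

```turtle
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dct:  <http://purl.org/dc/terms/> .

<https://example.org/dataset/budget> a dcat:Dataset ;
    dct:title "Municipal budget" ;
    dcat:distribution <https://example.org/dataset/budget/files> ,
                      <https://example.org/dataset/budget/api> .

# First distribution: the downloadable files, via repeated dcat:downloadURL.
<https://example.org/dataset/budget/files> a dcat:Distribution ;
    dcat:downloadURL <https://example.org/files/budget-2019-q1.csv> ,
                     <https://example.org/files/budget-2019-q2.csv> .

# Second distribution: the API, described via a dcat:DataService.
<https://example.org/dataset/budget/api> a dcat:Distribution ;
    dcat:accessURL <https://example.org/api/budget> ;
    dcat:accessService <https://example.org/services/budget-api> .

<https://example.org/services/budget-api> a dcat:DataService ;
    dcat:endpointURL <https://example.org/api/budget> .
```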

GitHub Notification of comment by matthiaspalmer
Please view or discuss this issue at https://github.com/w3c/dxwg/issues/868#issuecomment-533731730 using your GitHub account
Received on Friday, 20 September 2019 22:32:26 UTC
