RE: Best practice for a loosely-structured catalog? from andrea.perego@ec.europa.eu on 2018-06-08 (public-dxwg-wg@w3.org from June 2018)

From: <andrea.perego@ec.europa.eu>
Date: Fri, 8 Jun 2018 22:27:43 +0000
To: <mail@makxdekkers.com>, <Simon.Cox@csiro.au>, <public-dxwg-wg@w3.org>
CC: <Jonathan.Yu@csiro.au>
Message-ID: <EDFF15E839F79242AA55B1468C63DDA908E046B3@S-DC-ESTG02-J.net1.cec.eu.int>
Makx, Simon,

In the extension of DCAT-AP that we use in the JRC Data Catalogue, besides distributions we typically have (a) related publications and (b) "other resources" (a catch-all category covering everything that is neither a distribution nor a publication). As I said elsewhere [1,2], related publications are specified via dct:isReferencedBy, whereas "other resources" are specified with dct:relation (used as a generic relationship linking a dataset to any kind of related resource). So, this use case may support the idea of making dcat:distribution a subproperty of dct:relation.
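
A minimal Turtle sketch of the pattern described above. The ex: namespace and all resource names are hypothetical placeholders, not actual JRC Data Catalogue identifiers:

```turtle
@prefix dct:  <http://purl.org/dc/terms/> .
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix ex:   <http://example.org/> .      # hypothetical namespace

ex:dataset-1 a dcat:Dataset ;
    dcat:distribution  ex:dataset-1-csv ;    # a representation of the data itself
    dct:isReferencedBy ex:publication-1 ;    # (a) a related publication
    dct:relation       ex:project-page .     # (b) an "other resource"
```

Since dct:isReferencedBy is already declared a subproperty of dct:relation in DCMI Metadata Terms, only dcat:distribution falls outside the generic relationship in this sketch.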

BTW, this pattern is reflected in our CKAN extension – see, e.g.:

http://data.jrc.ec.europa.eu/dataset/jrc-predict-predict2017-core

Regarding the fact that the majority of data catalogues use a simple metadata pattern, this is also my experience. Hierarchical "is part of" relationships are far from common. There may be a number of reasons. For instance, if metadata are created manually (as is still usually the case), maintaining such relationships would require a high effort. Also, in the geospatial domain, where this notion exists explicitly ("dataset series"), what is documented is just the "root" dataset, and the children are not even linked to. Another issue may be related to limitations of catalogue platforms, which typically do not support this feature, or to the usability issues that result from burdening users with choosing among a long list of datasets that are almost identical except for some variables (e.g., spatial and/or temporal coverage).

It is also worth noting that the approach used for specifying hierarchical relationships depends very much on the domain and on the specific characteristics of a dataset. We have to deal with this issue quite often in the JRC Data Catalogue, and the approaches used are very different. For example: one dataset with a distribution for each of its children; or one dataset for each child dataset, with no record for the parent.
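
The two approaches mentioned above could look roughly as follows in Turtle. This is only an illustrative sketch: the ex: namespace, the resource names, and the titles are invented for the example.

```turtle
@prefix dct:  <http://purl.org/dc/terms/> .
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix ex:   <http://example.org/> .      # hypothetical namespace

# Approach 1: one dataset record, with one distribution per child.
ex:series a dcat:Dataset ;
    dct:title "Example series" ;
    dcat:distribution ex:series-2016 , ex:series-2017 .

# Approach 2: one dataset record per child, and no record for the parent.
ex:child-2016 a dcat:Dataset ;
    dct:title "Example series, 2016" .

ex:child-2017 a dcat:Dataset ;
    dct:title "Example series, 2017" .
```

In the second approach nothing in the metadata links the children to each other, which is the situation a recommendation on hierarchical relationships would need to cover.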

So we should probably take this situation into account when providing recommendations on how to model hierarchical/subsetting relationships, and propose alternative options depending on the specific use case.

Cheers,

Andrea

[1] https://www.w3.org/TR/dcat-ucr/#ID9
[2] https://github.com/w3c/dxwg/issues/63#issuecomment-362108520

----
Andrea Perego, Ph.D.
Scientific / Technical Project Officer
European Commission DG JRC
Directorate B - Growth and Innovation
Unit B6 - Digital Economy
Via E. Fermi, 2749 - TP 262
21027 Ispra VA, Italy

https://ec.europa.eu/jrc/

----
The views expressed are purely those of the writer and may
not in any circumstances be regarded as stating an official
position of the European Commission.

From: Makx Dekkers [mailto:mail@makxdekkers.com]
Sent: Friday, June 08, 2018 3:21 PM
To: Simon.Cox@csiro.au; public-dxwg-wg@w3.org
Cc: Jonathan.Yu@csiro.au
Subject: RE: Best practice for a loosely-structured catalog?


Simon,

This is indeed an issue that came up in the development of DCAT-AP. In particular, CKAN is quite liberal in what it accepts as a “Resource” related to a Dataset. The discussion was whether you could map CKAN Resource to DCAT Distribution, and it was clear that such a mapping would have unwanted effects. This is also related to my earlier question on how “similar” distributions need to be, which led to a statement that they need to be “informationally equivalent” (https://github.com/w3c/dxwg/issues/52).

I support your proposed solution to use dct:relation as a catch-all and to allow for further specialisation whenever necessary and possible.

Makx.

From: Simon.Cox@csiro.au <Simon.Cox@csiro.au>
Sent: 08 June 2018 03:38
To: public-dxwg-wg@w3.org
Cc: Jonathan.Yu@csiro.au
Subject: Best practice for a loosely-structured catalog?

Catalogueers:

I’ve been doing some investigations of some local repositories and catalogues, and have uncovered that in many cases ‘datasets’ are ‘just a bag of files’. There is no distinction made between part/whole, distribution (representation), and other kinds of relationship (e.g. documentation, schema, supporting documents). So while the precision we are aiming for in DCAT is clearly valuable in terms of semantics, it is difficult to implement on these legacy systems. Mostly I see people using the Dataset-distribution-> relationship for everything … which is clearly incorrect in many cases. But I doubt if we are unusual in this.

I’m thinking about how to advise on this, while not actually breaking DCAT.

If we made dcat:distribution a sub-property of dct:relation

dcat:distribution rdfs:subPropertyOf dct:relation .

then I think we would have a reasonable recommendation for simple repositories.
We could tell repositories that use the ‘just a bag of files’ approach to say

               :Dataset987 a dcat:Dataset ;
                              dct:relation <file1> , <file2> , <file3> , <file4> , <file5> , <file6> , <file7> … .

which would not be inconsistent with a later reclassification to

               :Dataset987 a dcat:Dataset ;
                              dct:hasPart <file1> , <file2> ;
                              dcat:distribution <file3> , <file4> ;
                              dct:conformsTo <file5> ;
                              dct:requires <file6> ;
                              dct:references <file7> .
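
To spell out why the reclassification above would stay consistent with the earlier ‘bag of files’ statements, here is a small sketch of the RDFS reasoning involved. The @base and the default prefix are hypothetical, only two of the seven files are shown, and the first axiom is the one proposed in this message rather than an existing DCAT statement:

```turtle
@base          <http://example.org/> .                     # hypothetical base IRI
@prefix :      <http://example.org/> .                     # hypothetical namespace
@prefix rdfs:  <http://www.w3.org/2000/01/rdf-schema#> .
@prefix dct:   <http://purl.org/dc/terms/> .
@prefix dcat:  <http://www.w3.org/ns/dcat#> .

# The axiom proposed in this message.
dcat:distribution rdfs:subPropertyOf dct:relation .

# Already declared in DCMI Metadata Terms.
dct:hasPart    rdfs:subPropertyOf dct:relation .
dct:conformsTo rdfs:subPropertyOf dct:relation .
dct:requires   rdfs:subPropertyOf dct:relation .
dct:references rdfs:subPropertyOf dct:relation .

# Asserted after reclassification.
:Dataset987 dct:hasPart       <file1> ;
            dcat:distribution <file3> .

# Entailed by RDFS sub-property reasoning (rule rdfs7), so the earlier
# generic statements remain true.
:Dataset987 dct:relation <file1> , <file3> .
```

An RDFS-aware client querying for dct:relation would therefore still find every file, whichever of the two descriptions a repository publishes.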

If this is not all mad, I will add a new use-case - something like ‘Mapping from simple repository model’ – as justification, and propose this tiny enhancement.

Simon

Simon J D Cox
Research Scientist - Environmental Informatics
Team Leader – Environmental Information Infrastructure
CSIRO Land and Water <http://www.csiro.au/Research/LWF>

E simon.cox@csiro.au T +61 3 9545 2365 M +61 403 302 672
   Mail: Private Bag 10, Clayton South, Vic 3169
   Visit: Central Reception, Research Way, Clayton, Vic 3168
   Deliver: Gate 3, Normanby Road, Clayton, Vic 3168
http://people.csiro.au/Simon-Cox
http://orcid.org/0000-0002-3884-3420
https://www.researchgate.net/profile/Simon_Cox3
https://github.com/dr-shorthair
http://lov.okfn.org/dataset/lov/agents/Simon%20Cox
Twitter @dr_shorthair <https://twitter.com/dr_shorthair>
Skype dr_shorthair
https://xkcd.com/1810/

Received on Friday, 8 June 2018 22:28:14 UTC