DCAT scope and other suggestions from vasily.bunakov@stfc.ac.uk on 2012-04-11 (public-gld-comments@w3.org from April 2012)

From: <vasily.bunakov@stfc.ac.uk>
Date: Wed, 11 Apr 2012 15:58:24 +0000
To: <public-gld-comments@w3.org>
CC: <michael.wilson@stfc.ac.uk>
Message-ID: <2593C73BF2969942A2CA1F7C437688FB0D08A53E@EXCHMBX03.fed.cclrc.ac.uk>

Hi,
A few comments on http://www.w3.org/TR/2012/WD-vocab-dcat-20120405/ follow. They are different in nature, so I have numbered them to facilitate the discussion.

1) The Abstract does not seem to introduce the actual DCAT scope. It reads: "By using DCAT to describe datasets in data catalogs, publishers increase discoverability and enable applications easily to consume metadata from multiple catalogs." This seems to define the scope of DCAT as:
1.a) Devoted to datasets only.
The Abstract does not show an intention to consider lower data organization levels (individual items or records), or any upper data organization levels (data set aggregations/collections, data catalogues, data services). However, there are "Catalog record" and "Catalog" classes defined.
1.b) Focused on metadata only.
The Abstract does not show an intention to facilitate data consumption (beyond metadata). However, there are "Distribution", "Download", "WebService", and "Feed" classes defined, which are intended to deal with the data itself.

2) There is no versioning for Catalog, Dataset or other classes. So what DCAT describes are in fact "classes", not "instances", even though data consumers (as well as publishers, for their own purposes) may be interested in knowing about the exact instances of DCAT classes and in following the history of changes.
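
To illustrate point 2, here is a minimal sketch in Python of a dataset description that keeps a history of its versions. The class names, field names and sample values are all hypothetical; none of them is part of DCAT.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class DatasetVersion:
    """One concrete, immutable state of a dataset description (hypothetical)."""
    version: str
    issued: date
    change_note: str


@dataclass
class VersionedDataset:
    """A dataset description that keeps its full change history (hypothetical)."""
    title: str
    history: List[DatasetVersion] = field(default_factory=list)

    def publish(self, version: str, issued: date, change_note: str) -> DatasetVersion:
        # Append a new version instead of overwriting the previous one.
        entry = DatasetVersion(version, issued, change_note)
        self.history.append(entry)
        return entry

    def latest(self) -> Optional[DatasetVersion]:
        # The most recently published version, or None if nothing was published.
        return self.history[-1] if self.history else None


# Made-up sample values, for illustration only.
ds = VersionedDataset(title="Example survey data")
ds.publish("1.0", date(2012, 1, 15), "Initial release")
ds.publish("1.1", date(2012, 4, 5), "Corrected two records")
```

Here a consumer could ask for ds.latest() or walk ds.history to follow the changes, which is the kind of question the current draft does not seem to support.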

3) Related to 2) but an issue of its own (wider than 2): there is no Provenance class, or similar means to describe the origin of the dataset. If DCAT is concerned with data discoverability only, this is fine; however, data consumers are usually interested in data and metadata origins, as this is an indirect means of judging data and metadata quality and trust. Data publishers may be interested in supplying enough information about provenance, too. I would suggest introducing a Provenance class, or thinking of other means to support this important aspect of datasets.

4) The Publisher role recognized by DCAT is not the only role with respect to the Catalog or the Dataset that data consumers may be interested in. There could also be Author, Curator, Owner (for legal purposes), or Funding Body.

5) The "Dataset:license" property may not be enough to describe properly all the regulation surrounding the dataset. As an example, there can be pre-conditions for the dataset use (like registration, membership, or a fee paid) as well as post-conditions for the dataset use (like disposal of the data after the data consumer has, well, consumed it - we encountered this sort of requirement for microdata in social science). The pre-conditions or post-conditions are not necessarily a part of the license, which typically covers the period of the actual data use only.

6) "Dataset: spatial/geographical coverage", "Dataset: temporal coverage", "Dataset: frequency" and "Dataset: granularity" in fact seem to be characteristics of "scale", so you may wish to combine them into one property (with a reference to some other vocabulary), or find another way to link them logically.

7) Metadata for the Catalog and the Dataset are defined (in the form of their DCAT properties), but the metadata describing the meaning of the Dataset items/records seems to be missing. So someone can, e.g., download an Excel file described with the DCAT vocabulary, but the meaning of the columns and the units of measure in them may easily remain unclear. This would not be an issue if DCAT were really devoted to dataset metadata only, but please see issue 1.b) above, which suggests that DCAT is actually concerned with data retrieval, too.
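
As a rough illustration of point 7, the sketch below (Python) shows the kind of per-column description that could accompany a downloaded file. ColumnDescription and describe_columns are hypothetical names, not DCAT terms, and the sample columns are made up.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class ColumnDescription:
    """Meaning and unit of one column in a tabular file (hypothetical)."""
    name: str
    meaning: str
    unit: Optional[str] = None  # None for columns without a unit


def describe_columns(columns: List[ColumnDescription]) -> Dict[str, str]:
    """Map each column name to a readable label, including the unit if any."""
    labels = {}
    for col in columns:
        labels[col.name] = f"{col.meaning} [{col.unit}]" if col.unit else col.meaning
    return labels


# Made-up sample columns, for illustration only.
columns = [
    ColumnDescription("temp", "Air temperature", "degC"),
    ColumnDescription("station", "Weather station identifier"),
]
labels = describe_columns(columns)
```

With something along these lines attached to a Distribution, a consumer would at least know what each column holds and in which unit.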

8) It is not entirely clear why Download, WebService and Feed are separate classes. They look like variations or subclasses of the Distribution class.
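
The reading suggested in point 8 can be sketched as a simple class hierarchy (Python). This is only an assumption about how the classes might relate, not a statement of how the draft defines them; the access_url field and the example.org URLs are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Distribution:
    """Any form in which a dataset is made available (hypothetical model)."""
    access_url: str

    def kind(self) -> str:
        # Name of the concrete class, e.g. "Download".
        return type(self).__name__


class Download(Distribution):
    """A file that can be fetched directly."""


class WebService(Distribution):
    """Access through a service endpoint."""


class Feed(Distribution):
    """Access through a feed."""


# Made-up URLs, for illustration only.
distributions = [
    Download("http://example.org/data.csv"),
    WebService("http://example.org/api"),
    Feed("http://example.org/feed"),
]
kinds = [d.kind() for d in distributions]
```

Treated this way, an application can handle every item as a Distribution and only look at the concrete kind when it needs to.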

9) An access / audit log, possibly defined as a separate class, may be of interest both to the dataset publishers/owners/authors/funding bodies and to the dataset consumers. It would allow one to judge the dataset's popularity and modes of re-use, and would serve as another type of link (not "referenced-by" but "used-by") that may improve the dataset's discoverability.

With kind regards,
Vasily Bunakov
STFC e-Science

Received on Thursday, 12 April 2012 14:31:21 UTC