Re: DCAT scope and other suggestions from Fadi Maali on 2012-04-19 (public-gld-comments@w3.org from April 2012)

From: Fadi Maali <fadi.maali@deri.org>
Date: Thu, 19 Apr 2012 13:20:44 +0100
To: <vasily.bunakov@stfc.ac.uk> <vasily.bunakov@stfc.ac.uk>
Cc: <public-gld-comments@w3.org>, <michael.wilson@stfc.ac.uk>
Message-Id: <779CB764-0D52-4CCC-BAE9-DE0AE7B36763@deri.org>
Hi Vasily,

Thanks for your feedback, and apologies for the late reply. Please find my comments inline.

On 11 Apr 2012, at 16:58, <vasily.bunakov@stfc.ac.uk> <vasily.bunakov@stfc.ac.uk> wrote:

> Hi,
> 
> A few comments on http://www.w3.org/TR/2012/WD-vocab-dcat-20120405/ They are different in nature, so they are enumerated to facilitate the discussion.
> 
>  
> 1) The Abstract does not seem to introduce the actual DCAT scope. It reads: "By using DCAT to describe datasets in data catalogs, publishers increase discoverability and enable applications easily to consume metadata from multiple catalogs." This seems to define the scope of DCAT as:
> 1.a) Devoted to datasets only.
> The Abstract does not show an intention to consider lower data organization levels (individual items or records), or any upper data organization levels (data set aggregations/collections, data catalogues, data services). However, there are "Catalog record" and "Catalog" classes defined.
> 1.b) Focused on metadata only.
> The Abstract does not show an intention to facilitate the data consumption (beyond metadata). However, there are "Distribution", "Download", "WebService", and "Feed" classes defined which are intended to deal with the data.

Actually, the scope of DCAT is to describe the metadata of datasets, not their contents. However, it also describes access information (i.e. how a user might access the actual data). I support reflecting this explicitly in the abstract.


> 2) There is no versioning for Catalog, Dataset or other classes. So they are in fact "classes" in DCAT, not "instances", even though data consumers (as well as publishers, for their own purposes) may be interested in knowing the exact instances of DCAT classes and in following the history of changes.
>  
> 3) Related to 2) but an issue of its own (wider than 2): there is no Provenance class, or similar means to describe the origination of the dataset. If DCAT is concerned about the data discoverability only, this is fine; however, data consumers are usually interested in data and metadata origins as this is the indirect means to judge on the data and metadata quality and trust. Data publishers may be interested in supplying enough information about provenance, too.  I would suggest to introduce the Provenance class, or think of other means to support this important aspect of datasets.

A provenance description can go alongside the DCAT description. Another W3C working group is working on provenance (http://www.w3.org/2011/prov/wiki/Main_Page ), and nothing prevents using the two vocabularies together when needed. DCAT also provides the Record class, which describes the dataset *entry* in the catalog. For example, the creation date of a Record instance is the date on which the dataset was listed in the catalog, not the creation date of the dataset itself. This distinction allows describing the provenance of both the dataset and its entry in the catalog.
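
To illustrate the Record/Dataset distinction, here is a minimal Python sketch that holds a handful of triples as plain tuples. The example.org URIs and the dates are made up, and the choice of dcat:CatalogRecord, foaf:primaryTopic and dct:issued is my assumption about how the draft's terms would be applied, not normative usage.

```python
# Namespaces (the example.org resources below are hypothetical).
DCAT = "http://www.w3.org/ns/dcat#"
DCT = "http://purl.org/dc/terms/"
FOAF = "http://xmlns.com/foaf/0.1/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

dataset = "http://example.org/dataset/rainfall"
record = "http://example.org/catalog/record/42"

# Triples as (subject, predicate, object) tuples; the dates are invented.
triples = [
    (dataset, RDF_TYPE, DCAT + "Dataset"),
    (dataset, DCT + "issued", "2009-11-30"),   # when the dataset itself was published
    (record, RDF_TYPE, DCAT + "CatalogRecord"),
    (record, FOAF + "primaryTopic", dataset),  # the record is about the dataset
    (record, DCT + "issued", "2012-04-05"),    # when the entry was added to the catalog
]


def issued(subject):
    """Return the dct:issued value recorded for the given subject, or None."""
    for s, p, o in triples:
        if s == subject and p == DCT + "issued":
            return o
    return None


print("dataset issued:", issued(dataset))
print("catalog entry issued:", issued(record))
```

The two dates are attached to different resources, so a provenance vocabulary could annotate either one independently.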

> 
> 4) Publisher role recognized by DCAT is not the only role in respect to the Catalog or the Dataset that data consumers may be interested in. There could be Author or Curator, or Owner for legal purposes, or Funding Body.

That's true. However, these roles did not show up very frequently in the survey of catalogs we did. They can still be described with dcterms or other specialized vocabularies, but we chose not to include them in the DCAT description because they are not very frequent.
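
As a rough sketch of what "described by dcterms" could look like, the Python below maps some of the roles you list to Dublin Core properties. The mapping is my own assumption (dct:creator for Author, dct:rightsHolder for Owner); Curator and Funding Body are deliberately left unmapped because they would need a more specialized vocabulary. All example.org URIs are hypothetical.

```python
DCT = "http://purl.org/dc/terms/"

# Hypothetical mapping from the roles mentioned above to Dublin Core terms.
# Roles with no obvious dcterms match are left as None on purpose.
ROLE_TO_PROPERTY = {
    "publisher": DCT + "publisher",
    "author": DCT + "creator",
    "owner": DCT + "rightsHolder",
    "curator": None,
    "funding body": None,
}


def describe_roles(dataset_uri, agents):
    """Build (subject, predicate, object) triples for roles that have a mapped property."""
    triples = []
    for role, agent_uri in agents.items():
        prop = ROLE_TO_PROPERTY.get(role)
        if prop is not None:
            triples.append((dataset_uri, prop, agent_uri))
    return triples


example = describe_roles(
    "http://example.org/dataset/1",
    {
        "author": "http://example.org/agent/alice",
        "funding body": "http://example.org/agent/funder",
    },
)
print(example)
```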
>  
> 5) "Dataset:license" property may not be enough for the proper description of all the regulation surrounding the dataset. As an example, there can be pre-conditions for the dataset use (like registration, membership, or a fee paid) as well as post-conditions for the dataset use (like disposal of data after the data consumer, well, consumed it - we encountered this sort of requirement for microdata in social science). The pre-conditions or post-conditions are not necessarily a part of the license that typically covers the period of the actual data use only.

Representing data licenses is certainly important; however, it falls outside the scope of DCAT. Catalogs currently provide only a link to a license. Asking catalog publishers to provide more fine-grained descriptions raises the entry barrier and puts their adoption of the model at risk.
>  
> 6) "Dataset: spatial/geographical coverage", "Dataset: temporal coverage", "Dataset: frequency" and "Dataset: granularity" in fact seem to be the characteristics of "scale" so you may wish to combine them in one property (with a reference to some other vocabulary), or find other means to logically link them.

These all come from dcterms, except granularity. There was a previous suggestion to define a general coverage property and have both spatial and temporal coverage specialize it; however, there was no clear use case or advantage in doing so. If you have one in mind, please share it.

>  
> 7) Metadata for the Catalog and the Dataset are defined (in the form of their DCAT properties) but the metadata describing the meaning of the Dataset items/records seems missing. So someone can e.g. download the Excel file described with the DCAT vocabulary but the meaning of columns, and the units of measure in them may easily remain unclear. This would not be an issue if DCAT were really devoted to the datasets metadata only but please see the issue 1.b) above that tells DCAT is actually concerned about the data retrieval, too.

Datasets included in catalogues are very heterogeneous in terms of both format and domain. Not all of them are CSV, so notions such as "meaning of columns" do not apply to all of them. We avoided attempting to capture the meaning of the data (beyond describing category and keyword) for two reasons:
(a) this information is not currently provided in the catalogs
(b) this is a huge task which falls outside the scope and the available resources

> 8) Why Download, WebService and Feed are separate classes is not ultimately clear. They look like variations or subclasses of Distribution class.

They are defined as subclasses of Distribution.
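
A small Python sketch of what that means for a consumer, assuming the three subclass statements are the only ones involved; the prefixed names are shorthand and the helper function is hypothetical, not part of any library:

```python
# rdfs:subClassOf statements as described in this thread (prefixed names for brevity).
SUBCLASS_OF = {
    "dcat:Download": "dcat:Distribution",
    "dcat:WebService": "dcat:Distribution",
    "dcat:Feed": "dcat:Distribution",
}


def is_subclass_of(cls, ancestor):
    """Follow subclass links upward from cls and report whether ancestor is reached."""
    seen = set()
    while cls is not None and cls not in seen:
        if cls == ancestor:
            return True
        seen.add(cls)
        cls = SUBCLASS_OF.get(cls)
    return False


for name in ("dcat:Download", "dcat:WebService", "dcat:Feed", "dcat:Dataset"):
    print(name, is_subclass_of(name, "dcat:Distribution"))
```

So anything typed as a Download, WebService or Feed can also be treated as a Distribution by an application that only knows the general class.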

>  
> 9) The access / audit log, defined possibly as a separate class, may be of interest both for the dataset publishers/owners/authors/funding bodies as well as for the  dataset consumers. It would allow to judge on the dataset popularity and modes of the dataset re-use, as well as serve as another type of link (not a "referenced-by" but "used-by") that may improve the dataset discoverability.
>  

Usage data is indeed interesting and is often used to measure the "success" of open data initiatives. In my opinion, these are statistics about the dataset and are better described using the Data Cube Vocabulary (http://www.w3.org/TR/vocab-data-cube/ ). Please note that it is interesting to represent usage broken down by country, demographics, etc., which requires powerful capabilities and can't be captured by adding a few properties to DCAT. Data Cube is designed specifically for this purpose.

I hope that helps make things clearer.
Thanks again,
Fadi Maali 

>  
> With kind regards,
> Vasily Bunakov
> STFC e-Science
Received on Thursday, 19 April 2012 12:21:18 UTC