RE: Relationship of dcat:Dataset and void:Dataset from Pano Maria on 2017-03-15 (public-lod@w3.org from March 2017)

From: Pano Maria <pano.maria@taxonic.com>
Date: Wed, 15 Mar 2017 10:59:45 +0000
To: Markus Freudenberg <markus.freudenberg@gmail.com>, "Gray, Alasdair J G" <A.J.G.Gray@hw.ac.uk>
CC: John Erickson <olyerickson@gmail.com>, John Walker <john.walker@semaku.com>, "public-lod@w3.org" <public-lod@w3.org>, "public-dwbp-wg@w3.org" <public-dwbp-wg@w3.org>
Message-ID: <DB5PR07MB1351AD10B1B06B1275F7BCA19B270@DB5PR07MB1351.eurprd07.prod.outlook.com>
I am one of John Walker's colleagues, and as John says we've been having some interesting discussions on this topic. I'm partial to the first option he presents, as our situation is similar to the situation that Alasdair described.

As an example:
We have a collection of data pertaining to the addresses and buildings in the Netherlands that is distributed in many different ways: WFS, WMS, GML data dump, etc. Our linked data version of this collection is actually created by transforming one of these sources to linked data and subsequently exposing the result via a SPARQL endpoint and REST APIs.

In my view a dcat:Dataset is the *abstract* representation of some collection of data. That is, I can say stuff about this dataset in an abstract sense, like who the curator is, what the accrual periodicity is, what the spatial extent is, when it was last updated, etc., without this collection of data having to have any specific concrete form. This also fits well with the situation that Alasdair describes, and the above example.
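
To make this concrete, a minimal Turtle sketch of such an abstract description might look as follows. The ex: namespace, the resource names, the frequency value and the date are placeholders chosen for illustration, not taken from our actual catalogue:

```turtle
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dct:  <http://purl.org/dc/terms/> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .
@prefix ex:   <http://example.org/> .

# Abstract dataset: none of these statements depends on a concrete form.
ex:addresses-and-buildings a dcat:Dataset ;
  dct:publisher ex:curator ;                                  # hypothetical agent
  dct:accrualPeriodicity <http://purl.org/cld/freq/monthly> ; # assumed frequency
  dct:spatial ex:netherlands ;                                # hypothetical place resource
  dct:modified "2017-03-15"^^xsd:date .                       # placeholder date
```

Nothing in this description says whether the data is served as WFS, GML or RDF; that information would live elsewhere.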

In my opinion one resource should therefore not be an instance of both dcat:Dataset and void:Dataset, since if we consider the definition of void:Dataset: "A dataset is a set of RDF triples that are published, maintained or aggregated by a single provider" [1], we're clearly not describing an abstract collection of data.

Now, where I agree it all becomes a bit muddy is when you think of a set of RDF triples, i.e. a void:Dataset, being distributable in several different ways (SPARQL endpoint, data dump, LDP API, etc.). What does that make our void:Dataset? Note that the properties of the abstract dataset are still relevant to describe all these different forms of the collection of data.

So, maybe what we are missing is a way to distinguish an expression of an (abstract) collection of data from the distribution of that expression, analogous to the `work -> expression -> manifestation -> item` chain that the FRBR model [2] uses. That would lead to a dcat:Dataset representing the abstract dataset, one or more expressions of this dataset (e.g. an RDF expression, and thus a void:Dataset), and dcat:Distributions of those expressions.
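
A rough Turtle sketch of that layering, purely as an illustration: neither DCAT nor VoID defines a property for linking a dataset to an expression, so ex:hasExpression and ex:hasDistribution below are hypothetical, as are all ex: resource names and URLs. dcat:distribution is deliberately not used on the expression, since its domain is dcat:Dataset:

```turtle
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix void: <http://rdfs.org/ns/void#> .
@prefix ex:   <http://example.org/> .

# Abstract dataset (roughly the FRBR "work" level).
ex:dataset a dcat:Dataset ;
  ex:hasExpression ex:rdf-expression .        # hypothetical property

# RDF expression of that dataset (roughly the FRBR "expression" level).
ex:rdf-expression a void:Dataset ;
  ex:hasDistribution ex:sparql-distribution , # hypothetical property
                     ex:turtle-dump .

# Concrete distributions of the RDF expression.
ex:sparql-distribution a dcat:Distribution ;
  dcat:accessURL <http://example.org/sparql> .

ex:turtle-dump a dcat:Distribution ;
  dcat:downloadURL <http://example.org/dataset.ttl> ;
  dcat:mediaType "text/turtle" .
```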

The downside is that it becomes quite philosophical...

As it stands currently, I'm still inclined to consider a void:Dataset a better match with dcat:Distribution than with dcat:Dataset, because of the need to use dcat:Dataset in an expression-independent way.

Kind regards,

Pano Maria

[1] https://www.w3.org/TR/void/

[2] http://www.sparontologies.net/ontologies/fabio


From: Markus Freudenberg [mailto:markus.freudenberg@gmail.com]
Sent: Wednesday, 15 March 2017 11:22
To: Gray, Alasdair J G
CC: John Erickson; John Walker; public-lod@w3.org; public-dwbp-wg@w3.org
Subject: Re: Relationship of dcat:Dataset and void:Dataset

We had a very similar discussion about how to marry DCAT with VOID (and what to do with void:Dataset) for DataID <http://dataid.dbpedia.org/ns/core.html>.

In the end, we decided to define dataid:Dataset as a subclass of both dcat:Dataset and void:Dataset, for the following reasons:

1. Their similar definitions:

    void:Dataset "[...] we think of a dataset as a meaningful collection of triples, that deal with a certain topic, originate from a certain source or process, are hosted on a certain server, or are aggregated by a certain custodian." [1]
    dcat:Dataset "[...] collection of data, published or curated by a single agent, and available for access or download in one or more formats." [2]

It appears that everything stated about a dcat:Dataset also holds for a void:Dataset (including the possibility of different formats).

2. The similarities between dcat:CatalogRecord and void:DatasetDescription:

Both provide some form of metadata about a dataset. Both use foaf:topic / foaf:primaryTopic to point out the (Dataset) entity of interest.
When combining DCAT and VOID using the first option, a dcat:CatalogRecord would reference a dcat:Dataset, while a void:DatasetDescription would reference a dcat:Distribution.

3. void:subset

Points out a subset of a void:Dataset. If a void:Dataset is also considered a dcat:Distribution, one would have to deal with the notion of 'sub-distributions',
which is a point of contention (as far as I remember the discussion at SDSVoc). We would rather use this property with DataID to provide the missing hierarchical pointers between datasets.

4. The definition of dcat:Distribution

    dcat:Distribution: "Represents a specific available form of a dataset."

The definition of a void:Dataset is different, since it only narrows the available formats of a dataset to RDF, not to a specific serialization. Also, no VOID properties offer further clarification on the 'specific available format' of the dataset.

VOID properties like:
void:classes | void:distinctObjects | void:distinctSubjects | void:documents | void:entities | void:properties | void:property | void:propertyPartition | void:triples | void:vocabulary etc. (all defined at <http://vocab.deri.ie/void>)
are all characteristics of a dataset and not just a single distribution, in my understanding.

These were our main reasons to combine dcat:Dataset and void:Dataset into dataid:Dataset.
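
As a minimal sketch of the modelling described above (an illustration only, not copied from the DataID ontology): the dataid: namespace IRI is assumed from the documentation URL, and the ex: resources and the triple count are placeholders.

```turtle
@prefix rdfs:   <http://www.w3.org/2000/01/rdf-schema#> .
@prefix dcat:   <http://www.w3.org/ns/dcat#> .
@prefix void:   <http://rdfs.org/ns/void#> .
@prefix dataid: <http://dataid.dbpedia.org/ns/core#> .  # assumed namespace IRI
@prefix ex:     <http://example.org/> .

# One class that inherits from both vocabularies.
dataid:Dataset rdfs:subClassOf dcat:Dataset , void:Dataset .

# VOID statistics and void:subset are stated on the dataset itself,
# so no notion of 'sub-distributions' is needed.
ex:parent a dataid:Dataset ;
  void:triples 1000 ;                 # placeholder count
  void:subset ex:child ;
  dcat:distribution ex:parent-dump .

ex:child a dataid:Dataset .

ex:parent-dump a dcat:Distribution ;
  dcat:mediaType "text/turtle" .
```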

Markus Freudenberg

Release Manager, DBpedia <http://wiki.dbpedia.org>

On Tue, Mar 14, 2017 at 5:10 PM, Gray, Alasdair J G <A.J.G.Gray@hw.ac.uk> wrote:
When we were considering this in the Health Care and Life Sciences Community Profile [1] we took the view that the RDF representation was one of several possible distributions for a dataset and that it would be incorrect to associate that distribution information with the notion of the dataset itself. That is, we took the first approach proposed by John.

We specifically did this as not all HCLS datasets are made available in RDF and we did not want to make incorrect inferences.

Best regards,

Alasdair

[1] https://www.w3.org/TR/hcls-dataset/


On 14 Mar 2017, at 14:18, John Erickson <olyerickson@gmail.com> wrote:

John makes a great argument for the second approach. That is how we
tend to think of it.

As with most DCAT-related questions, start with "DCAT is like 'Dublin
Core' for datasets." In other words, general purpose, good for
starters, accommodates refinements...

John

On Tue, Mar 14, 2017 at 9:59 AM, John Walker <john.walker@semaku.com> wrote:

Hello,

Following discussion with colleagues, I would like to ask for opinions on
the semantics of dcat:Dataset and void:Dataset.

We have two points of view.

First, the RDF version of a dcat:Dataset is a dcat:Distribution of that
dataset and is itself a void:Dataset.
That could be represented as follows:

@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix void: <http://rdfs.org/ns/void#> .

<my-dataset> a dcat:Dataset ;
  dcat:distribution <my-rdf-dataset> ;
  .

<my-rdf-dataset> a dcat:Distribution , void:Dataset ;
  void:sparqlEndpoint <sparql> ;
  void:dataDump <my-dataset.rdf>, <my-dataset.ttl> ;
  .

Second, a dcat:Dataset that is available as RDF (and possibly other
forms) is also a void:Dataset.
Or to put it another way: void:Dataset rdfs:subClassOf dcat:Dataset.
That could be represented as follows:

<my-dataset> a dcat:Dataset, void:Dataset ;
  dcat:distribution <my-sparql-distribution>, <my-rdfxml-distribution>,
    <my-turtle-distribution> ;
  void:sparqlEndpoint <sparql> ;
  void:dataDump <my-dataset.rdf>, <my-dataset.ttl> ;
  .

<my-sparql-distribution> a dcat:Distribution ;
  dcat:accessURL <sparql> ;
  .

<my-rdfxml-distribution> a dcat:Distribution ;
  dcat:downloadURL <my-dataset.rdf> ;
  dcat:mediaType "application/rdf+xml" ;
  .

<my-turtle-distribution> a dcat:Distribution ;
  dcat:downloadURL <my-dataset.ttl> ;
  dcat:mediaType "text/turtle" ;
  .

I’m trying to keep an open mind, but leaning towards the second method as
thinking of the SPARQL endpoint, dumps and crawlable linked data (plus other
forms such as an API or WFS endpoint) as different distributions of the same
dataset seems to fit better with the spirit of DCAT (at least to my
interpretation of the recommendation).

Thoughts welcome!

Regards,

John

--
John S. Erickson, Ph.D.
Director of Operations, The Rensselaer IDEA
Deputy Director, Web Science Research Center (RPI)
<http://idea.rpi.edu/> <olyerickson@gmail.com>
Twitter & Skype: olyerickson

Alasdair J G Gray
Fellow of the Higher Education Academy
Assistant Professor in Computer Science,
School of Mathematical and Computer Sciences
(Athena SWAN Bronze Award)
Heriot-Watt University, Edinburgh UK.

Email: A.J.G.Gray@hw.ac.uk
Web: http://www.macs.hw.ac.uk/~ajg33

ORCID: http://orcid.org/0000-0002-5711-4872

Office: Earl Mountbatten Building 1.39
Twitter: @gray_alasdair

Received on Wednesday, 15 March 2017 11:00:33 UTC