RE: Relationship of dcat:Dataset and void:Dataset from John Walker on 2017-03-15 (public-lod@w3.org from March 2017)

From: John Walker <john.walker@semaku.com>
Date: Wed, 15 Mar 2017 14:31:30 +0000
To: Dave Reynolds <dave.e.reynolds@gmail.com>, "public-lod@w3.org" <public-lod@w3.org>
Message-ID: <AM4PR0201MB1730B6F08798B73EA3DDDFF49A270@AM4PR0201MB1730.eurprd02.prod.outlook.>
Thanks to Pano for the extra detail on our use case, and to everyone for their contributions :)

Perhaps another way to look at it is that the RDF dataset is derived from the non-RDF dataset.
One could then use PROV to describe the provenance in more detail if useful.
That would certainly make sense when the RDF is a 'lossy' conversion, i.e. when not all data from the source dataset is mapped into RDF.
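
A minimal sketch of how that derivation might be written down, assuming PROV-O terms alongside DCAT and VoID. The resource names (<my-source-dataset>, <my-rdf-dataset>, <my-conversion>) are hypothetical placeholders and not part of any published description:

```turtle
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix void: <http://rdfs.org/ns/void#> .
@prefix prov: <http://www.w3.org/ns/prov#> .

# Hypothetical non-RDF source dataset (e.g. published as a data dump).
<my-source-dataset> a dcat:Dataset , prov:Entity .

# Hypothetical RDF dataset obtained by converting the source.
<my-rdf-dataset> a void:Dataset , prov:Entity ;
    prov:wasDerivedFrom <my-source-dataset> ;
    prov:wasGeneratedBy <my-conversion> .

# Hypothetical conversion run; it may map only part of the source into RDF.
<my-conversion> a prov:Activity ;
    prov:used <my-source-dataset> .
```

Modelled this way, the two datasets stay separate resources, so statements about the RDF (such as a triple count) would not be read as statements about the source.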

When the non-RDF and RDF versions are informationally equivalent, I can see how they could be considered different forms of one dataset.

John

> -----Original Message-----
> From: Dave Reynolds [mailto:dave.e.reynolds@gmail.com]
> Sent: Wednesday, March 15, 2017 12:29 PM
> To: public-lod@w3.org
> Subject: Re: Relationship of dcat:Dataset and void:Dataset
> 
> For what it's worth, personally I agree with this analysis.
> 
> dcat:Dataset is best regarded as an abstract thing which then gets
> represented/expressed as an RDF graph or a set of RDBMS tables or
> whatever. Each of which can then be distributed/manifested in multiple
> different ways.
> 
> Hence for greatest cleanliness having something between a dcat:Dataset and
> the dcat:Distribution could make sense.
> 
> However, in practice it's likely to be a complication too far for most uses.
> 
> Fundamentally the notion of a "dataset" itself doesn't really work in a linked
> data world. Datasets have boundaries - what's in or out of the set. The point
> of linked data is to break down those boundaries.
> 
> Dave
> 
> On 15/03/17 10:59, Pano Maria wrote:
> > I am one of John Walker's colleagues, and as John says we've been
> > having some interesting discussions on this topic. I'm partial to the
> > first option he presents, as our situation is similar to the situation
> > that Alasdair described.
> >
> >
> >
> > As an example:
> >
> > We have a collection of data pertaining to the addresses and buildings
> > in the Netherlands that is distributed in many different ways: WFS,
> > WMS, GML data dump, etc. Our linked data version of this collection of
> > data is actually created by transforming one of these sources to
> > linked data and subsequently exposing this via a SPARQL endpoint and
> > REST APIs.
> >
> >
> >
> > In my view a dcat:Dataset is the *abstract* representation of some
> > collection of data. That is, I can say stuff about this dataset in an
> > abstract sense, like who the curator is, what the accrual periodicity
> > is, what the spatial extent is, when it was last updated, etc.,
> > without this collection of data having to have any specific concrete
> > form. This also fits well with the situation that Alasdair describes,
> > and the above example.
> >
> >
> >
> > In my opinion one resource should therefore not be an instance of both
> > dcat:Dataset and void:Dataset, since if we consider the definition of
> > void:Dataset: "A dataset is a set of RDF triples that are published,
> > maintained or aggregated by a single provider" [1], we're clearly not
> > describing an abstract collection of data.
> >
> >
> >
> > Now, where I agree it all becomes a bit muddy is when you think of a
> > set of RDF triples, i.e. a void:Dataset, being distributable in
> > several different ways (SPARQL endpoint, datadump, LDP API etc.) What
> > does that make our void:Dataset? Note that the properties of the
> > abstract dataset are still relevant to describe all these different
> > forms of the collection of data.
> >
> >
> >
> > So, maybe what we are missing is a way to distinguish an expression of
> > an (abstract) collection of data from the distribution of that
> > expression. Analogous to the `work -> expression -> manifestation ->
> > item` that the FRBR model [2] uses. That would lead to a dcat:Dataset
> > representing the abstract dataset, one or more expressions of this
> > dataset, e.g. an RDF expression and thus a void:Dataset, and
> > dcat:Distributions of those expressions.
> >
> >
> >
> > The downside is that it becomes quite philosophical...
> >
> >
> >
> > As it stands currently, I'm still inclined to consider a void:Dataset
> > a better match with dcat:Distribution than with dcat:Dataset, because
> > of the need to use dcat:Dataset in an expression-independent way.
> >
> >
> >
> > Kind regards,
> >
> >
> >
> > Pano Maria
> >
> >
> >
> > [1] https://www.w3.org/TR/void/
> >
> > [2] http://www.sparontologies.net/ontologies/fabio
> >
> >
> >
> > *From:* Markus Freudenberg [mailto:markus.freudenberg@gmail.com]
> > *Sent:* Wednesday 15 March 2017 11:22
> > *To:* Gray, Alasdair J G
> > *CC:* John Erickson; John Walker; public-lod@w3.org;
> > public-dwbp-wg@w3.org
> > *Subject:* Re: Relationship of dcat:Dataset and void:Dataset
> >
> >
> >
> > We had a very similar discussion about how to marry DCAT with VOID
> > (and what to do with void:Dataset) for DataID
> > <http://dataid.dbpedia.org/ns/core.html>.
> >
> >
> >
> > In the end, we decided to define dataid:Dataset as sub of dcat:Dataset
> > and void:Dataset for the following reasons:
> >
> >
> >
> > 1. Their similar definitions:
> >
> >
> >
> >     void:Dataset "[...] we think of a dataset as a meaningful
> > collection of triples, that deal with a certain topic, originate from
> > a certain source or process, are hosted on a certain server, or are
> > aggregated by a certain custodian." [1]
> >
> >     dcat:Dataset "[...] collection of data, published or curated by a
> > single agent, and available for access or download in one or more
> > formats." [2]
> >
> >
> >
> > It appears that all of what is stated about a dcat:Dataset is true for a
> > void:Dataset (including the possibility of different formats).
> >
> >
> >
> > 2. The similarities between dcat:CatalogRecord and
> > void:DatasetDescription:
> >
> >
> >
> > Both provide some form of metadata about a dataset. Both are using
> > foaf:topic / foaf:primaryTopic to point out the (Dataset) entity of interest.
> >
> > When combining DCAT and VOID using the first option, a
> > dcat:CatalogRecord would reference a dcat:Dataset, while a
> > void:DatasetDescription would reference a dcat:Distribution.
> >
> >
> >
> > 3. void:subset
> >
> >
> >
> > Points out a subset of a void:Dataset. If a void:Dataset is also
> > considered a dcat:Distribution, one would have to deal with the notion
> > of 'sub-distributions', which is a point of contention (as far as I
> > remember the discussion at SDSVoc).
> >
> > We would rather use this property with DataID to provide the
> > missing hierarchical pointers between datasets.
> >
> >
> >
> > 4.  The definition of dcat:Distribution
> >
> >
> >
> >     dcat:Distribution: "Represents a specific available form of a dataset."
> >
> >
> >
> > The definition of a void:Dataset is different since it only narrows
> > the available formats of a dataset to RDF, not to a specific serialization.
> > Also, no VOID properties offer further clarification on the
> > 'specific available format' of the dataset.
> >
> >
> >
> > VOID Properties like:
> >
> > classes <http://vocab.deri.ie/void#classes> | distinctObjects
> > <http://vocab.deri.ie/void#distinctObjects> | distinctSubjects
> > <http://vocab.deri.ie/void#distinctSubjects> | documents
> > <http://vocab.deri.ie/void#documents> | entities
> > <http://vocab.deri.ie/void#entities> | properties
> > <http://vocab.deri.ie/void#properties> | property
> > <http://vocab.deri.ie/void#property> | propertyPartition
> > <http://vocab.deri.ie/void#propertyPartition> | triples
> > <http://vocab.deri.ie/void#triples> | vocabulary
> > <http://vocab.deri.ie/void#vocabulary> etc.
> >
> > are all characteristics of a dataset and not just a single
> > distribution, in my understanding.
> >
> >
> >
> > These were our main reasons to combine dcat:Dataset and void:Dataset
> > into dataid:Dataset.
> >
> >
> > Markus Freudenberg
> >
> >
> >
> > Release Manager, DBpedia <http://wiki.dbpedia.org>
> >
> >
> >
> > On Tue, Mar 14, 2017 at 5:10 PM, Gray, Alasdair J G
> > <A.J.G.Gray@hw.ac.uk <mailto:A.J.G.Gray@hw.ac.uk>> wrote:
> >
> >     When we were considering this in the Health Care and Life Sciences
> >     Community Profile [1] we took the view that the RDF representation
> >     was one of several possible distributions for a dataset and that it
> >     would be incorrect to associate that distribution information with
> >     the notion of the dataset itself. That is, we took the first
> >     approach proposed by John.
> >
> >
> >
> >     We specifically did this as not all HCLS datasets are made available
> >     in RDF and we did not want to make incorrect inferences.
> >
> >
> >
> >     Best regards,
> >
> >
> >
> >     Alasdair
> >
> >
> >
> >     [1] https://www.w3.org/TR/hcls-dataset/
> >
> >
> >
> >         On 14 Mar 2017, at 14:18, John Erickson <olyerickson@gmail.com
> >         <mailto:olyerickson@gmail.com>> wrote:
> >
> >
> >
> >         John makes a great argument for the second approach. That is how we
> >         tend to think of it.
> >
> >         As with most DCAT-related questions, start with "DCAT is like
> >         'Dublin Core' for datasets." In other words, general purpose,
> >         good for starters, accommodates refinements...
> >
> >         John
> >
> >         On Tue, Mar 14, 2017 at 9:59 AM, John Walker
> >         <john.walker@semaku.com <mailto:john.walker@semaku.com>> wrote:
> >
> >             Hello,
> >
> >
> >
> >             Following discussion with colleagues, I would like to ask
> >             for opinions on the semantics of dcat:Dataset and void:Dataset.
> >
> >
> >
> >             We have two points of view.
> >
> >
> >
> >             First, the RDF version of a dcat:Dataset is a
> >             dcat:Distribution of that dataset and is itself a void:Dataset.
> >
> >             That could be represented as follows:
> >
> >
> >
> >             <my-dataset> a dcat:Dataset ;
> >              dcat:distribution <my-rdf-dataset> ;
> >              .
> >
> >             <my-rdf-dataset> a dcat:Distribution , void:Dataset ;
> >              void:sparqlEndpoint <sparql> ;
> >              void:dataDump <my-dataset.rdf>, <my-dataset.ttl> ;
> >              .
> >
> >
> >
> >             Second, a dcat:Dataset that is available as RDF (and
> >             possibly other forms) is also a void:Dataset.
> >
> >             Or to put it another way: void:Dataset rdfs:subClassOf
> >             dcat:Dataset.
> >
> >             That could be represented as follows:
> >
> >
> >
> >             <my-dataset> a dcat:Dataset, void:Dataset ;
> >              dcat:distribution <my-sparql-distribution>,
> >                <my-rdfxml-distribution>,
> >                <my-turtle-distribution> ;
> >              void:sparqlEndpoint <sparql> ;
> >              void:dataDump <my-dataset.rdf>, <my-dataset.ttl> ;
> >              .
> >
> >             <my-sparql-distribution> a dcat:Distribution ;
> >              dcat:accessURL <sparql> ;
> >              .
> >
> >             <my-rdfxml-distribution> a dcat:Distribution ;
> >              dcat:downloadURL <my-dataset.rdf> ;
> >              dcat:mediaType "application/rdf+xml"
> >              .
> >
> >             <my-turtle-distribution> a dcat:Distribution ;
> >              dcat:downloadURL <my-dataset.ttl> ;
> >              dcat:mediaType "text/turtle"
> >              .
> >
> >
> >
> >             I’m trying to keep an open mind, but I am leaning towards the
> >             second method, as thinking of the SPARQL endpoint, dumps and
> >             crawlable linked data (plus other forms such as an API or WFS
> >             endpoint) as different distributions of the same dataset seems
> >             to fit better with the spirit of DCAT (at least in my
> >             interpretation of the recommendation).
> >
> >
> >
> >             Thoughts welcome!
> >
> >
> >
> >             Regards,
> >
> >             John
> >
> >
> >
> >
> >         --
> >         John S. Erickson, Ph.D.
> >         Director of Operations, The Rensselaer IDEA
> >         Deputy Director, Web Science Research Center (RPI)
> >         <http://idea.rpi.edu/> <olyerickson@gmail.com
> >         <mailto:olyerickson@gmail.com>>
> >         Twitter & Skype: olyerickson
> >
> >
> >
> >     Alasdair J G Gray
> >
> >     Fellow of the Higher Education Academy
> >     Assistant Professor in Computer Science,
> >     School of Mathematical and Computer Sciences
> >     (Athena SWAN Bronze Award)
> >     Heriot-Watt University, Edinburgh UK.
> >
> >     Email: A.J.G.Gray@hw.ac.uk <mailto:A.J.G.Gray@hw.ac.uk>
> >     Web: http://www.macs.hw.ac.uk/~ajg33
> >     ORCID: http://orcid.org/0000-0002-5711-4872
> >     Office: Earl Mountbatten Building 1.39
> >     Twitter: @gray_alasdair
Received on Wednesday, 15 March 2017 14:32:07 UTC