RE: reviewing the BP doc from Makx Dekkers on 2015-06-19 (public-dwbp-wg@w3.org from June 2015)

From: Makx Dekkers <mail@makxdekkers.com>
Date: Fri, 19 Jun 2015 11:41:24 +0200
To: "'Data on the Web Best Practices Working Group'" <public-dwbp-wg@w3.org>
Message-ID: <000001d0aa74$1c330ba0$549922e0$@makxdekkers.com>
Just on the issue of data versioning:

> 
> * Data Versioning
> The chart describes time series data, not versions of data. I would say
> that, if released independently, the items in yellow each represent a
> different dataset (they report different data points), not a different
> version. If you revised any of them, then the original and the revision
> would be different versions. I think by definition, versions attempt to
> report the same data.
> 

As I said in last week's call, this is related to the more general issue of
relationships between data files.

A first issue to tackle is what the word 'version' means. In work that I
have been doing, it was more or less agreed that a set of data points is a
'version' of another set if the two sets share a lot of data, e.g. where
version 2 has the same data as version 1 but with corrections or additions.
I think this is also Annette's perspective. An example is a Year-to-Date
(YTD) file that is updated every month. In some cases, the producer of that
data may need to keep the individual snapshots available (e.g. for
auditing); in other cases it may not be necessary to keep track of changes,
and the file is then considered to be the "same" data, with just the date of
last modification being updated.

I understand that PAV has a wider definition in which 'versions' can also be
the same kind of data for different periods (time series) or spatial areas
(spatial series). In the PAV approach, the public budget 2013, public budget
2014 and public budget 2015 are all 'versions' of something that can be
called "Annual budgets". In this approach, the semantics of the word
'version' is different.

The second issue is how such approaches map to the DCAT model. DCAT makes a
distinction between the 'conceptual' Dataset and the 'physical'
Distribution. DCAT (http://www.w3.org/TR/vocab-dcat/) is completely silent
about how to model these types of relationships. It can be argued that the
concept of Dataset is very flexible and is not necessarily restricted to the
"same" data. Various approaches can be proposed:

*	Monthly snapshots of the YTD file could be modelled as Distributions
of the same Dataset. If you treat the YTD file as an unversioned file, the
Dataset will have just one Distribution; if you keep snapshots, the Dataset
will have multiple time-stamped Distributions. But of course, you could also
model the snapshots as different Datasets, depending on whether the
publisher wants to give direct access to the snapshots as opposed to
accessing the set of snapshots as a group.
		
*	Also, the yearly budgets example could be modelled either as
Distributions of a single "Annual budgets" Dataset, or as separate Datasets.
This also depends on the access that the publisher wants to give: either
facilitating access to the whole set of budgets, or to the individual years.
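
To make the two modelling options above concrete, here is a minimal sketch
in Python that writes each option out as plain subject/predicate/object
tuples. It is only an illustration: the ex: identifiers, the "-dataset"
suffix and the helper function names are made up, and the use of dct:issued
as the time stamp is an assumption. Only dcat:Dataset, dcat:Distribution and
dcat:distribution come from DCAT itself.

```python
# Illustrative sketch only. All ex: identifiers and function names are
# hypothetical; dct:issued as the snapshot time stamp is an assumption.

def snapshots_as_distributions(dataset, snapshots):
    """Option 1: one Dataset with one time-stamped Distribution per snapshot."""
    triples = [(dataset, "rdf:type", "dcat:Dataset")]
    for name, issued in snapshots:
        triples.append((dataset, "dcat:distribution", name))
        triples.append((name, "rdf:type", "dcat:Distribution"))
        triples.append((name, "dct:issued", issued))
    return triples


def snapshots_as_datasets(snapshots):
    """Option 2: one Dataset per snapshot, each with a single Distribution."""
    triples = []
    for name, issued in snapshots:
        dataset = name + "-dataset"  # hypothetical naming scheme
        triples.append((dataset, "rdf:type", "dcat:Dataset"))
        triples.append((dataset, "dct:issued", issued))
        triples.append((dataset, "dcat:distribution", name))
        triples.append((name, "rdf:type", "dcat:Distribution"))
    return triples


# Two monthly snapshots of a Year-to-Date file.
ytd = [("ex:ytd-2015-01", "2015-01-31"), ("ex:ytd-2015-02", "2015-02-28")]

grouped = snapshots_as_distributions("ex:ytd", ytd)
separate = snapshots_as_datasets(ytd)

for triple in grouped + separate:
    print(" ".join(triple))
```

In the first option a consumer finds all snapshots through a single Dataset;
in the second each snapshot can be found and described on its own, but
nothing in these triples ties the snapshots together.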

One of the issues with the diagram at
http://w3c.github.io/dwbp/bp.html#dataVersioning is that in DCAT there is
currently no property that links the blue dataset with the yellow ones;
someone proposed dct:hasPart rather than dct:hasVersion where the latter
would be a relationship between the yellow boxes. Someone else argued that
calling the blue one a Dataset stretches the semantics of Dataset because it
is essentially a grouping concept, not a conceptual view on a data file. The
alternative perspective is a diagram where the red boxes are directly
attached to the blue box, deleting the yellow level.
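
The two diagram variants can be sketched in the same way, again as plain
tuples in Python. This is a hedged illustration, not a proposal: the ex:
identifiers and function names are hypothetical, "group" stands for the blue
box, the member datasets for the yellow boxes and the distributions for the
red boxes. Pointing dct:hasVersion from the older to the newer yellow box is
an assumption; the direction was not settled in the discussion.

```python
# Illustrative sketch only. Identifiers and function names are hypothetical;
# the direction of dct:hasVersion (older -> newer) is an assumption.

def three_level(group, members):
    """Blue box linked to yellow boxes with dct:hasPart, yellow boxes linked
    to each other with dct:hasVersion, red boxes hanging off the yellow ones.

    members is a list of (dataset, distribution) pairs, oldest first.
    """
    triples = []
    for dataset, distribution in members:
        triples.append((group, "dct:hasPart", dataset))
        triples.append((dataset, "dcat:distribution", distribution))
    # Chain consecutive members: each one points at its successor.
    for (older, _), (newer, _) in zip(members, members[1:]):
        triples.append((older, "dct:hasVersion", newer))
    return triples


def two_level(group, members):
    """Alternative: red boxes attached directly to the blue box, with the
    yellow level removed."""
    return [(group, "dcat:distribution", dist) for _, dist in members]


budgets = [
    ("ex:budget-2013", "ex:budget-2013-csv"),
    ("ex:budget-2014", "ex:budget-2014-csv"),
    ("ex:budget-2015", "ex:budget-2015-csv"),
]

with_yellow = three_level("ex:annual-budgets", budgets)
without_yellow = two_level("ex:annual-budgets", budgets)

for triple in with_yellow + without_yellow:
    print(" ".join(triple))
```

The second variant needs no property beyond what DCAT already defines, at
the cost of losing the per-year Dataset descriptions.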

So, two issues that we would need to discuss:

*	What is the definition of 'version' in relation to the sameness or
similarity of the data, and in particular does the concept of 'version'
include time series and spatial series?
*	How are 'versions' mapped onto DCAT Dataset and Distribution?

If we go into this discussion, I think it is also necessary to look at the
issue from a practical perspective -- what do people do at the moment?
Looking at it from a theoretical perspective may lead to very long and
possibly unproductive discussions. And maybe we should end up with multiple
sets of "if this is your perspective, do it this way" rather than choosing
the 'right' way.

Makx.
Received on Friday, 19 June 2015 09:41:59 UTC