RE: Relating versions and UC47 (Define update method)

I also agree with Simon and Rob that we cannot be prescriptive about what a "version" is and how it is identified.

Restating Simon's point, I think we are dealing with a notion which, like that of "dataset", is used with different meanings by different communities, each of which knows exactly what a "version" is for its own purposes. Moreover, what a "version" is depends very much on the data management policy / workflow in place, and this affects how different versions of a dataset are modelled.

It might be useful to have a look at the discussion on this topic carried out in the DCAT-AP WG, which highlighted quite a few different perspectives, and where reaching an agreement turned out to be quite problematic. This issue was further discussed during the work on the implementation guidelines of DCAT-AP, and the result was not a definition of what is or is not a version, but rather an explanation of the different possible ways of modelling versions, based on implementation evidence. The summary is available here:

As you can read there, we have examples where different versions of a dataset are modelled with distributions, or as different datasets in a series, possibly in combination with a statement saying which is the previous / next version (by using dct:hasVersion / dct:isVersionOf, respectively). We also have to consider cases where datasets are updated (on a regular or irregular basis) but the old versions are not maintained (this frequently happens, e.g., for datasets updated daily).

I think the lesson learnt in DCAT-AP is that what users are looking for is:

1. Having guidance on how to model dataset versions (i.e., with different datasets, different distributions, etc.), based on evidence from similar use cases / domains. This requirement mainly applies to communities where the notion of dataset "version" is not established / clearly defined.

2. Having clear information on which are the relevant terms (classes, properties) in DCAT, and on how to use them. This requirement applies to all users.

About point (2), I take this opportunity to note some of the relevant terms, including some that I'm not sure have been mentioned so far in our discussion:

- dct:modified [1] and dct:accrualPeriodicity [2]: These properties provide implicit information about a dataset version, especially when combined with the issue and/or creation date, and can be used even when old versions are not maintained.

- About the issue raised by Rob about previous/next/current version, dct:hasVersion [3] and dct:isVersionOf [4] are actually meant to model exactly previous / next versions. Moreover, there are also adms:prev [5] and adms:next [6], plus adms:last [7] for the latest version (@Rob, I'm not sure whether by "current" version you actually mean this).
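
As a rough illustration of how the adms:prev / adms:next / adms:last links could be followed by a consumer, here is a minimal Python sketch. It is only an assumption-laden example: the ex: dataset URIs and the helper names (objects, history) are hypothetical, and the triples are kept in a plain list rather than in a real RDF store.

```python
# Triples as (subject, predicate, object). All ex: URIs are hypothetical.
TRIPLES = [
    ("ex:ds-v2", "adms:prev", "ex:ds-v1"),
    ("ex:ds-v3", "adms:prev", "ex:ds-v2"),
    ("ex:ds-v1", "adms:next", "ex:ds-v2"),
    ("ex:ds-v2", "adms:next", "ex:ds-v3"),
    ("ex:ds-v1", "adms:last", "ex:ds-v3"),
    ("ex:ds-v2", "adms:last", "ex:ds-v3"),
]


def objects(subject, predicate):
    """Return all objects of the triples matching subject and predicate."""
    return [o for s, p, o in TRIPLES if s == subject and p == predicate]


def history(dataset):
    """Follow adms:prev links from a dataset back to its earliest version."""
    chain = [dataset]
    while True:
        previous = objects(chain[-1], "adms:prev")
        # Stop at the first version, or if the links loop back on themselves.
        if not previous or previous[0] in chain:
            break
        chain.append(previous[0])
    return chain


# Starting from any version, adms:last points at the latest one.
latest = objects("ex:ds-v1", "adms:last")
```

Note that this only works when old versions are kept and described; for datasets that are overwritten on update, dct:modified and dct:accrualPeriodicity are the only hints available.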

Andrea Perego, Ph.D.
Scientific / Technical Project Officer
European Commission DG JRC
Directorate B - Growth and Innovation
Unit B6 - Digital Economy
Via E. Fermi, 2749 - TP 262
21027 Ispra VA, Italy

The views expressed are purely those of the writer and may
not in any circumstances be regarded as stating an official
position of the European Commission.

From: Rob Atkinson
Sent: Tuesday, October 10, 2017 6:19 AM
Subject: Re: Relating versions and UC47 (Define update method)

+1  We cannot be prescriptive about what constitutes a version, nor how a version identifier is represented.

What we can be prescriptive about is how versions are identified - i.e. the names of the DCAT properties that refer to the version of a DCAT Dataset description, of the dataset described by that description, and of a DCAT Distribution.

We can also require that identifiers are lexically comparable, so that if A is lexically > B then the version denoted by A is later than the version denoted by B (and if A = B then the versions are the same).
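
To make the lexical-comparability requirement concrete, here is a minimal Python sketch. The identifier values and the is_later helper are hypothetical; the point is only that some identifier schemes (ISO 8601 dates, zero-padded numbers) order correctly as plain strings, while others (unpadded numbers) do not.

```python
def is_later(a: str, b: str) -> bool:
    """True if identifier a denotes a later version than b, assuming the
    publisher's identifiers were designed to be lexically comparable."""
    return a > b


# ISO 8601 dates order correctly when compared as plain strings.
sorted_dates = sorted(["2017-10-10", "2017-09-27", "2016-12-31"])

# Unpadded numbers do not: "10" sorts before "2" and "9".
unpadded = sorted(["9", "10", "2"])

# Zero-padding to a fixed width restores the expected order.
padded = sorted(v.zfill(4) for v in ["9", "10", "2"])
```

So a lexical-comparability requirement would, in effect, also constrain the identifier syntax a publisher can choose.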

If a version designator is a URI, it could dereference to a "model" - however DCAT profiles could use third party vocabularies to define properties for such models, and have a simple string property in DCAT core.

We probably need special properties in DCAT to handle "previous/next/current version" problems.

Which leaves open whether we need another special property to indicate the type of version, and a set of defined literals for common cases.

Any statistics about change should be through a dereferenceable version model, defined by the application domain.

<descends into solution space...>

IMHO it's important we have one consistent pattern for these types of situations where we promote some special semantics to DCAT properties, but also want to use DCAT classes to act as subjects for discovery of domain-specific properties.

The pattern seems to be a combination of simple DataProperties for DCAT core properties, and extension points using defined ObjectProperties whose type is controlled by domain profiles. Such ObjectProperties may be canonically defined in DCAT, or external vocabularies also defined by domain profiles. Do we want a simple pattern:

dcat:prop a owl:DatatypeProperty

dcat:propLink a owl:ObjectProperty

Rob Atkinson

On Tue, 10 Oct 2017 at 14:31 <<>> wrote:
I'm trying to not get sucked into the versioning discussion, but feel the need to draw attention to this work from Research Data Alliance, who two years ago developed guidelines on a very closely related topic - citation of dynamic datasets - i.e. how to identify a particular state of a dataset that is being continuously updated. The main link is here  and there is a longer paper here:

Seems to me that the notion of 'version' is usually a publisher's choice to assign a memorable identifier to a product, which may have many more intermediate changes from the last 'version'. Version control systems talk about 'tags' and 'releases' which are usually along a more-or-less continuous development path. Criteria for versions will vary depending on the application. There is no way we can be prescriptive on this, except for the requirement for transparency from the publisher, so perhaps the focus should be on a framework for enabling a publisher to describe their criteria, with the various concerns that apply.

The key concern of the RDA work was to support the retrieval of any previous state (though not necessarily instantaneously).


-----Original Message-----
From: Karen Coyle
Sent: Wednesday, 27 September, 2017 03:39
Subject: Relating versions and UC47 (Define update method)

Here's a (much) more coherent statement of something I started to say during the meeting yesterday but didn't have my thoughts together.

I created use case 47 [1] because I felt that there is an unspoken assumption behind the discussion of "versions" - which is that each version is a complete replacement for the previous one(s). That is how I read the statement about the version delta: "indicating the "type" of change (addition/removal/update of data etc.)" [2] The implied subject of that is a single dataset that has been changed. If that is the case, then we can use "version" in that way. However, there are other situations that are not captured by that definition but that will arise in practice.

The example I gave in use case 47 is one in which there is a master dataset, and additions and changes to that dataset are issued in transaction files. A transaction file will have a newer date (or some other sequential numbering), but it is not a "version" of the master file; instead, it must be applied to the master file to create a new master file.
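
A minimal Python sketch of the master / transaction-file relationship described above may help. Everything in it is an assumption made for illustration: the record keys, the (operation, key, value) transaction format, and the apply_transactions helper are hypothetical, not taken from use case 47.

```python
def apply_transactions(master: dict, transactions: list) -> dict:
    """Return a new master dataset with the transactions applied in order."""
    result = dict(master)  # copy, so the earlier master is left untouched
    for op, key, value in transactions:
        if op in ("add", "update"):
            result[key] = value
        elif op == "remove":
            result.pop(key, None)
        else:
            raise ValueError(f"unknown operation: {op}")
    return result


# Hypothetical master file and a later transaction file.
master_2017_09 = {"r1": "alpha", "r2": "beta"}
transactions_2017_10 = [
    ("add", "r3", "gamma"),
    ("update", "r1", "alpha-revised"),
    ("remove", "r2", None),
]

# The transaction file is not itself a version; applying it yields one.
master_2017_10 = apply_transactions(master_2017_09, transactions_2017_10)
```

Here master_2017_10 could reasonably be called a version of master_2017_09, whereas transactions_2017_10 has a different relationship to both of them.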

This is only one kind of update. There are also sequential datasets that may or may not be stand-alone. That is analogous to the issues of a serial publication. This may include periodic datasets like census information - each new census provides new information, but would we call a later census file a version of an earlier one?

Use case 44 [3] (Identification of versioned datasets and subsets) is also related to this question because it addresses the part/whole relationship between datasets. Use case 32 [4] (Relationships between datasets) has elements of this question as well, although it emphasizes the type of derivation or part/whole relationship.

It may be best to make a clear separation between versions of a dataset and related datasets that are not one-to-one replacements for another. If nothing else, our definition of versions needs to make clear what types of relationships are included in the declaration that one dataset is a version of another. This is what I mainly find to be missing.





Karen Coyle

m: 1-510-435-8234 (Signal)
skype: kcoylenet/+1-510-984-3600

Received on Tuesday, 10 October 2017 08:14:30 UTC