RE: Relating versions and UC47 (Define update method)

I'm trying not to get sucked into the versioning discussion, but feel the need to draw attention to this work from the Research Data Alliance, which two years ago developed guidelines on a very closely related topic: citation of dynamic datasets, i.e. how to identify a particular state of a dataset that is being continuously updated. The main link is here
https://www.rd-alliance.org/group/data-citation-wg/outcomes/data-citation-recommendation.html  and there is a longer paper here: 
https://www.rd-alliance.org/system/files/documents/TCDL-RDA-Guidelines_160411.pdf 

Seems to me that the notion of 'version' is usually a publisher's choice to assign a memorable identifier to a product, which may have gone through many intermediate changes since the last 'version'. Version control systems talk about 'tags' and 'releases', which usually mark points along a more-or-less continuous development path. Criteria for versions will vary depending on the application. There is no way we can be prescriptive on this, except for the requirement for transparency from the publisher, so perhaps the focus should be on a framework that enables a publisher to describe their criteria, along with the various concerns that apply. 

The key concern of the RDA work was to support the retrieval of any previous state (though not necessarily instantaneously). 
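
To make "retrieval of any previous state" concrete, here is a minimal Python sketch of one way it could work: every change is logged with a timestamp, and a past state is rebuilt by replaying the log up to the requested time. This is my own illustration, not code from the RDA guidelines; the names TimestampedStore, put, delete and state_at are hypothetical, and integer timestamps are assumed for simplicity.

```python
from typing import Any, Dict, List, Tuple


class TimestampedStore:
    """Hypothetical append-only change log keyed by timestamp."""

    def __init__(self) -> None:
        # Each entry is (timestamp, operation, key, value).
        self._log: List[Tuple[int, str, str, Any]] = []

    def _append(self, ts: int, op: str, key: str, value: Any) -> None:
        # Assumption: changes arrive in non-decreasing timestamp order.
        if self._log and ts < self._log[-1][0]:
            raise ValueError("timestamps must not go backwards")
        self._log.append((ts, op, key, value))

    def put(self, ts: int, key: str, value: Any) -> None:
        """Record that `key` was added or changed at time `ts`."""
        self._append(ts, "put", key, value)

    def delete(self, ts: int, key: str) -> None:
        """Record that `key` was removed at time `ts`."""
        self._append(ts, "delete", key, None)

    def state_at(self, ts: int) -> Dict[str, Any]:
        """Rebuild the dataset as it stood at time `ts` by replaying the log."""
        state: Dict[str, Any] = {}
        for entry_ts, op, key, value in self._log:
            if entry_ts > ts:
                break  # the log is ordered, so later entries can be skipped
            if op == "put":
                state[key] = value
            else:
                state.pop(key, None)
        return state
```

Replaying the log is slow for long histories, which fits the "not necessarily instantaneously" caveat: nothing is lost, but an old state may take a while to reconstruct.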

Simon 

-----Original Message-----
From: Karen Coyle [mailto:kcoyle@kcoyle.net] 
Sent: Wednesday, 27 September, 2017 03:39
To: public-dxwg-wg@w3.org
Subject: Relating versions and UC47 (Define update method)

Here's a (much) more coherent statement of something I started to say during the meeting yesterday, before I had my thoughts together.

I created use case 47[1] because I felt that there is an unspoken assumption behind the discussion of "versions" - which is that each version is a complete replacement for the previous one(s). That is how I read the statement about the version delta: "indicating the "type" of change (addition/removal/update of data etc.)"[2] The implied subject of that is a single dataset that has been changed. If that is the case, then we can use "version" in that way. However, there are other situations that are not captured by that definition but that will arise in practice.

The example I gave in use case 47 is one in which there is a master dataset, and additions and changes to that dataset are issued in transaction files. A transaction file will have a newer date (or some other sequential numbering), but it is not a "version" of the master file; instead, it must be applied to the master file to create a new master file.
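
As a rough illustration of that distinction, the Python sketch below treats the master dataset as a dictionary of records and a transaction file as a list of add/update/remove operations. The names Transaction and apply_transactions, and the record layout, are made up for the example and do not come from any existing system. The point is that the transaction list is not itself a dataset "version"; a new master only exists once the transactions have been applied.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Transaction:
    """One hypothetical entry in a transaction file."""
    op: str                        # "add", "update", or "remove"
    key: str                       # identifier of the record affected
    value: Optional[dict] = None   # new record content (unused for "remove")


def apply_transactions(master: Dict[str, dict],
                       transactions: List[Transaction]) -> Dict[str, dict]:
    """Return a new master dataset; the old master is left untouched."""
    new_master = dict(master)
    for tx in transactions:
        if tx.op == "add":
            if tx.key in new_master:
                raise KeyError(f"record {tx.key!r} already exists")
            new_master[tx.key] = tx.value
        elif tx.op == "update":
            if tx.key not in new_master:
                raise KeyError(f"record {tx.key!r} not found")
            new_master[tx.key] = tx.value
        elif tx.op == "remove":
            if tx.key not in new_master:
                raise KeyError(f"record {tx.key!r} not found")
            del new_master[tx.key]
        else:
            raise ValueError(f"unknown operation {tx.op!r}")
    return new_master
```

A consumer holding only the transaction file cannot use it on its own, which is why a description of the update method would need to say which master it applies to and in what order.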

This is only one kind of update. There are also sequential datasets that may or may not be stand-alone. That is analogous to the issues of a serial publication. This may include periodic datasets like census information - each new census provides new information, but would we call a later census file a version of an earlier one?

Use case 44 [3] (Identification of versioned datasets and subsets) is also related to this question because it addresses the part/whole relationship between datasets. Use case 32 [4] (Relationships between
datasets) has elements of this question as well, although it emphasizes the type of derivation or part/whole relationship.

It may be best to make a clear separation between versions of a dataset and related datasets that are not one-to-one replacements for another.
If nothing else, our definition of versions needs to make clear what types of relationships are included in the declaration that one dataset is a version of another. This is what I mainly find to be missing.

kc
[1] https://w3c.github.io/dxwg/ucr/#ID47

[2] https://lists.w3.org/Archives/Public/public-dxwg-wg/2017Sep/0051.html

[3] https://w3c.github.io/dxwg/ucr/#ID44

[4] https://w3c.github.io/dxwg/ucr/#ID32



--
Karen Coyle
kcoyle@kcoyle.net http://kcoyle.net

m: 1-510-435-8234 (Signal)
skype: kcoylenet/+1-510-984-3600

Received on Tuesday, 10 October 2017 03:32:22 UTC