- From: Makx Dekkers <mail@makxdekkers.com>
- Date: Fri, 19 Jun 2015 11:41:24 +0200
- To: "'Data on the Web Best Practices Working Group'" <public-dwbp-wg@w3.org>
- Message-ID: <000001d0aa74$1c330ba0$549922e0$@makxdekkers.com>
Just on the issue of data versioning:

> * Data Versioning
> The chart describes time series data, not versions of data. I would say that, if
> released independently, the items in yellow each represent a different
> dataset (they report different data points), not a different version. If you
> revised any of them, then the original and the revision would be different
> versions. I think by definition, versions attempt to report the same data.

As I said in last week's call, this is related to the more general issue of relationships between data files.

A first issue to tackle is what the word 'version' means.

In work that I have been doing it was sort of agreed that a set of data points is a 'version' of another set if the sets share a lot of data, e.g. where version 2 has the same data as version 1 but maybe with corrections or additions, which I think is what Annette's perspective is. An example is a Year-to-Date (YTD) file that is updated every month. In some cases, the producer of that data may need to keep the individual snapshots available (e.g. for auditing); in other cases it may not be necessary to keep track of changes, in which case it will be considered to be the "same" data (and then just the date of last modification is updated).

I understand that PAV has a wider definition in which 'versions' can also be the same kind of data for different periods (time series) or spatial areas (spatial series). In the PAV approach, the public budget 2013, public budget 2014 and public budget 2015 are all 'versions' of something that can be called "Annual budgets". In this approach, the semantics of the word 'version' is different.

The second issue is how such approaches map to the DCAT model.

DCAT makes a distinction between the 'conceptual' Dataset and the 'physical' Distribution. DCAT (http://www.w3.org/TR/vocab-dcat/) is completely silent about how to model these types of relationships.
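As a rough illustration of the Dataset/Distribution distinction applied to the YTD example above, here is a minimal Python sketch. The class names, the `publish_ytd` helper, the `keep_snapshots` flag and the example.org URLs are all hypothetical; this is only one possible mapping, not something DCAT prescribes.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List


@dataclass
class Distribution:
    """The 'physical' file (roughly a dcat:Distribution)."""
    url: str
    issued: date


@dataclass
class Dataset:
    """The 'conceptual' dataset (roughly a dcat:Dataset)."""
    title: str
    modified: date
    distributions: List[Distribution] = field(default_factory=list)


def publish_ytd(dataset: Dataset, url: str, on: date, keep_snapshots: bool) -> None:
    """Record one monthly update of a Year-to-Date file.

    keep_snapshots=True:  every update stays available as its own
                          time-stamped Distribution (e.g. for auditing).
    keep_snapshots=False: the update counts as the "same" data, so the
                          single Distribution is replaced and only the
                          date of last modification moves forward.
    """
    if not keep_snapshots:
        dataset.distributions.clear()
    dataset.distributions.append(Distribution(url=url, issued=on))
    dataset.modified = on


# Hypothetical example data; titles, URLs and dates are illustrative only.
audited = Dataset("Spending YTD 2015", modified=date(2015, 1, 1))
rolling = Dataset("Spending YTD 2015", modified=date(2015, 1, 1))
for month in (1, 2, 3):
    day = date(2015, month, 28)
    publish_ytd(audited, f"http://example.org/ytd-2015-{month:02d}.csv", day, keep_snapshots=True)
    publish_ytd(rolling, "http://example.org/ytd-2015.csv", day, keep_snapshots=False)
```

After the three updates, `audited` carries three time-stamped Distributions, while `rolling` carries a single Distribution and differs from its initial state only in its modification date.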
It can be argued that the concept of Dataset is very flexible and does not necessarily apply to the "same" data. Various approaches can be proposed:

* Monthly snapshots of the YTD file could be modelled as Distributions of the same Dataset. If you treat the YTD file as an unversioned file, the Dataset will have just one Distribution; if you keep snapshots, the Dataset will have multiple time-stamped Distributions. But of course, you could also model the snapshots as different Datasets, depending on whether the publisher wants to give direct access to the snapshots as opposed to accessing the set of snapshots as a group.

* Also, the yearly budgets example could be modelled either as Distributions of a single "Annual budgets" Dataset, or as separate Datasets. This also depends on the access that the publisher wants to give: either facilitating access to the whole set of budgets, or to the individual years.

One of the issues with the diagram at http://w3c.github.io/dwbp/bp.html#dataVersioning is that in DCAT there is currently no property that links the blue dataset with the yellow ones; someone proposed dct:hasPart rather than dct:hasVersion where the latter would be a relationship between the yellow boxes. Someone else argued that calling the blue one a Dataset stretches the semantics of Dataset because it is essentially a grouping concept, not a conceptual view on a data file. The alternative perspective is a diagram where the red boxes are directly attached to the blue box, deleting the yellow level.

So, two issues that we would need to discuss:

* What is the definition of 'version' in relation to the sameness or similarity of the data, and in particular does the concept of 'version' include time series and spatial series?

* How are 'versions' mapped onto DCAT Dataset and Distribution?

If we go into this discussion, I think it is also necessary to look at the issue from a practical perspective -- what do people do at the moment?
Looking at it from a theoretical perspective may lead to very long and possibly unproductive discussions. And maybe we should end up with multiple sets of "if this is your perspective, do it this way" rather than choosing the 'right' way.

Makx.

Received on Friday, 19 June 2015 09:41:59 UTC