RE: ID 47

Karen,

Thanks for your more detailed thoughts on the issue of dataset updates. 

It seems to me that your approach goes deeper than my earlier message on the related subject of versioning, https://lists.w3.org/Archives/Public/public-dxwg-wg/2017Jun/0006.html

In addition to looking at the dataset as a black box as I did, your patterns go into the actual changes within the dataset. In the six types of relationships that I listed, the information tells the user that the dataset changed and how it changed, but only on a fairly general level. As far as I understand, something like your pattern #4 moves into territory that DCAT has not considered until now.

Maybe we should try to separate these patterns into two categories? One category that considers the 'black box' relationships in a general sense, and one category that describes the exact changes that were applied to the contents of the dataset. 

For the second category, we may need to do some general modelling of the abstract 'records' that make up a dataset. After all, in different types of datasets 'records' could come in many forms, e.g. rows, columns or cells in a spreadsheet, articles in legislation, chapters in text, bibliographic records, layers in still images, frames in moving images, faces and vertices in 3D models, etc.

Would this be in scope of the working group?

Makx.


-----Original Message-----
From: Karen Coyle [mailto:kcoyle@kcoyle.net] 
Sent: 16 August 2017 16:59
To: public-dxwg-wg@w3.org
Subject: ID 47

This is about ID47[1], which is about a certain relationship between datasets: how they serve as updates, one to the other.

This may be a narrower use case than ID32[2], which is about relationships between datasets.

1. The simplest case is that successive datasets over time are more recent versions of the data. The newer dataset may render older datasets obsolete in some cases. In other cases, such as successive censuses, earlier datasets may be useful for applications like longitudinal studies. The key is that each dataset is complete in itself.

2. Another case is that datasets are additive - dataset B adds to dataset A. An example would be that dataset A is a CSV file with rows 1-99 and dataset B is a CSV file with rows 100-199. This is similar to a part/whole relationship, except that there is not necessarily a "whole", just parts, which are produced generally at different times. The datasets can be combined into a single dataset. The value of using the individual datasets on their own can vary.

3. A version of #2 (which may not need to be distinguished from it) is a publication pattern such as "monthly", where there is a base cumulative dataset and then periodic additive files until the next time that a cumulative dataset is produced. (This was a vital pattern in analog resources, but may be less used for digital ones.) It is probably expected that recipients can combine datasets in their applications, or at least treat them as a single dataset virtually.

4. This extends the concepts in #2 and #3. In this scenario, there is a "master" database that is updated in place. Other sites have copies of the database, and receive (or request/pull) updates. The update files contain "records" that, when processed, will result in a file or database that is in the same state as the "master" database. The files contain new records, changed records (that must replace the older records with the same record ID), and delete records (that must be used to delete the older record with the same record ID). These files have minimal value on their own except as they can be used to update a copy of the master dataset. This update method is one that is used heavily in the library community. In the US, the Library of Congress holds the master database but the records are also stored and used by many dozens of institutions across the country (and the world).
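
To make the update step in #4 concrete, here is a minimal sketch of how a site holding a copy might process one update file. Everything named here is an assumption for illustration only: the copy is modelled as a dict keyed by record ID, and each update entry is a hypothetical (action, record_id, record) tuple with actions "new", "changed" and "delete". Real update formats will differ.

```python
from typing import Dict, Iterable, Optional, Tuple

# Hypothetical shapes, not taken from any real update format:
# a record is a plain dict, an update entry is (action, record_id, record).
Record = Dict[str, str]
UpdateEntry = Tuple[str, str, Optional[Record]]


def apply_updates(copy: Dict[str, Record],
                  updates: Iterable[UpdateEntry]) -> Dict[str, Record]:
    """Return a new copy with the update entries applied in order."""
    result = dict(copy)  # leave the caller's copy untouched
    for action, record_id, record in updates:
        if action in ("new", "changed"):
            # A new record is added; a changed record replaces the
            # older record with the same record ID.
            if record is None:
                raise ValueError(f"{action} entry for {record_id} has no record")
            result[record_id] = record
        elif action == "delete":
            # A delete record removes the older record with the same ID.
            result.pop(record_id, None)
        else:
            raise ValueError(f"unknown action: {action}")
    return result


local_copy = {"r1": {"title": "A"}, "r2": {"title": "B"}}
update_file = [
    ("new", "r3", {"title": "C"}),
    ("changed", "r1", {"title": "A, revised"}),
    ("delete", "r2", None),
]
updated_copy = apply_updates(local_copy, update_file)
```

After the update file is applied, the copy should be in the same state as the master, provided no earlier update file was skipped. That dependency on the preceding files is one reason a consumer would want to know up front that such a file is not "stand alone".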

These may not be all of the relevant types of update; suggestions welcome.

Update patterns can be very complex, so this is another case in which DCAT may need to define a small number of very common values, with a hand-off to "somewhere else" for the long tail. It may also be useful for data consumers to know immediately whether a dataset is "stand alone" or requires other datasets to be complete.

kc
[1] https://w3c.github.io/dxwg/ucr/#ID47
[2] https://w3c.github.io/dxwg/ucr/#ID32
--
Karen Coyle
kcoyle@kcoyle.net http://kcoyle.net
m: 1-510-435-8234 (Signal)
skype: kcoylenet/+1-510-984-3600

Received on Wednesday, 16 August 2017 19:03:26 UTC