comments on WD-vocab-data-cube-20130312 from Bill Roberts on 2013-04-05 (public-gld-comments@w3.org from April 2013)

From: Bill Roberts <bill@swirrl.com>
Date: Fri, 5 Apr 2013 09:44:37 +0100
To: public-gld-comments@w3.org
Message-Id: <283D2749-9743-4016-9A8F-DC489B64E101@swirrl.com>

I've reviewed the Data Cube working draft and overall I think it is excellent: I have been using earlier versions of this for some time and it is great to see it going through the W3C standardisation process. It is one of the most important vocabularies for linked data in a government context.

I have a few specific comments, mainly on the newer features. Apologies that I haven't kept track of these through the drafting process. I can see that commenting earlier could have been more useful.

Sections 8.2 and 8.3 on hierarchical and non-SKOS code lists are a very useful addition addressing a commonly experienced requirement for modelling data cubes with a geographical element.

Section 9 has some overlap with VoID and DCAT, but is probably worth having separately in the Data Cube document as well. The core set of metadata terms seems a good choice.

When data cube datasets appear in data catalogues, it can be useful if the spatial and temporal coverage and granularity can be identified. In the case of data cubes, this can often be deduced in an automated way, but the process may be involved, so identifying these aspects in metadata may be useful. Again, this issue has some overlap with DCAT, but it is particularly relevant for data cube datasets.

For example, it is useful to say that the spatial coverage of a dataset is England and the spatial granularity is (say) 'Local Authority District', while the temporal coverage is 2001-2012 and the temporal granularity is quarterly.

Would it be useful to have some optional but standard metadata terms to describe these? Perhaps with a standardised relationship to the range of relevant dimension properties.
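
To make the "deduced in an automated way" point concrete, here is a minimal Python sketch of how coverage and granularity might be derived from the observations themselves. The observation records, the key names and the quarter format ("2001-Q1") are all assumptions invented for illustration; real data would come from the RDF and would need a more general period parser.

```python
import re

# Hypothetical observations; keys, area identifiers and values are invented.
observations = [
    {"refArea": "area-1", "refPeriod": "2001-Q1", "value": 12},
    {"refArea": "area-1", "refPeriod": "2012-Q4", "value": 15},
    {"refArea": "area-2", "refPeriod": "2006-Q2", "value": 9},
]

# Assumed period notation: four-digit year, then -Q1 to -Q4.
QUARTER = re.compile(r"^\d{4}-Q[1-4]$")


def summarise_coverage(observations):
    """Derive temporal coverage, temporal granularity and the areas covered."""
    # Strings in this notation sort chronologically, so plain sorting is enough.
    periods = sorted({o["refPeriod"] for o in observations})
    areas = sorted({o["refArea"] for o in observations})
    # Only one granularity is recognised here; anything else is "unknown".
    granularity = "quarterly" if all(QUARTER.match(p) for p in periods) else "unknown"
    return {
        "temporal_start": periods[0],
        "temporal_end": periods[-1],
        "temporal_granularity": granularity,
        "areas": areas,
    }


summary = summarise_coverage(observations)
print(summary)
```

Even this toy version has to scan every observation, which is part of why stating the result once in the dataset metadata seems attractive.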

Section 10: I have generally found it most useful/convenient to use the verbose normalized form of data cube.

I have no objections to the abbreviated form, but note that due to the high level of repetition, normalized data cube datasets can be efficiently compressed by the usual compression algorithms, if transmission time or offline storage space is an issue. I would not anticipate using the normalization algorithm.
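
As a rough illustration of the compression point, the following Python sketch builds a synthetic normalized cube serialised as N-Triples and compresses it with zlib. The example.org URIs, the measure property and the generated values are assumptions made up for the sketch; actual ratios will depend on the data and the serialisation.

```python
import zlib

QB = "http://purl.org/linked-data/cube#"
SDMX_DIM = "http://purl.org/linked-data/sdmx/2009/dimension#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
XSD_INT = "http://www.w3.org/2001/XMLSchema#integer"


def make_observations(n):
    """Build n synthetic observations in normalized form (hypothetical URIs)."""
    lines = []
    for i in range(n):
        obs = f"<http://example.org/obs/{i}>"
        lines.append(f"{obs} <{RDF_TYPE}> <{QB}Observation> .")
        lines.append(f"{obs} <{QB}dataSet> <http://example.org/dataset/1> .")
        lines.append(f"{obs} <{SDMX_DIM}refArea> <http://example.org/area/{i % 50}> .")
        lines.append(
            f"{obs} <{SDMX_DIM}refPeriod> <http://example.org/period/2012-Q{i % 4 + 1}> ."
        )
        lines.append(
            f'{obs} <http://example.org/measure/count> "{i * 7 % 1000}"^^<{XSD_INT}> .'
        )
    return "\n".join(lines).encode("utf-8")


raw = make_observations(1000)
compressed = zlib.compress(raw, 9)
print(f"raw: {len(raw)} bytes, compressed: {len(compressed)} bytes")
```

The repeated predicate and dataset URIs are exactly the kind of redundancy that general-purpose compressors handle well.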

11. Well-formed cubes
I see the value of having a definition of a well-formed cube with associated automatable checks, but I wonder if the list is perhaps too long: i.e. some aspects of this might be better as optional enrichments to the data cube rather than something that is seen as 'required'.

In an environment where we still have to work hard to persuade people to climb the learning curve of linked data, we should be wary of making the curve even steeper.
Of course a publisher could choose to publish non-well-formed cubes which could still be used in most circumstances, but this section could lead to those approaches being seen as 'wrong'.

Are there specific use cases in mind driving the well-formedness criteria, e.g. for data cube processing toolsets? (I can imagine there are, but I am not sure exactly what the authors have in mind.)

IC-2. I have generally found little use for data structure definitions, as when using a normalized form of the data cube, the data structure is implicitly but clearly available in each observation. It doesn't particularly help someone writing SPARQL queries against the data. I would prefer if a data structure definition could be viewed as optional, but if others are using these regularly (for example to support data cube viewing software?), then I can see the benefit of consistently including a DSD. It's obviously not difficult to do.
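
A small Python sketch of what "implicitly available in each observation" means in practice: with normalized data, the set of component properties can be read straight off the observations. The observation dictionaries and the eg:count measure are assumptions invented for this example.

```python
# Hypothetical normalized observations, keyed by prefixed property name.
observations = [
    {
        "qb:dataSet": "eg:dataset1",
        "sdmx-dimension:refArea": "eg:area1",
        "sdmx-dimension:refPeriod": "eg:2012-Q1",
        "eg:count": 12,
    },
    {
        "qb:dataSet": "eg:dataset1",
        "sdmx-dimension:refArea": "eg:area2",
        "sdmx-dimension:refPeriod": "eg:2012-Q1",
        "eg:count": 7,
    },
]


def implicit_structure(observations):
    """Return the component properties used, and whether every observation has all of them."""
    components = set()
    for obs in observations:
        components.update(obs)
    components.discard("qb:dataSet")  # link to the dataset, not a component
    complete = all(components <= set(obs) for obs in observations)
    return sorted(components), complete


components, complete = implicit_structure(observations)
print(components, complete)
```

What this cannot recover is which properties are dimensions and which are measures or attributes; that distinction is something an explicit DSD does add.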

IC-4 and IC-5. It's common to use 'standard' dimensions and codelists from the SDMX/RDF vocabularies that have been developed alongside the data cube work (though I understand that these are not part of the working draft currently being reviewed). I have used sdmx-dimension:refArea and sdmx-dimension:refPeriod extensively and indeed these are used throughout this document in the examples. As far as I'm aware, sdmx-dimension.ttl does not declare a range for these. The requirement to have an explicit range would mean that it is always necessary to declare one's own subproperties for refArea and refPeriod. Perhaps that is a good thing of course, as it allows conclusions to be drawn about the values of those dimensions in a dataset.

In some cases, being less specific could be useful. Using the standard properties offers some gain in interoperability and readability. This issue could be solved by declaring a (fairly generic) range for refArea and refPeriod I suppose.
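
To show the effect of the range requirement, here is a minimal Python sketch of an IC-4 style check over a handful of triples. It is a plain set comparison with no RDFS inference, and the eg: subproperty, the eg:Period class and the component node names are assumptions invented for the example; it is not the SPARQL check from the draft.

```python
# Hypothetical triples as (subject, predicate, object) tuples of prefixed names.
triples = [
    ("eg:dsd", "qb:component", "eg:c1"),
    ("eg:c1", "qb:dimension", "sdmx-dimension:refArea"),
    ("eg:dsd", "qb:component", "eg:c2"),
    ("eg:c2", "qb:dimension", "eg:refPeriod"),
    # A locally declared subproperty that carries an explicit range.
    ("eg:refPeriod", "rdfs:subPropertyOf", "sdmx-dimension:refPeriod"),
    ("eg:refPeriod", "rdfs:range", "eg:Period"),
]


def dimensions_without_range(triples):
    """List dimension properties used in a DSD that have no rdfs:range triple."""
    dimensions = {o for (_, p, o) in triples if p == "qb:dimension"}
    with_range = {s for (s, p, _) in triples if p == "rdfs:range"}
    return sorted(dimensions - with_range)


missing = dimensions_without_range(triples)
print(missing)
```

Under these assumptions the standard refArea property is flagged while the local subproperty passes, which is the situation described above.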

I hope that's useful

Best regards

Bill

Bill Roberts, CEO, Swirrl IT Limited
Mobile: +44 (0) 7717 160378 | Skype: billroberts1966 | @billroberts
http://swirrl.com | http://publishmydata.com

Received on Friday, 5 April 2013 08:45:13 UTC