The RDF Data Cube Vocabulary - W3C Working Draft 12 March 2013

Dear Editors, 

Jeni brought this draft to my attention. After reading the document, here are a few comments/suggestions. Bear in mind that I have little experience in W3C standards, so they may seem obvious or irrelevant. 

Overall comment: Data Cube seems to be geared towards official statistics; other audiences may find it harder to grasp. Your first priority should be the comments referring to example 5.3 and similar.

A link to the SDMX User Guide 2.1, especially section 2.2 Background, would ease understanding for newcomers.
As an applied statistician, in my simple world, I think of datasets in tabular format: rows and columns (+ metadata). E.g. observed values of individuals (rows) across characteristics such as age (columns). Of course, a dataset may also consist of aggregated data. The point is, if the concept of a dataset is used in a more general sense, it may be misinterpreted.
Introduce examples earlier.
I'd recommend avoiding the term "non-statistical data" as I have only heard it in the context of official statistics. Or what exactly makes data statistical? (see e.g. section 5.1)
2.3 Audience: expand with examples? 
Section 5.1 "A set of values for all the dimension components is sufficient to identify a single observation." Would that imply there cannot be two individuals with the same characteristics? Or that such a dataset must include a unique ID, even if created artificially? What about data that is anonymised?
5.1 may also need more examples, say factor variables such as gender. They are usually stored as binaries or 2/1 and come with a label for female/male. Make explicit what would be the measure and attribute components. Or refer to a later section that addresses sex.
The slice example is good, but could do with a shorter sentence.
I'm confused about the use of "metadata" now. Is it metadata about the whole dataset and/or about single observations?
For practical use, example 5.3 is unwieldy; the long format is arguably more common. [1] What we see here I'd refer to as a data table, not a dataset. Most statistical programs only read data in tabular format.
Perhaps include what happens to the metadata in example 5.3 as well ("StatsWales report number 003311" etc.)
Example 5.3 -- can we have some actual final code in there? Even if it anticipates some sections.
Example 6.3: I find it hard to see where we define the nested structure of the data. Include a reference to example 4, or call it something more telling than "example".
Section 7 (before 7.1; generally I'd avoid sections without subheaders): can these definitions come earlier? They suddenly explained a lot more.
So qb:dataSet and qb:DataSet are different…
Unfortunately, I cannot comment on sections 10 and 11.

Reading this guide, it might be relatively easy to provide a tool (algorithm) which translates a simple and "well-behaved" dataset into a DataCube syntax. This would greatly invite new users to play around and familiarise themselves with the vocabulary. There may be substantial reasons (e.g. manual specifications) why this is not possible, but I am not aware of the details. 
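
To illustrate the kind of tool meant here, a minimal sketch in Python follows. It takes a couple of rows from the table in [1] and prints one qb:Observation per value in Turtle. Only qb:Observation and qb:dataSet come from the vocabulary; the eg: namespace, eg:dataset1, eg:refArea, eg:refPeriod, eg:sex and eg:value are hypothetical placeholders, and a real cube would also need a data structure definition, which this sketch leaves out.

```python
# Sketch only: every eg: name below is a hypothetical placeholder.
ROWS = [
    # (region, period, male value, female value), taken from the table in [1]
    ("Newport", "2004-2006", 76.7, 80.7),
    ("Cardiff", "2004-2006", 78.7, 83.3),
]

PREFIXES = (
    "@prefix qb: <http://purl.org/linked-data/cube#> .\n"
    "@prefix eg: <http://example.org/ns#> .\n"
    "@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .\n"
)


def slug(text):
    """Make a label usable as the local part of a prefixed name."""
    return text.lower().replace(" ", "-")


def observation(region, period, sex, value):
    """Return one qb:Observation as a Turtle statement."""
    name = f"eg:obs-{slug(region)}-{period}-{sex}"
    return (
        f"{name} a qb:Observation ;\n"
        f"    qb:dataSet eg:dataset1 ;\n"
        f"    eg:refArea eg:{slug(region)} ;\n"
        f'    eg:refPeriod "{period}" ;\n'
        f"    eg:sex eg:{sex} ;\n"
        f'    eg:value "{value}"^^xsd:decimal .\n'
    )


def to_turtle(rows):
    """Emit two observations (male, female) for each input row."""
    blocks = [PREFIXES]
    for region, period, male, female in rows:
        blocks.append(observation(region, period, "male", male))
        blocks.append(observation(region, period, "female", female))
    return "\n".join(blocks)


if __name__ == "__main__":
    print(to_turtle(ROWS))
```

The manual part that remains is deciding which columns are dimensions and which hold the measure; here that choice is hard-coded in observation().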

Hope that helps,
Ulrich


---
Ulrich Atz, Statistician at the ODI
+44 (0) 20 3598 9395 @panoramadata
The ODI, 65 Clifton Street, London EC2A 4JE




[1] The data table in example 5.3 as a classical dataset:

Long format 
Region	Years	Male	Female
Newport	2004-2006	76.7	80.7
Cardiff	2004-2006	78.7	83.3
Monmouthshire	2004-2006	76.6	81.3
Merthyr Tydfil	2004-2006	75.5	79.1
Newport	2005-2007	77.1	80.9
Cardiff	2005-2007	78.6	83.7
Monmouthshire	2005-2007	76.5	81.5
Merthyr Tydfil	2005-2007	75.5	79.4
Newport	2006-2008	77.0	81.5
Cardiff	2006-2008	78.7	83.4
Monmouthshire	2006-2008	76.6	81.7
Merthyr Tydfil	2006-2008	74.9	79.6


And the less common wide format
Region	Male2004-2006	Female2004-2006	Male2005-2007	Female2005-2007	Male2006-2008	Female2006-2008
Newport	76.7	80.7	77.1	80.9	77.0	81.5
Cardiff	78.7	83.3	78.6	83.7	78.7	83.4
Monmouthshire	76.6	81.3	76.5	81.5	76.6	81.7
Merthyr Tydfil	75.5	79.1	75.5	79.4	74.9	79.6
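
As a rough sketch of how the two views relate, the Python below unpacks the wide table into one row per observation (region, period, sex, value) and then checks the statement quoted from section 5.1, i.e. that no combination of dimension values occurs twice. The function names are made up for illustration, and only two regions are included to keep it short.

```python
from collections import Counter

# Header and two rows copied from the wide table above.
WIDE_HEADER = [
    "Region",
    "Male2004-2006", "Female2004-2006",
    "Male2005-2007", "Female2005-2007",
    "Male2006-2008", "Female2006-2008",
]
WIDE_ROWS = [
    ["Newport", 76.7, 80.7, 77.1, 80.9, 77.0, 81.5],
    ["Cardiff", 78.7, 83.3, 78.6, 83.7, 78.7, 83.4],
]


def wide_to_observations(header, rows):
    """Split each wide row into (region, period, sex, value) tuples."""
    observations = []
    for row in rows:
        region = row[0]
        for column, value in zip(header[1:], row[1:]):
            # Column names look like "Male2004-2006": sex followed by period.
            sex = "Male" if column.startswith("Male") else "Female"
            period = column[len(sex):]
            observations.append((region, period, sex, value))
    return observations


def duplicate_keys(observations):
    """Return dimension combinations that identify more than one observation."""
    counts = Counter(obs[:3] for obs in observations)
    return [key for key, n in counts.items() if n > 1]


if __name__ == "__main__":
    obs = wide_to_observations(WIDE_HEADER, WIDE_ROWS)
    for o in obs:
        print(o)
    print("duplicates:", duplicate_keys(obs))
```

For aggregated tables like this one the check passes trivially; for individual-level data with repeated characteristics it would not, which is the situation my question on 5.1 is about.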

Received on Friday, 5 April 2013 20:54:54 UTC