- From: Dave Reynolds <dave.e.reynolds@gmail.com>
- Date: Sat, 06 Apr 2013 18:41:21 +0100
- To: Ulrich <ulrich.atz@theodi.org>
- CC: public-gld-comments@w3.org, Jeni Tennison <jeni@theodi.org>
Hi Ulrich,
Thank you for your comments on the Data Cube vocabulary.
This is not yet the formal response to your comments.
If possible I'd like to ask for some clarification on them to determine
what changes the WG will need to make in order to make the specification
acceptable to you.
On 05/04/13 12:13, Ulrich wrote:
> Dear Editors,
>
> Jeni brought this draft to my attention. After reading the document,
> here are a few comments/suggestions. Bear in mind that I have little
> experience in W3C standards, so they may seem obvious or irrelevant.
[To explain the context: this is a "Last Call". The specification has
been out in draft form for one year (based on a draft that had been in
use for nearly 2 years prior to that). The working group thinks it is
done and is asking the community to confirm if they find the
specification acceptable.
If the community review discovers a problem with the specification then
it will need to be changed and normally a new Last Call would be issued.
In our case the working group is closing very shortly so there will be
no chance to take a further run at this. If the current version is not
acceptable then the spec will simply progress no further.
However, we are allowed to make editorial changes if there are aspects
of the presentation which are sufficiently unclear to warrant it. We can
also remove the sections we flagged as at risk.
In order to move on to the next stage we have to demonstrate an audit
trail showing that we have either addressed every comment submitted (and
that the submitter accepts the outcome) or have adequate reasons for not
doing so, which is why this response is going to seem a bit long ...]
> Overall comment: Data Cube seems to be geared towards official
> statistics; other audiences may find it harder to grasp. Your first
> priority are the comments referring to example 5.3 and similar.
>
> * Link to the SDMX User Guide 2.1, especially /2.2 Background/ eases
> understanding for newcomers.
Given that Data Cube is currently based on the SDMX 2.0 information
model, not the 2.1 model, I'd prefer not to link to 2.1 information.
However, the 2.0 user guide does not have a directly equivalent section,
so we will consider what to do here.
> * As an applied statistician, in my simple world, I think of
> *datasets* in tabular format: rows and columns (+ metadata). E.g.
> observers values of individuals (rows) across characteristics such
> as age (columns). Of course, a dataset may also consist of
> aggregated data. The point is, if the concept of a dataset is used
> in a more general format, it may be misinterpreted.
Is a specific change required here and if so is that change editorial or
are you suggesting that the ontology itself needs to change?
The terminology of "data set" is in common use in Linked Data, and
indeed the broader world, to refer to arbitrary collections of data not
simply tabular. The terminology is also used (sadly spelt "DataSet") in
the SDMX information model where it is clearly meant to cover aggregate
statistics as well as "micro" statistics.
> * Make examples earlier.
Could you be more specific?
[The first few sections of the document need to be there, I think.
Section 5 is the start of the main part of the specification and
introduces the running example in 5.3. Sections 7 onwards pretty much
start with, and are dominated by, examples. So I'm guessing your
suggestion primarily applies to section 6. Perhaps 6.3 needs to be
broken up and those examples integrated somehow into 6.1 & 6.2. Is that
right?]
> * I'd recommend avoiding the term "non-statistical data" as I have
> /only/ heard it in the context of official statistics. Or what
> exactly makes data statistical? (see e.g. section 5.1)
Will look at rephrasing.
[Though it seems to me that some of the things for which Data Cube has
been used (environmental and other sensor measures, weather forecasts,
local authority payment records, company accounts) are pretty clearly
not statistical data sets whether or not you can define what a
statistical data set is :)]
> * 2.3 Audience: expand with examples?
Examples of what?
> * Section 5.1 "A set of values for all the dimension components is
> sufficient to identify a single observation." Would that imply
> there cannot be two individuals with the same characteristics?
No.
> Or
> that such as dataset must include an unique ID even if created
> artificially? What about data that is anonymised?
Yes, you need some way to represent each of the dimensions. If your data
set comprises values for some measurement on each individual, so that
"individual" is one of your dimensions, then yes you will need to be
able to encode a value for that dimension. That might be a "real" ID, an
anonymized ID or the moral equivalent of the row number in some source data.
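To make that concrete, here is a minimal sketch of such an observation
(the `eg:` URIs and property names are invented for this illustration,
not taken from the spec's running example):

```turtle
@prefix qb: <http://purl.org/linked-data/cube#> .
@prefix eg: <http://example.org/ns#> .

# eg:respondent is a hypothetical dimension property whose values are
# opaque, anonymized identifiers - the moral equivalent of a row number.
eg:obs42 a qb:Observation ;
    qb:dataSet    eg:survey-dataset ;
    eg:respondent eg:respondent-r0042 ;   # anonymized ID dimension
    eg:ageYears   34 .                    # the measured value
```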
> * 5.1 may also need more examples, say factor variables such as
> gender.
In section 5.1 the listed example of dimension does include "gender"
already.
> They are usually stored as binaries or 2/1 and come with a
> label for female/male.
Sure, and the running example includes gender and shows how in Data Cube
one would typically use a SKOS concept scheme to encode the gender, the
skos:notation for which would be your binary, numeric or label encoding.
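As a rough sketch, such a code list might look like this (the URIs below
are invented for illustration; the spec's running example actually uses
the SDMX sex code list):

```turtle
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix eg:   <http://example.org/codes/> .

# Hypothetical gender code list as a SKOS concept scheme.
eg:gender a skos:ConceptScheme ;
    skos:prefLabel "Gender"@en .

eg:gender-F a skos:Concept ;
    skos:inScheme  eg:gender ;
    skos:prefLabel "Female"@en ;
    skos:notation  "2" .    # or "F", or whatever your source data uses

eg:gender-M a skos:Concept ;
    skos:inScheme  eg:gender ;
    skos:prefLabel "Male"@en ;
    skos:notation  "1" .
```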
> Make explicit what would be the measure
> and attribute components.
Section 5.1 does list illustrative examples of measures "(e.g. economic
activity, population)" and attributes "(e.g. units, multipliers, status)".
> Or refer to a later section that addresses
> the sex.
OK.
> * The slice example is good, but could do with a shorter sentence.
Specifically which is the sentence that needs to be shortened?
> * I'm confused about the use of "metadata" now. Is it metadata about
> the whole dataset or/and about a single observations?
Metadata in the sense of section 9 (dct:publisher, dct:subject etc.)
applies to the whole dataset.
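For example, dataset-level metadata attaches to the qb:DataSet resource
itself, along these lines (the URIs and values here are placeholders):

```turtle
@prefix qb:  <http://purl.org/linked-data/cube#> .
@prefix dct: <http://purl.org/dc/terms/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix eg:  <http://example.org/ns#> .

# Whole-dataset metadata hangs off the qb:DataSet resource.
eg:dataset a qb:DataSet ;
    dct:title     "Life expectancy in Wales"@en ;
    dct:publisher eg:StatsWales ;
    dct:issued    "2013-04-05"^^xsd:date .
```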
Metadata which applies to a single observation and which is needed in
order to interpret the observation should be represented using
attributes (e.g. units of measure, measurement procedure used etc.).
In normal RDF fashion there is no barrier to also attaching informative
metadata to an individual observation, but the Data Cube specification
makes no particular recommendation about that. Indeed, one advantage of
encoding data in RDF is that you can identify each "cell" in the data
and freely annotate it.
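The distinction might be sketched like this: the unit of measure is an
attribute component because it is needed to interpret the value, while a
free-form note is just an ordinary RDF annotation (the `eg:` names here
are invented; sdmx-attribute:unitMeasure is from the SDMX-RDF
vocabularies used in the spec's examples):

```turtle
@prefix qb:   <http://purl.org/linked-data/cube#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix sdmx-attribute: <http://purl.org/linked-data/sdmx/2009/attribute#> .
@prefix eg:   <http://example.org/ns#> .

eg:obs1 a qb:Observation ;
    qb:dataSet        eg:dataset ;
    eg:lifeExpectancy 76.7 ;
    # attribute: required to interpret the observation
    sdmx-attribute:unitMeasure eg:years ;
    # informative annotation: allowed, but not part of the cube structure
    rdfs:comment "Provisional figure."@en .
```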
> * For practical use example 5.3 is unwieldy; the long format is
> arguably more common. [1] What we see here is I'd refer to as a
> *data* *table *not dataset. Most statistical programs only read data
> in tabular format.
Example 5.3 is a presentation format in order to allow the reader to
understand the data that will be used in the running example.
It does not mean that the data will necessarily be delivered in that
format, though in fact in this specific case it was.
Agreed that for actually ingesting and processing data, "long format" is
much easier to work with. However, that's orthogonal to the Data Cube
specification, which is about how to represent such information in RDF.
[Off topic: I'm told by my nearest friendly statistician that Stata is
particularly good at converting between long and wide formats and that
arguably the terminology of "long/wide" originated there.]
> * Perhaps include what happens to the metadata in example 5.3 as well
> ("StatsWales <http://statswales.wales.gov.uk/index.htm> report
> number 003311" etc.)
The examples in section 9 do apply to the data from 5.3, showing issue
date, subject categories and title. The notion of a report number is not
part of the recommended core minimum set of metadata and I'd prefer to
keep to that core if we can.
> * Example 5.3 -- /can we have some actual final code in there? /Even
> if it anticipates some sections.
By "final code" do you mean the RDF representation as a Data Cube?
That would be too long to include in-line but one option might be to put
the complete RDF version in an appendix and link to that from section 5.3.
> * Example 6.3 I find it hard to see where we define the nested
> structure of the data - include reference to example 4 or call it
> something more telling than "example".
OK, will retitle that section.
> * Section 7. (before 7.1 - generally I'd avoid sections without
> subheaders) Can these definitions come earlier? -- suddenly
> explained a lot more.
Perhaps those definitions (rephrased to take into account Guillaume's
comments) could be moved into 5.1 or thereabouts.
> * So qb:data*S*et and qb:*D*ata*S*et are different…
Correct.
The use of capital-S is, as explained in the document, for consistency
with SDMX. It's a pain, since most other vocabularies use "dataset", but
since that's what SDMX uses we decided to accept that pain.
The convention of having a property start with lower-case and a class
start with upper-case is so deeply embedded in this community that
having pairs of class/property that differ only by initial case is
relatively common. There is an extensive parallel thread on this
regarding the DCAT vocabulary if you are interested.
Given that this is a defensible convention and that there are
substantial numbers of data sets that have been published using the
existing version of the data cube vocabulary then it would take an
overwhelmingly good technical reason to change this at this stage.
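In other words, the pair is used like this (sketch with invented `eg:`
URIs):

```turtle
@prefix qb: <http://purl.org/linked-data/cube#> .
@prefix eg: <http://example.org/ns#> .

# qb:DataSet (upper-case D) is the class;
# qb:dataSet (lower-case d) is the property linking an
# observation to the data set it belongs to.
eg:dataset a qb:DataSet .

eg:obs1 a qb:Observation ;
    qb:dataSet eg:dataset .
```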
> * Unfortunately, I cannot comment on section 10 and 11.
>
>
> Reading this guide, it might be relatively easy to provide a tool
> (algorithm) which translates a simple and "well-behaved" dataset into a
> DataCube syntax.
Indeed. We, and I'm sure others, have various ad hoc tooling for this,
though I'm not aware of any publicly provided generic Data Cube
conversion toolkit.
To sum up: would it be right to say that your comments are essentially
editorial and that you are not proposing that any change is required to
the design itself? So if we can address enough of your editorial issues,
you would be prepared to deem the current specification acceptable?
Dave
Received on Saturday, 6 April 2013 17:41:56 UTC