Re: The RDF Data Cube Vocabulary - W3C Working Draft 12 March 2013 from Ulrich on 2013-04-08 (public-gld-comments@w3.org from April 2013)

From: Ulrich <ulrich.atz@theodi.org>
Date: Mon, 8 Apr 2013 17:54:18 +0100
To: Dave Reynolds <dave.e.reynolds@gmail.com>
Cc: public-gld-comments@w3.org, Jeni Tennison <jeni@theodi.org>
Message-Id: <01F0641C-D8EA-46F7-A7C7-EB80DF91D795@theodi.org>

Hi Dave,

Overall: yes, the comments are editorial and aimed at helping newcomers such as myself grasp the concept. All of your comments that I don't address specifically are fine and make sense.

To ease understanding, the most effective change I suggest is moving the part of section 7 that comes before the first sub-heading (7.1) further forward.

Furthermore:

>>  * I'd recommend avoiding the term "non-statistical data" as I have
>>    /only/ heard it in the context of official statistics. Or what
>>    exactly makes data statistical? (see e.g. section 5.1)
> 
> Will look at rephrasing.
> 
> [Though it seems to me that some of the things for which Data Cube has been used (environmental and other sensor measures, weather forecasts, local authority payment records, company accounts) are pretty clearly not statistical data sets whether or not you can define what a statistical data set is :)]


I believe weather forecasts, for example, rely heavily on statistics. Having said that, if you argue that "non-statistical" is understood by the majority of users I have no further concerns.

>>  * 2.3 Audience: expand with examples?
> 
> Examples of what?


I was again thinking of the SDMX User Guide 2.1, but you answered that already.

>>  * The slice example is good, but could do with a shorter sentence.
> 
> Specifically which is the sentence that needs to be shortened?


"For example, given a data set on regional performance indicators then we might group all the observations about a given indicator and a given region into a slice, each slice would then represent a time series of observed values." It sounds simple now, but I do remember that I had to read it more than once the first time to get the meaning. I don't have a better suggestion at the moment, so up to you.
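
To show what I eventually understood the sentence to mean, here is a minimal sketch in plain Python (not RDF, and not taken from the spec). The indicator names, regions and values are all invented for illustration. The only point is that fixing the indicator and the region leaves time as the one free dimension, so each group is a time series:

```python
from collections import defaultdict

# Hypothetical observations: (indicator, region, year, value).
# All names and numbers are invented for illustration.
observations = [
    ("life_expectancy", "Cardiff", 2005, 78.7),
    ("life_expectancy", "Cardiff", 2006, 78.6),
    ("life_expectancy", "Newport", 2005, 76.7),
    ("life_expectancy", "Newport", 2006, 77.1),
    ("unemployment", "Cardiff", 2005, 5.1),
]


def slice_by_indicator_and_region(rows):
    """Group observations into slices keyed by (indicator, region).

    Each slice maps year -> value, i.e. a time series.
    """
    slices = defaultdict(dict)
    for indicator, region, year, value in rows:
        slices[(indicator, region)][year] = value
    return dict(slices)


slices = slice_by_indicator_and_region(observations)
for key, series in sorted(slices.items()):
    print(key, sorted(series.items()))
```

Something along the lines of "fix the indicator and the region, and what remains is a time series" might therefore be a shorter way to say it.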


>>  * For practical use example 5.3 is unwieldy; the long format is
>>    arguably more common. [1] What we see here I'd refer to as a
>>    *data table*, not a dataset. Most statistical programs only read data
>>    in tabular format.
> 
> […]
> 
> [Off topic: I'm told by my nearest friendly statistician that Stata is particularly good at converting between long and wide formats and that arguably the terminology of "long/wide" originated there.]


You may be spot on re Stata. Again, this was a comment from a practitioner's perspective, so the example may be perfectly suited for explaining DC.
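
For anyone unfamiliar with the long/wide distinction, a rough sketch of the conversion in plain Python follows. The table contents, the column names and the `wide_to_long` helper are my own invention for illustration, not something from the spec or from Stata. Each long-format row ends up carrying one value together with the dimension values that identify it, which is roughly how I picture a single observation:

```python
# Hypothetical wide table: one row per region, one column per period.
# Region names, periods and values are invented for illustration.
wide_header = ["region", "2004-2006", "2005-2007"]
wide_rows = [
    ["Cardiff", 78.7, 78.6],
    ["Newport", 76.7, 77.1],
]


def wide_to_long(header, rows, id_column="region",
                 var_name="period", value_name="value"):
    """Turn each non-id cell of a wide table into its own long-format record."""
    id_index = header.index(id_column)
    long_rows = []
    for row in rows:
        for index, column in enumerate(header):
            if index == id_index:
                continue  # the id column is repeated on every record instead
            long_rows.append({
                id_column: row[id_index],
                var_name: column,
                value_name: row[index],
            })
    return long_rows


long_rows = wide_to_long(wide_header, wide_rows)
for record in long_rows:
    print(record)
```

Two regions by two periods gives four long-format records, one per cell of the wide table.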


Hope that helps,
Ulrich




On 6 Apr 2013, at 18:41, Dave Reynolds <dave.e.reynolds@gmail.com> wrote:

> Hi Ulrich,
> 
> Thank you for your comments on the Data Cube vocabulary.
> 
> This is not yet the formal response to your comments.
> 
> If possible I'd like to ask for some clarification on them to determine what changes the WG will need to make in order to make the specification acceptable to you.
> 
> On 05/04/13 12:13, Ulrich wrote:
>> Dear Editors,
>> 
>> Jeni brought this draft to my attention. After reading the document,
>> here are a few comments/suggestions. Bear in mind that I have little
>> experience in W3C standards, so they may seem obvious or irrelevant.
> 
> [To explain the context this is a "Last Call". The specification has been out in draft form for one year (based on a draft that had been in use for nearly 2 years prior to that). The working group thinks it is done and is asking the community to confirm if they find the specification acceptable.
> 
> If the community review discovers a problem with the specification then it will need to be changed and normally a new Last Call would be issued. In our case the working group is closing very shortly so there will be no chance to take a further run at this. If the current version is not acceptable then the spec will simply progress no further.
> 
> However, we are allowed to make editorial changes if there are aspects of the presentation which are sufficiently unclear to warrant it. We can also remove the sections we flagged as at risk.
> 
> In order to move on to the next stage we have to demonstrate an audit trail that we have either addressed every comment submitted (and that the submitter accepts the outcome) or that we have adequate reasons for not doing so. Which is why this response is going to seem a bit long ....]
> 
>> Overall comment: Data Cube seems to be geared towards official
>> statistics; other audiences may find it harder to grasp. Your first
>> priority should be the comments referring to example 5.3 and similar.
>> 
>>  * Link to the SDMX User Guide 2.1, especially /2.2 Background/ eases
>>    understanding for newcomers.
> 
> Given that Data Cube is currently based on the SDMX 2.0 information model, not the 2.1 model, I'd prefer not to link to 2.1 information. Though the 2.0 user guide does not have a directly equivalent section. We will consider what to do here.
> 
>>  * As an applied statistician, in my simple world, I think of
>>    *datasets* in tabular format: rows and columns (+ metadata). E.g.
>>    observed values of individuals (rows) across characteristics such
>>    as age (columns). Of course, a dataset may also consist of
>>    aggregated data. The point is, if the concept of a dataset is used
>>    in a more general format, it may be misinterpreted.
> 
> Is a specific change required here and if so is that change editorial or are you suggesting that the ontology itself needs to change?
> 
> The terminology of "data set" is in common use in Linked Data, and indeed the broader world, to refer to arbitrary collections of data not simply tabular. The terminology is also used (sadly spelt "DataSet") in the SDMX information model where it is clearly meant to cover aggregate statistics as well as "micro" statistics.
> 
>>  * Make examples earlier.
> 
> Could you be more specific?
> 
> [The first few sections of the document need to be there I think. Section 5 is the start of the main part of the specification and introduces the running example 5.3. Sections 7 onwards pretty much start with, and are dominated by, examples. So I'm guessing your suggestion primarily applies to section 6. Perhaps 6.3 needs to be broken up and those examples integrated somehow into 6.1 & 6.2. Is that right?]
> 
>>  * I'd recommend avoiding the term "non-statistical data" as I have
>>    /only/ heard it in the context of official statistics. Or what
>>    exactly makes data statistical? (see e.g. section 5.1)
> 
> Will look at rephrasing.
> 
> [Though it seems to me that some of the things for which Data Cube has been used (environmental and other sensor measures, weather forecasts, local authority payment records, company accounts) are pretty clearly not statistical data sets whether or not you can define what a statistical data set is :)]
> 
>>  * 2.3 Audience: expand with examples?
> 
> Examples of what?
> 
>>  * Section 5.1 "A set of values for all the dimension components is
>>    sufficient to identify a single observation."  Would that imply
>>    there cannot be two individuals with the same characteristics?
> 
> No.
> 
>>    Or
>>    that such a dataset must include a unique ID even if created
>>    artificially? What about data that is anonymised?
> 
> Yes, you need some way to represent each of the dimensions. If your data set comprises values for some measurement on each individual, so that "individual" is one of your dimensions, then yes you will need to be able to encode a value for that dimension. That might be a "real" ID, an anonymized ID or the moral equivalent of the row number in some source data.
> 
>>  * 5.1 may also need more examples, say factor variables such as
>>    gender.
> 
> In section 5.1 the listed example of dimension does include "gender" already.
> 
>> They are usually stored as binaries or 2/1 and come with a
>>    label for female/male.
> 
> Sure and the running example includes gender and shows how in Data Cube one would typically use a skos Concept Scheme to encode the gender, the skos:notation for which would be your binary, or numeric or label encoding.
> 
>>    Make explicit what would be the measure
>>    and attribute components.
> 
> Section 5.1 does list illustrative examples of measures "(e.g. economic activity, population)" and attributes "(e.g. units, multipliers, status)".
> 
>> Or refer to a later section that addresses
>>    the sex.
> 
> OK.
> 
>>  * The slice example is good, but could do with a shorter sentence.
> 
> Specifically which is the sentence that needs to be shortened?
> 
>>  * I'm confused about the use of "metadata" now. Is it metadata about
>>    the whole dataset or/and about a single observation?
> 
> Metadata in the sense of section 9 (dct:publisher, dct:subject etc) apply to the whole dataset.
> 
> Metadata which applies to a single observation and which is needed in order to interpret the observation should be represented using attributes (e.g. units of measure, measurement procedure used etc).
> 
> In normal RDF fashion there is no barrier to also attaching informative metadata to an individual observation but the Data Cube specification makes no particular recommendation about that. However, it is one advantage of encoding data in RDF that you get the ability to identify each "cell" in the data and freely annotate it.
> 
>>  * For practical use example 5.3 is unwieldy; the long format is
>>    arguably more common. [1] What we see here I'd refer to as a
>>    *data table*, not a dataset. Most statistical programs only read data
>>    in tabular format.
> 
> Example 5.3 is a presentation format in order to allow the reader to understand the data that will be used in the running example.
> 
> It does not mean that the data will necessarily be delivered in that format, though in fact in this specific case it was.
> 
> Agreed that for actually ingesting and processing data then "long format" is much easier to work with. However that's orthogonal to the Data Cube specification which is about how to represent such information in RDF.
> 
> [Off topic: I'm told by my nearest friendly statistician that Stata is particularly good at converting between long and wide formats and that arguably the terminology of "long/wide" originated there.]
> 
>>  * Perhaps include what happens to the metadata in example 5.3 as well
>>    ("StatsWales <http://statswales.wales.gov.uk/index.htm> report
>>    number 003311" etc.)
> 
> The examples in section 9 do apply to the data from 5.3 showing issue date, subject categories and title. The notion of a report number is not part of the recommended core minimum set of metadata and I'd prefer to keep the core if we can.
> 
>>  * Example 5.3 -- /can we have some actual final code in there?/ Even
>>    if it anticipates some sections.
> 
> By "final code" do you mean the RDF representation as a Data Cube?
> 
> That would be too long to include in-line but one option might be to put the complete RDF version in an appendix and link to that from section 5.3.
> 
>>  * Example 6.3 I find it hard to see where we define the nested
>>    structure of the data - include reference to example 4 or call it
>>    something more telling than "example".
> 
> OK, will retitle that section.
> 
>>  * Section 7. (before 7.1 - generally I'd avoid sections without
>>    subheaders) Can these definitions come earlier? -- suddenly
>>    explained a lot more.
> 
> Perhaps those definitions (rephrased to take into account Guillaume's comments) could be moved into 5.1 or thereabouts.
> 
>>  * So qb:data*S*et and qb:*D*ata*S*et are different…
> 
> Correct.
> 
> The use of capital-S is, as explained in the document, for consistency with SDMX. It's a pain since most other vocabularies use "dataset" but since that's what SDMX uses we decided to accept that pain.
> 
> The convention of having a property start with lower-case and a class start with upper-case is so deeply embedded in this community that having pairs of class/property that only differ by initial case is relatively common. There is an extensive parallel thread on this regarding the DCAT vocabulary if you are interested.
> 
> Given that this is a defensible convention and that there are substantial numbers of data sets that have been published using the existing version of the data cube vocabulary then it would take an overwhelmingly good technical reason to change this at this stage.
> 
>>  * Unfortunately, I cannot comment on section 10 and 11.
>> 
>> 
>> Reading this guide, it might be relatively easy to provide a tool
>> (algorithm) which translates a simple and "well-behaved" dataset into a
>> DataCube syntax.
> 
> Indeed. We, and I'm sure others, have various ad hoc tooling for this though I'm not aware of any publicly provided generic Data Cube conversion toolkit.
> 
> 
> To sum up. Would it be right to say that your comments are essentially editorial and that you are not proposing that any change is required to the design itself? So if we can address sufficient of your editorial issues you would be prepared to deem the current specification acceptable?
> 
> Dave
>
Received on Monday, 8 April 2013 16:57:41 UTC