- From: Dave Reynolds <dave.e.reynolds@gmail.com>
- Date: Sat, 06 Apr 2013 18:41:21 +0100
- To: Ulrich <ulrich.atz@theodi.org>
- CC: public-gld-comments@w3.org, Jeni Tennison <jeni@theodi.org>
Hi Ulrich,
Thank you for your comments on the Data Cube vocabulary.
This is not yet the formal response to your comments.
If possible I'd like to ask for some clarification on them to determine
what changes the WG will need to make in order to make the specification
acceptable to you.
On 05/04/13 12:13, Ulrich wrote:
> Dear Editors,
>
> Jeni brought this draft to my attention. After reading the document,
> here are a few comments/suggestions. Bear in mind that I have little
> experience in W3C standards, so they may seem obvious or irrelevant.
[To explain the context: this is a "Last Call". The specification has
been out in draft form for one year (based on a draft that had been in
use for nearly 2 years prior to that). The working group thinks it is
done and is asking the community to confirm if they find the
specification acceptable.
If the community review discovers a problem with the specification then
it will need to be changed and normally a new Last Call would be issued.
In our case the working group is closing very shortly so there will be
no chance to take a further run at this. If the current version is not
acceptable then the spec will simply progress no further.
However, we are allowed to make editorial changes if there are aspects
of the presentation which are sufficiently unclear to warrant it. We can
also remove the sections we flagged as at risk.
In order to move on to the next stage we have to demonstrate an audit
trail showing that we have either addressed every comment submitted (and
that the submitter accepts the outcome) or have adequate reasons for not
doing so, which is why this response is going to seem a bit long ...]
> Overall comment: Data Cube seems to be geared towards official
> statistics; other audiences may find it harder to grasp. Your first
> priority are the comments referring to example 5.3 and similar.
>
> * Link to the SDMX User Guide 2.1, especially /2.2 Background/ eases
> understanding for newcomers.
Given that Data Cube is currently based on the SDMX 2.0 information
model, not the 2.1 model, I'd prefer not to link to 2.1 information.
However, the 2.0 user guide does not have a directly equivalent section,
so we will consider what to do here.
> * As an applied statistician, in my simple world, I think of
> *datasets* in tabular format: rows and columns (+ metadata). E.g.
> observers values of individuals (rows) across characteristics such
> as age (columns). Of course, a dataset may also consist of
> aggregated data. The point is, if the concept of a dataset is used
> in a more general format, it may be misinterpreted.
Is a specific change required here and if so is that change editorial or
are you suggesting that the ontology itself needs to change?
The terminology of "data set" is in common use in Linked Data, and
indeed the broader world, to refer to arbitrary collections of data not
simply tabular. The terminology is also used (sadly spelt "DataSet") in
the SDMX information model where it is clearly meant to cover aggregate
statistics as well as "micro" statistics.
> * Make examples earlier.
Could you be more specific?
[The first few sections of the document need to be there, I think.
Section 5 is the start of the main part of the specification and
introduces the running example in 5.3. Sections 7 onwards pretty much
start with, and are dominated by, examples. So I'm guessing your
suggestion primarily applies to section 6. Perhaps 6.3 needs to be
broken up and those examples integrated somehow into 6.1 & 6.2. Is that
right?]
> * I'd recommend avoiding the term "non-statistical data" as I have
> /only/ heard it in the context of official statistics. Or what
> exactly makes data statistical? (see e.g. section 5.1)
Will look at rephrasing.
[Though it seems to me that some of the things for which Data Cube has
been used (environmental and other sensor measures, weather forecasts,
local authority payment records, company accounts) are pretty clearly
not statistical data sets whether or not you can define what a
statistical data set is :)]
> * 2.3 Audience: expand with examples?
Examples of what?
> * Section 5.1 "A set of values for all the dimension components is
> sufficient to identify a single observation." Would that imply
> there cannot be two individuals with the same characteristics?
No.
> Or
> that such as dataset must include an unique ID even if created
> artificially? What about data that is anonymised?
Yes, you need some way to represent each of the dimensions. If your data
set comprises values for some measurement on each individual, so that
"individual" is one of your dimensions, then yes you will need to be
able to encode a value for that dimension. That might be a "real" ID, an
anonymized ID or the moral equivalent of the row number in some source data.
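To make that concrete, here is a minimal sketch of such an observation
(the `eg:` URIs and property names are invented for this illustration,
not taken from the spec's running example):

```turtle
@prefix qb: <http://purl.org/linked-data/cube#> .
@prefix eg: <http://example.org/ns#> .

# eg:respondent is a hypothetical dimension property whose values are
# opaque, anonymized identifiers - the moral equivalent of a row number.
eg:obs42 a qb:Observation ;
    qb:dataSet    eg:survey-dataset ;
    eg:respondent eg:respondent-r0042 ;   # anonymized ID dimension
    eg:ageYears   34 .                    # the measured value
```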
> * 5.1 may also need more examples, say factor variables such as
> gender.
In section 5.1 the listed example of dimension does include "gender"
already.
> They are usually stored as binaries or 2/1 and come with a
> label for female/male.
Sure, and the running example includes gender and shows how in Data Cube
one would typically use a SKOS concept scheme to encode the gender, the
skos:notation for which would be your binary, numeric or label encoding.
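As a rough sketch, such a code list might look like this (the URIs below
are invented for illustration; the spec's running example actually uses
the SDMX sex code list):

```turtle
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix eg:   <http://example.org/codes/> .

# Hypothetical gender code list as a SKOS concept scheme.
eg:gender a skos:ConceptScheme ;
    skos:prefLabel "Gender"@en .

eg:gender-F a skos:Concept ;
    skos:inScheme  eg:gender ;
    skos:prefLabel "Female"@en ;
    skos:notation  "2" .    # or "F", or whatever your source data uses

eg:gender-M a skos:Concept ;
    skos:inScheme  eg:gender ;
    skos:prefLabel "Male"@en ;
    skos:notation  "1" .
```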
> Make explicit what would be the measure
> and attribute components.
Section 5.1 does list illustrative examples of measures "(e.g. economic
activity, population)" and attributes "(e.g. units, multipliers, status)".
> Or refer to a later section that addresses
> the sex.
OK.
> * The slice example is good, but could do with a shorter sentence.
Specifically which is the sentence that needs to be shortened?
> * I'm confused about the use of "metadata" now. Is it metadata about
> the whole dataset or/and about a single observations?
Metadata in the sense of section 9 (dct:publisher, dct:subject etc.)
applies to the whole dataset.
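For example, dataset-level metadata attaches to the qb:DataSet resource
itself, along these lines (the URIs and values here are placeholders):

```turtle
@prefix qb:  <http://purl.org/linked-data/cube#> .
@prefix dct: <http://purl.org/dc/terms/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix eg:  <http://example.org/ns#> .

# Whole-dataset metadata hangs off the qb:DataSet resource.
eg:dataset a qb:DataSet ;
    dct:title     "Life expectancy in Wales"@en ;
    dct:publisher eg:StatsWales ;
    dct:issued    "2013-04-05"^^xsd:date .
```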
Metadata which applies to a single observation and which is needed in
order to interpret the observation should be represented using
attributes (e.g. units of measure, measurement procedure used etc.).
In normal RDF fashion there is no barrier to also attaching informative
metadata to an individual observation, but the Data Cube specification
makes no particular recommendation about that. Indeed, one advantage of
encoding data in RDF is that you can identify each "cell" in the data
and freely annotate it.
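The distinction might be sketched like this: the unit of measure is an
attribute component because it is needed to interpret the value, while a
free-form note is just an ordinary RDF annotation (the `eg:` names here
are invented; sdmx-attribute:unitMeasure is from the SDMX-RDF
vocabularies used in the spec's examples):

```turtle
@prefix qb:   <http://purl.org/linked-data/cube#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix sdmx-attribute: <http://purl.org/linked-data/sdmx/2009/attribute#> .
@prefix eg:   <http://example.org/ns#> .

eg:obs1 a qb:Observation ;
    qb:dataSet        eg:dataset ;
    eg:lifeExpectancy 76.7 ;
    # attribute: required to interpret the observation
    sdmx-attribute:unitMeasure eg:years ;
    # informative annotation: allowed, but not part of the cube structure
    rdfs:comment "Provisional figure."@en .
```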
> * For practical use example 5.3 is unwieldy; the long format is
> arguably more common. [1] What we see here is I'd refer to as a
> *data* *table *not dataset. Most statistical programs only read data
> in tabular format.
Example 5.3 is a presentation format in order to allow the reader to
understand the data that will be used in the running example.
It does not mean that the data will necessarily be delivered in that
format, though in fact in this specific case it was.
Agreed that for actually ingesting and processing data, "long format" is
much easier to work with. However, that's orthogonal to the Data Cube
specification, which is about how to represent such information in RDF.
[Off topic: I'm told by my nearest friendly statistician that Stata is
particularly good at converting between long and wide formats and that
arguably the terminology of "long/wide" originated there.]
> * Perhaps include what happens to the metadata in example 5.3 as well
> ("StatsWales <http://statswales.wales.gov.uk/index.htm> report
> number 003311" etc.)
The examples in section 9 do apply to the data from 5.3, showing issue
date, subject categories and title. The notion of a report number is not
part of the recommended core minimum set of metadata and I'd prefer to
keep to that core if we can.
> * Example 5.3 -- /can we have some actual final code in there? /Even
> if it anticipates some sections.
By "final code" do you mean the RDF representation as a Data Cube?
That would be too long to include in-line but one option might be to put
the complete RDF version in an appendix and link to that from section 5.3.
> * Example 6.3 I find it hard to see where we define the nested
> structure of the data - include reference to example 4 or call it
> something more telling than "example".
OK, will retitle that section.
> * Section 7. (before 7.1 - generally I'd avoid sections without
> subheaders) Can these definitions come earlier? -- suddenly
> explained a lot more.
Perhaps those definitions (rephrased to take into account Guillaume's
comments) could be moved into 5.1 or thereabouts.
> * So qb:data*S*et and qb:*D*ata*S*et are different…
Correct.
The use of capital-S is, as explained in the document, for consistency
with SDMX. It's a pain, since most other vocabularies use "dataset", but
since that's what SDMX uses we decided to accept that pain.
The convention of having a property start with lower-case and a class
start with upper-case is so deeply embedded in this community that
having pairs of class/property that differ only by initial case is
relatively common. There is an extensive parallel thread on this
regarding the DCAT vocabulary if you are interested.
Given that this is a defensible convention and that there are
substantial numbers of data sets that have been published using the
existing version of the data cube vocabulary then it would take an
overwhelmingly good technical reason to change this at this stage.
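In other words, the pair is used like this (sketch with invented `eg:`
URIs):

```turtle
@prefix qb: <http://purl.org/linked-data/cube#> .
@prefix eg: <http://example.org/ns#> .

# qb:DataSet (upper-case D) is the class;
# qb:dataSet (lower-case d) is the property linking an
# observation to the data set it belongs to.
eg:dataset a qb:DataSet .

eg:obs1 a qb:Observation ;
    qb:dataSet eg:dataset .
```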
> * Unfortunately, I cannot comment on section 10 and 11.
>
>
> Reading this guide, it might be relatively easy to provide a tool
> (algorithm) which translates a simple and "well-behaved" dataset into a
> DataCube syntax.
Indeed. We, and I'm sure others, have various ad hoc tooling for this,
though I'm not aware of any publicly provided generic Data Cube
conversion toolkit.
To sum up: would it be right to say that your comments are essentially
editorial and that you are not proposing that any change is required to
the design itself? So if we can address enough of your editorial issues,
you would be prepared to deem the current specification acceptable?
Dave
Received on Saturday, 6 April 2013 17:41:56 UTC