Re: comments on WD-vocab-data-cube-20130312

Hi Bill,

Many thanks for your feedback on Data Cube.

Some responses in line ...

On 05/04/13 09:44, Bill Roberts wrote:
> I've reviewed the Data Cube working draft and just to say that overall I
> think it is excellent: I have been using earlier versions of this for
> some time and it is great to see it going through the W3 standardisation
> process.  It is one of the most important vocabularies for linked data
> in a government context.

Thanks for the endorsement.

As we move through the next stage of the process we'll need to gather 
evidence of use, so any specific pointers to existing usage would be 
very helpful.

[<WG hat off> I'm obviously aware of some of your use of it but your 
suggestions on examples we should reference would be great.]

> I have a few specific comments, mainly on the newer features.  Apologies
> that I haven't kept track of these through the drafting process.  I can
> see that commenting earlier could have been more useful.
>
> Sections 8.2 and 8.3 on hierarchical and non-SKOS code lists are a very
> useful addition addressing a commonly experienced requirement for
> modelling data cubes with a geographical element.

Good.

> Section 9 has some overlap with VOID and DCAT, but is probably worth
> having separately in the DataCube document as well.  The core set of
> metadata terms seems a good choice.

Good.

> When data cube datasets appear in data catalogues, it can be useful if
> the geographical and temporal coverage and granularity can be identified.
> In the case of data cubes, this can often be deduced in an automated
> way, but may be involved, so identifying these aspects in metadata may
> be useful.  Again, this issue has some overlap with DCAT, but it is
> particularly relevant for data cube datasets.
>
> For example, it is useful to say that the spatial coverage of a dataset
> is England and the granularity is (say) 'Local Authority District'.  The
> time coverage is 2001-2012 and the temporal granularity is quarterly.
>
> Would it be useful to have some optional but standard metadata terms to
> describe these?  Perhaps with a standardised relationship to the range
> of relevant dimension properties.

I quite agree these would be useful. For spatial coverage there is 
always dct:spatial but I'm not aware of a really good existing solution 
for granularity.

Sadly the working group is coming to an end, so it will not be possible 
to add further metadata properties directly to Data Cube in this round.

It would always be possible to create additional metadata vocabularies 
for use with Data Cube (and indeed DCAT).
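
To illustrate what such a vocabulary might look like (the gran: 
namespace and its two properties are invented for this example; 
dct: and qb: are real), one could say:

```turtle
# Sketch of dataset-level coverage and granularity metadata.
# The "gran:" properties are hypothetical, not part of any standard.
@prefix qb:   <http://purl.org/linked-data/cube#> .
@prefix dct:  <http://purl.org/dc/terms/> .
@prefix gran: <http://example.org/def/granularity#> .   # hypothetical

<http://example.org/dataset/unemployment> a qb:DataSet ;
    # spatial coverage via the existing Dublin Core term
    dct:spatial  <http://example.org/geo/england> ;
    dct:temporal [ a dct:PeriodOfTime ] ;
    # granularity: no good existing solution, so invented terms here
    gran:spatialGranularity  <http://example.org/geo/LocalAuthorityDistrict> ;
    gran:temporalGranularity <http://example.org/time/quarter> .
```

The URIs are illustrative only; the point is that coverage and 
granularity sit naturally at the qb:DataSet level alongside the 
Section 9 metadata terms.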

> Section 10:  I have generally found it most useful/convenient to use the
> verbose normalized form of data cube.
>
> I have no objections to the abbreviated form, but note that due to the
> high level of repetition, normalized data cube datasets can be
> efficiently compressed by the usual compression algorithms, if
> transmission time or offline storage space is an issue.  I would not
> anticipate using the normalization algorithm.

Acknowledged.

[<WG hat off> I agree with your comments on data compression being 
effective; we have quantitative evidence of that from the work on 
representing weather forecasts with Data Cube. I think this is partly a 
perception issue: when people first see the redundancy they get worried. 
Having an apparently more compact option, similar to the SDMX encoding, 
may reduce barriers to uptake even if the technical need for it is limited.

Though there are cases, like "units of measure", which often apply to 
the whole cube, and it can sometimes be clearer (and more 
straightforward) to state those in abbreviated form.]
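
For instance, in abbreviated form a unit-of-measure attribute can be 
stated once at the dataset level rather than on every observation 
(dataset and unit URIs here are invented; the prefixes are real):

```turtle
@prefix qb:             <http://purl.org/linked-data/cube#> .
@prefix sdmx-attribute: <http://purl.org/linked-data/sdmx/2009/attribute#> .

# Abbreviated form: the attribute stated once for the whole cube.
<http://example.org/dataset/rainfall> a qb:DataSet ;
    sdmx-attribute:unitMeasure <http://example.org/unit/millimetre> .

# After normalization the same triple is repeated on each observation:
#   <obs1> sdmx-attribute:unitMeasure <http://example.org/unit/millimetre> .
#   <obs2> sdmx-attribute:unitMeasure <http://example.org/unit/millimetre> .
```

(For this to be well-formed the attribute's component specification in 
the DSD also needs qb:componentAttachment qb:DataSet.)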

> 11. Well-formed cubes
> I see the value of having a definition of a well-formed cube with
> associated automatable checks, but I wonder if the list is perhaps too
> long: i.e. some aspects of this might be better as optional enrichments
> to the data cube rather than something that is seen as 'required'.
>
> In an environment where we still have to work hard to persuade people to
> climb the learning curve of linked data, we should be wary of making the
> curve even steeper.
> Of course a publisher could choose to publish non-well-formed cubes
> which could still be used in most circumstances, but this section could
> lead to those approaches being seen as 'wrong'.

Understood.

> Are there specific use cases in mind driving the well-formedness
> criteria, eg for data cube processing toolsets? (I can imagine there is,
> but not sure exactly what the authors have in mind.)

Yes, exactly: in order to create reliable tools to transform, 
visualize or otherwise consume Data Cubes you need some constraints on 
what you can expect in the data. The well-formedness criteria are our 
attempt to provide that.

I also think in some situations it is useful to have consistency checks 
you can apply at publication time which might reveal errors in the data 
processing.
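
The draft expresses these checks as SPARQL ASK queries. As a sketch 
(along the lines of IC-1, that every observation belongs to exactly one 
dataset; not copied verbatim from the draft):

```sparql
# Returns true if some observation is ill-formed: either it has no
# qb:dataSet link, or it is linked to more than one dataset.
PREFIX qb: <http://purl.org/linked-data/cube#>
ASK {
  {
    ?obs a qb:Observation .
    FILTER NOT EXISTS { ?obs qb:dataSet ?ds }
  } UNION {
    ?obs a qb:Observation ;
         qb:dataSet ?ds1, ?ds2 .
    FILTER (?ds1 != ?ds2)
  }
}
```

A publisher can run such queries over a candidate cube before release 
and treat any "true" result as a data-processing error.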

> IC-2.  I have generally found little use for data structure definitions,
> as when using a normalized form of the data cube, the data structure is
> implicitly but clearly available in each observation. It doesn't
> particularly help someone writing SPARQL queries against the data. I
> would prefer if a data structure definition could be viewed as optional,
> but if others are using these regularly (for example to support data
> cube viewing software?), then I can see the benefit of consistently
> including a DSD.  It's obviously not difficult to do.

I do think that retaining the requirement to have a DSD is important.

The fact that there is a well-defined way you can discover the data set, 
data structure and metadata that applies to any observation is an 
important "selling" feature for Data Cube. It provides context for data 
as well as encouraging you to explicitly state whole-dataset attributes 
that might get omitted from the observation level.

I agree that from an RDF point of view there is a lot you can do with 
just the sample data, but for automated consumption of cubes 
(especially for visualization) having the explicit discoverable 
structure is important.

I also think there is a sufficiently strong expectation from SDMX that 
you have a DSD that making it optional in Data Cube might trigger an 
adverse reaction.
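
And the cost of providing one is small. A minimal DSD for a cube with 
two dimensions and one measure amounts to something like this (the 
dataset and DSD URIs are invented; the component properties are the 
real sdmx-* ones used in the draft's examples):

```turtle
@prefix qb:             <http://purl.org/linked-data/cube#> .
@prefix sdmx-dimension: <http://purl.org/linked-data/sdmx/2009/dimension#> .
@prefix sdmx-measure:   <http://purl.org/linked-data/sdmx/2009/measure#> .

# Discoverable structure: dimensions and measure declared once.
<http://example.org/dsd/unemployment> a qb:DataStructureDefinition ;
    qb:component [ qb:dimension sdmx-dimension:refArea ] ,
                 [ qb:dimension sdmx-dimension:refPeriod ] ,
                 [ qb:measure   sdmx-measure:obsValue ] .

# The dataset links to its structure, so a consumer can navigate from
# any observation via qb:dataSet / qb:structure to the DSD.
<http://example.org/dataset/unemployment> a qb:DataSet ;
    qb:structure <http://example.org/dsd/unemployment> .
```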

> IC-4 and IC-5.  It's common to use 'standard' dimensions and codelists
> from the SDMX/RDF vocabularies that have been developed alongside the
> data cube work (though I understand that these are not part of the
> working draft currently being reviewed). I have used
> sdmx-dimension:refArea and sdmx-dimension:refPeriod extensively and
> indeed these are used throughout this document in the examples.  As far
> as I'm aware sdmx-dimension.ttl does not declare a range for these.  The
> requirement to have an explicit range would mean that it is always
> necessary to declare one's own subproperties for refArea and refPeriod.
>   Perhaps that is a good thing of course, as it allows conclusions to be
> drawn about the values of those dimensions in a dataset.

Creating sub-properties with a specific range is the recommended best 
practice, but I understand the benefit of just reusing them directly.
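
The sub-property declaration itself is only a few triples, e.g. (the 
eg: namespace and eg:Area class are invented for the example):

```turtle
@prefix rdfs:           <http://www.w3.org/2000/01/rdf-schema#> .
@prefix qb:             <http://purl.org/linked-data/cube#> .
@prefix sdmx-dimension: <http://purl.org/linked-data/sdmx/2009/dimension#> .
@prefix eg:             <http://example.org/def/> .   # hypothetical publisher namespace

# A local dimension specializing the standard refArea, with an
# explicit range so the cube satisfies IC-4/IC-5.
eg:refArea a qb:DimensionProperty ;
    rdfs:subPropertyOf sdmx-dimension:refArea ;
    rdfs:range eg:Area .   # illustrative range class
```

As you say, the explicit range then licenses conclusions about the 
values of that dimension across the dataset.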

> In some cases, being less specific could be useful.  Using the standard
> properties offers some gain in interoperability and readability.  This
> issue could be solved by declaring a (fairly generic) range for refArea
> and refPeriod I suppose.

Good point.

The sdmx-* vocabularies are not a part of the W3C Data Cube spec so we 
can update those without invalidating the Last Call.

If we added an explicit but completely generic range for refArea and 
refPeriod that would allow them to be used in well-formed cubes without 
directly endorsing that as recommended practice.

Given that change (which is outside of the GLD work and the Last Call 
process), and given the comments on your other points, are you happy 
with this response?

Dave
W3C GLD Working Group

Received on Friday, 5 April 2013 11:17:09 UTC