Re: [Data Cubes] Why this kind of Data Structure Definition

Hi Thomas,

There are several reasons behind the DSD (Data Structure Definition) 
approach.

1. One of the original design criteria for Data Cube was compatibility 
with SDMX (at least with its core information model), and the notion and 
terminology of DSDs come from SDMX. Since the approach works well in 
that world, and many users of Data Cube have some familiarity with SDMX, 
that compatibility is helpful.

2. It is useful to have a single place where the structure of a cube is 
defined. Many users publish multiple cubes with the same shape (e.g. 
same statistics but covering new years or different regions). Having a 
single resource (URI) which defines that structure makes it trivial for 
publishers to reuse structure definitions and for consumers to check it 
is the same structure. The DSD design achieves that quite neatly.
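As an illustrative sketch (the URIs here are made up; the qb: terms are 
from the Data Cube vocabulary), two cubes covering different years can 
simply point at the same DSD:

```turtle
@prefix qb: <http://purl.org/linked-data/cube#> .
@prefix eg: <http://example.org/> .

# Both datasets declare the same structure, so a consumer
# can fetch and check eg:unemployment-dsd just once.
eg:dataset-2011 a qb:DataSet ;
    qb:structure eg:unemployment-dsd .

eg:dataset-2012 a qb:DataSet ;
    qb:structure eg:unemployment-dsd .
```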

3. For several of the use cases for Data Cube the data consumer needs to 
be able to easily inspect the structure - to search for cubes with a 
given shape, to decide how to present cubes etc. The simplicity of the 
DSD structure (in contrast to an equivalent OWL specification) 
facilitates that.

Actually this last point also applies to human inspection. One of the 
pieces of feedback we've had from users of Data Cube is that the DSD is 
one of its most appealing features. Having a compact, readable statement 
of the dimensions, measures and attributes in a cube makes it easy to 
quickly understand the shape of the data.
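For instance, a minimal DSD might look like the following sketch 
(example URIs; the sdmx-* terms are from the SDMX-RDF vocabularies):

```turtle
@prefix qb:             <http://purl.org/linked-data/cube#> .
@prefix sdmx-dimension: <http://purl.org/linked-data/sdmx/2009/dimension#> .
@prefix sdmx-measure:   <http://purl.org/linked-data/sdmx/2009/measure#> .
@prefix sdmx-attribute: <http://purl.org/linked-data/sdmx/2009/attribute#> .
@prefix eg:             <http://example.org/> .

# The whole shape of the cube is readable at a glance:
# two dimensions, one measure, one attribute.
eg:unemployment-dsd a qb:DataStructureDefinition ;
    qb:component [ qb:dimension sdmx-dimension:refArea ] ,
                 [ qb:dimension sdmx-dimension:refPeriod ] ,
                 [ qb:measure   sdmx-measure:obsValue ] ,
                 [ qb:attribute sdmx-attribute:unitMeasure ] .
```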

Some comments in line ...

On 15/08/12 07:56, Thomas Bandholtz wrote:
> meanwhile I have some understanding of Data Cubes.
> What makes it difficult to understand is the specific Data Structure
> Definition pattern.
>
> The “Data Structure Definition” of a “Data Set” links to the set of
> “Component Specifications” which describe “Dimension-“ or “Measure
> Properties” that will be properties of the “Data Set”.
>
> All this could be expressed by making subclasses of cubes:DataSet the
> domain of the same dimension and measure properties.

Not all, but much of it could indeed be expressed using subclasses. 
However, it does get verbose. For each cube you also need to subclass 
qb:Slice and qb:Observation. Then you need OWL restrictions to relate 
these, so that your qb:DataSet subclass only has the right qb:Slice and 
qb:Observation subclasses. Since there can be several different 
qb:SliceKeys, you then need correspondingly many different qb:Slice 
subclasses, and corresponding unionOfs to tie those back together again.

Perfectly possible, but it means that the structure definition is rather 
distributed (see #2 above), less compact, and much less easy to inspect 
(see #3 above).
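To give a flavour of that verbosity, here is a fragment (hypothetical 
names, and only a small part of what would be needed) of the 
subclassing alternative for a single dimension:

```turtle
@prefix owl:  <http://www.w3.org/2002/07/owl#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix qb:   <http://purl.org/linked-data/cube#> .
@prefix eg:   <http://example.org/> .

eg:MyObservation a owl:Class ;
    rdfs:subClassOf qb:Observation ,
        [ a owl:Restriction ;
          owl:onProperty eg:refArea ;
          owl:cardinality 1 ] .

eg:MyDataSet a owl:Class ;
    rdfs:subClassOf qb:DataSet .
    # ... plus further restrictions tying eg:MyDataSet to only the
    # right eg:MyObservation and eg:MySlice subclasses, repeated
    # (with unionOfs) for every slice key.
```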

The aspects that aren't expressible in OWL (the ordering of dimensions, 
attachment levels, the measureDimension role) could be expressed using 
AnnotationProperties. Again perfectly possible, but at that point you 
have a mix of a standard mechanism (OWL) and some custom machinery and 
thus still need to do some custom handling.

> cubes:ComponentSpecification adds a cardinality choice (0-1 or 1) and
> (optionally) a specific order, and cubes:DimensionProperty can specify a
> cubes:codeList for the values of this property.
>
> Cardinality can be handled by OWL,

Sure.

> RDF can describe ordered lists,

Yes, and the initial design used that, but it proved problematic.

At one point it was proposed that the DSD should be a list of 
ComponentProperties so that the ordering was clear.

The problem with that is that querying RDF lists is tricky (though the 
advent of SPARQL 1.1 has alleviated that somewhat). This was especially 
annoying because the majority of cubes don't specify an order, so 
complicating access in the general case to cater for minority cases was 
distasteful.

We also discussed a dual approach where each of the ComponentProperties 
was directly attached and then the dimensions also attached as an 
ordered list but that repetition was unacceptable.

The current qb:order approach, while a bit ugly, means that the common 
unordered case is trivial, and it also has the benefit of supporting 
partial orders. A common use case is to want to put one dimension (e.g. 
time) first (to indicate the data should be displayed as a time series) 
but not to care about the order of the other dimensions. The current 
design caters for that with less complexity than representing partial 
orders in lists.
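A sketch of that (example URIs again): only the time dimension carries 
a qb:order, and the rest are left unordered:

```turtle
@prefix qb:             <http://purl.org/linked-data/cube#> .
@prefix sdmx-dimension: <http://purl.org/linked-data/sdmx/2009/dimension#> .
@prefix sdmx-measure:   <http://purl.org/linked-data/sdmx/2009/measure#> .
@prefix eg:             <http://example.org/> .

eg:dsd a qb:DataStructureDefinition ;
    qb:component [ qb:dimension sdmx-dimension:refPeriod ;
                   qb:order 1 ] ,                          # time first
                 [ qb:dimension sdmx-dimension:refArea ] , # order unspecified
                 [ qb:measure   sdmx-measure:obsValue ] .
```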

> and
> the cubes:codeList value can simply be the range of the dimension property.

It's not quite that easy. To do that for coded properties you need to 
adopt a particular design pattern for how to use SKOS.

The value of the DimensionProperty is a skos:Concept, not the 
skos:ConceptScheme. So to use rdfs:range you have to introduce a 
subclass of skos:Concept for each concept scheme.

In the data.gov.uk work that design pattern for SKOS use is strongly 
encouraged so in that case we can use rdfs:range. We also adopted that 
pattern in creating the RDF rendering of the SDMX code lists. However, 
the bulk of external SKOS vocabularies we come across don't use that 
pattern which makes it harder to use them "out of the box" with an 
rdfs:range approach.

It also makes discovering the code list a little trickier (#3 again). 
The relationship between the subclass of skos:Concept and the associated 
skos:ConceptScheme can be expressed as an owl:hasValue restriction on 
skos:inScheme. To query for that information you need everyone to follow 
that pattern (and even data.gov.uk didn't go as far as requiring those 
hasValue restrictions) and it's not an intuitive query.
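Sketched with hypothetical URIs, the contrast looks roughly like this: 
one non-obvious restriction versus one annotation triple:

```turtle
@prefix owl:  <http://www.w3.org/2002/07/owl#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix qb:   <http://purl.org/linked-data/cube#> .
@prefix eg:   <http://example.org/> .

# The rdfs:range route: a subclass of skos:Concept per scheme,
# linked to its scheme via an owl:hasValue restriction.
eg:Region a owl:Class ;
    rdfs:subClassOf skos:Concept ,
        [ a owl:Restriction ;
          owl:onProperty skos:inScheme ;
          owl:hasValue eg:regionScheme ] .

eg:refArea a qb:DimensionProperty ;
    rdfs:range eg:Region .

# The Data Cube route: a single, directly queryable annotation.
eg:refArea qb:codeList eg:regionScheme .
```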

So again a single annotation property enables easy discovery and 
inspection of the codeList while not precluding use of rdfs:range over 
subclasses of skos:Concept for people (like us) who are happy with that 
pattern of SKOS use.

> The specification does not give any reason why they invent all this
> instead of expressing the same with basic RDFS/OWL patterns.

Hopefully I've given some insight into why this isn't invention so much 
as adopting SDMX ideas (which are explicitly referenced in the 
specification as the basis of the approach), and why the alternative is 
not *basic* RDFS/OWL so much as relatively sophisticated use of OWL.

A single, compact, declarative statement of the structure does seem to 
have proved appealing and useful in practice.

All that said, it might be an interesting exercise for someone to write 
a compiler to convert a DSD into the corresponding set of OWL (for those 
parts of the DSD where that's possible). That would allow you to do 
things like cube structure validation using an OWL closed-world checker. 
[Though compiling a DSD into SPARQL is easier and likely to be a more 
effective validation solution.]
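As a flavour of the SPARQL route (a hedged sketch, not one of the 
spec's official integrity queries), one can ask whether any observation 
is missing a dimension its DSD declares:

```sparql
PREFIX qb: <http://purl.org/linked-data/cube#>

# True if some observation lacks one of its declared dimensions.
ASK {
  ?obs qb:dataSet   ?ds .
  ?ds  qb:structure ?dsd .
  ?dsd qb:component ?cs .
  ?cs  qb:dimension ?dim .
  FILTER NOT EXISTS { ?obs ?dim [] }
}
```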

Dave

Received on Wednesday, 15 August 2012 09:00:50 UTC