- From: Dave Reynolds <dave.e.reynolds@gmail.com>
- Date: Wed, 15 Aug 2012 10:00:20 +0100
- To: public-gld-comments@w3.org
Hi Thomas,

There are several reasons behind the DSD (Data Structure Definition) approach.

1. One of the original design criteria for Data Cube was compatibility with SDMX (at least the core information model), and the notion and terminology of DSDs comes from SDMX. Since it works well in that world, and many users of Data Cube have some familiarity with SDMX, that compatibility is helpful.

2. It is useful to have a single place where the structure of a cube is defined. Many users publish multiple cubes with the same shape (e.g. the same statistics but covering new years or different regions). Having a single resource (URI) which defines that structure makes it trivial for publishers to reuse structure definitions and for consumers to check that the structure is the same. The DSD design achieves that quite neatly.

3. For several of the use cases for Data Cube the data consumer needs to be able to easily inspect the structure - to search for cubes with a given shape, to decide how to present cubes, etc. The simplicity of the DSD structure (in contrast to an equivalent OWL specification) facilitates that.

Actually this last point also applies to human inspection. One of the pieces of feedback we've had from users of the Data Cube is that the DSD is one of the most appealing features. Having a compact, readable statement of the dimensions, measures and attributes in a cube makes it easy to quickly understand the shape of the data.

Some comments in line ...

On 15/08/12 07:56, Thomas Bandholtz wrote:
> meanwhile I have some understanding of Data Cubes.
> What makes it difficult to understand is the specific Data Structure
> Definition pattern.
>
> The “Data Structure Definition” of a “Data Set” links to the set of
> “Component Specifications” which describe “Dimension-“ or “Measure
> Properties” that will be properties of the “Data Set”.
>
> All this could be expressed by making subclasses of cubes:DataSet the
> domain of the same dimension and measure properties.
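To make point 2 concrete, here is a minimal sketch in Turtle of one shared DSD reused by two datasets. The URIs and the `ex:` names are invented for illustration; `sdmx-dimension:refArea` and `sdmx-dimension:refPeriod` are from the SDMX-RDF dimension vocabulary mentioned below.

```turtle
@prefix qb:   <http://purl.org/linked-data/cube#> .
@prefix ex:   <http://example.org/ns#> .
@prefix sdmx-dimension: <http://purl.org/linked-data/sdmx/2009/dimension#> .

# A single, reusable definition of the cube's shape.
ex:unemploymentDSD a qb:DataStructureDefinition ;
    qb:component [ qb:dimension sdmx-dimension:refArea ] ,
                 [ qb:dimension sdmx-dimension:refPeriod ] ,
                 [ qb:measure   ex:unemploymentRate ] .

# Two cubes (e.g. successive annual releases) declare the same
# shape simply by pointing at the same DSD resource.
ex:unemployment2011 a qb:DataSet ; qb:structure ex:unemploymentDSD .
ex:unemployment2012 a qb:DataSet ; qb:structure ex:unemploymentDSD .
```

A consumer can check that two datasets share a structure by comparing the single `qb:structure` URI, rather than comparing distributed class axioms.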
Not all, but much of it could indeed be expressed using subclasses. However, it does get verbose. For each cube you also need to subclass qb:Slice and qb:Observation as well. Then you need OWL restrictions to relate these, so that your qb:DataSet subclass only has the right qb:Slice and qb:Observation subclasses. Since there can be several different qb:SliceKeys, you then need correspondingly many different qb:Slice subclasses and corresponding unionOfs to tie those back together again. Perfectly possible, but it means that the structure definition is rather distributed (see #2 above), less compact, and much less easy to inspect (see #3 above).

The aspects that aren't expressible in OWL (the ordering of dimensions, attachment levels, the measureDimension role) could be expressed using AnnotationProperties. Again perfectly possible, but at that point you have a mix of a standard mechanism (OWL) and some custom machinery, and thus still need to do some custom handling.

> cubes:ComponentSpecification adds a cardinality choice (0-1 or 1) and
> (optionally) a specific order, and cubes:DimensionProperty can specify a
> cubes:codeList for the values of this property.
>
> Cardinality can be handled by OWL,

Sure.

> RDF can describe ordered lists,

Yes, and the initial design used that, but it proved problematic. At one point it was proposed that the DSD should be a list of ComponentProperties so that the ordering was clear. The problem is that querying RDF lists is tricky (though the advent of SPARQL 1.1 has alleviated that somewhat). This was especially annoying because the majority of cubes don't specify an order, so complicating access in the general case to cater for minority cases was distasteful. We also discussed a dual approach where each of the ComponentProperties was directly attached and the dimensions were also attached as an ordered list, but that repetition was unacceptable.
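To give a flavour of the verbosity, here is a fragment of what the OWL alternative might look like for just one hypothetical dimension (all `ex:` names are invented):

```turtle
@prefix qb:   <http://purl.org/linked-data/cube#> .
@prefix owl:  <http://www.w3.org/2002/07/owl#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .
@prefix ex:   <http://example.org/ns#> .

# Each cube needs its own subclass of qb:Observation (and similarly
# of qb:Slice and qb:DataSet), with one restriction per component.
ex:MyObservation rdfs:subClassOf qb:Observation ,
    [ a owl:Restriction ;
      owl:onProperty  ex:refPeriod ;
      owl:cardinality "1"^^xsd:nonNegativeInteger ] .

# ... repeated for every dimension, measure and attribute, plus
# further restrictions (and unionOfs over the slice subclasses) to
# tie ex:MyDataSet to the right ex:MySlice and ex:MyObservation.
```

The DSD expresses the same shape in one compact resource instead of a family of class axioms scattered across several subclasses.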
The current qb:order approach, while a bit ugly, means that the common unordered case is trivial, and it also has the benefit of supporting partial orders. A common use case is wanting to put one dimension (e.g. time) first (to indicate that the data should be displayed as a time series) while not caring about the other dimensions. The current design caters for that with less complexity than representing partial orders in lists.

> and
> the cubes:codeList value can simply be the range of the dimension property.

It's not quite that easy. To do that for coded properties you need to adopt a particular design pattern for how to use SKOS. The value of the DimensionProperty is the skos:Concept, not the skos:ConceptScheme. So to use rdfs:range you have to introduce a subclass of skos:Concept for each concept scheme. In the data.gov.uk work that design pattern for SKOS use is strongly encouraged, so in that case we can use rdfs:range. We also adopted that pattern in creating the RDF rendering of the SDMX code lists. However, the bulk of the external SKOS vocabularies we come across don't use that pattern, which makes it harder to use them "out of the box" with an rdfs:range approach.

It also makes discovering the code list a little trickier (#3 again). The relationship between the subclass of skos:Concept and the associated skos:ConceptScheme can be expressed as an owl:hasValue restriction on skos:inScheme. To query for that information you need everyone to follow that pattern (and even data.gov.uk didn't go as far as requiring those hasValue restrictions), and it's not an intuitive query. So again a single annotation property enables easy discovery and inspection of the codeList, while not precluding the use of rdfs:range over subclasses of skos:Concept for people (like us) who are happy with that pattern of SKOS use.

> The specification does not give any reason why they invent all this
> instead of expressing the same with basic RDFS/OWL patterns.
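The two ways of linking a dimension to its code list can be sketched side by side; the `ex:` names are invented for illustration:

```turtle
@prefix qb:   <http://purl.org/linked-data/cube#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix owl:  <http://www.w3.org/2002/07/owl#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix ex:   <http://example.org/ns#> .

# Annotation approach: the code list is directly discoverable
# with a single triple lookup from the property.
ex:region a qb:DimensionProperty ;
    qb:codeList ex:regionScheme ;
    rdfs:range  ex:Region .   # optional, for those following the pattern below

# rdfs:range approach: a subclass of skos:Concept per scheme, tied
# back to its skos:ConceptScheme via an owl:hasValue restriction.
ex:Region rdfs:subClassOf skos:Concept ,
    [ a owl:Restriction ;
      owl:onProperty skos:inScheme ;
      owl:hasValue   ex:regionScheme ] .
ex:regionScheme a skos:ConceptScheme .
```

Discovering the code list in the second form means querying through the hasValue restriction, which is the unintuitive query referred to above.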
Hopefully I've given some insight into why this isn't invention so much as adopting SDMX ideas (which are explicitly referenced in the specification as the basis of the approach), and why the alternative is not *basic* RDFS/OWL so much as relatively sophisticated use of OWL. A single, compact, declarative statement of the structure does seem to have proved appealing and useful in practice.

All that said, it might be an interesting exercise for someone to write a compiler to convert a DSD into the corresponding set of OWL (for those parts of the DSD where that's possible). That would allow you to do things like cube structure validation using an OWL closed-world checker. [Though compiling a DSD into SPARQL is easier and likely to be a more effective validation solution.]

Dave
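As a hedged illustration of the SPARQL route, a DSD-to-SPARQL compiler could emit checks along these lines (this is only a sketch of one possible check, not part of any specification; it uses a SPARQL 1.1 property path):

```sparql
PREFIX qb: <http://purl.org/linked-data/cube#>

# True if some observation omits a value for a dimension that its
# dataset's structure definition declares - i.e. the cube is malformed.
ASK {
  ?obs  a qb:Observation ;
        qb:dataSet ?ds .
  ?ds   qb:structure ?dsd .
  ?dsd  qb:component/qb:dimension ?dim .
  FILTER NOT EXISTS { ?obs ?dim [] }
}
```

Because the DSD is a compact, queryable resource, such checks can be generated mechanically from it rather than hand-written per cube.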
Received on Wednesday, 15 August 2012 09:00:50 UTC