Re: Absence of mention of units of measure for columns is very surprising

> On Sep 18, 2015, at 7:52 AM, Simon Cox <dr.shorthair@gmail.com> wrote:
> 
> I am involved with the Research Data Alliance activity on Data Types and Registries. 
> The goal of this is to 
> (i) develop a format/model for the description of the structure of datasets
> (ii) allow the descriptions to be registered, so they can be referred to. 
> kinda like enhanced MIME-types, so that client applications know what's inside a dataset, not just the file format. 
> A prototype has already been developed by CNRI, with a test deployment. 
> 
> There is clearly a significant shared concern with CSV on the web, so in preparation for meetings next week I consulted the Candidate Specs, particularly the "Model for Tabular Data and Metadata on the Web". I have not read the full suite of documents in detail, but was surprised to find that 'units of measure' is not mentioned in the set of 'core annotations' for columns http://www.w3.org/TR/tabular-data-model/#columns (in most tables data in a single column will have a common unit of measure). 
> 
> I raised this with Jeremy, and he showed me the route which can be followed, by adding a column or traversing through the QB vocabulary. 
> However, this is complicated, and not made immediately available or even flagged in the text. 
> I strongly suggest 
> (i) at least alerting readers to how this very common requirement can be managed
> (ii) better still, consider adding uom as a standard column annotation. 

Just my perspective, but I think the issue is that there is no one standard way of describing units in RDF data. As the basic data model used by CSV on the Web closely corresponds to RDF, the fact that literal values extracted from CSV cells don’t have more dimensions is related to this underlying lack of a data model for describing data with units.

Searching for this indicated a couple of different ways to handle it:

* Define an OWL datatype which describes the values with units (see http://stackoverflow.com/questions/20248369/units-of-measurement-in-owl-and-rdf)

unit:megaPascal rdf:type   rdfs:Datatype ;
                rdfs:label "MPa" .

unit:Pascal rdf:type   rdfs:Datatype ;
            rdfs:label "Pa" .

:AlMg3 prop:hasTensileStrength "300"^^unit:megaPascal .
:AlMg3 prop:hasYieldStrength   "2"^^unit:Pascal .

QUDT (http://www.openphacts.org/specs/units/) also describes similar methods.

CSVW already supports this by allowing an arbitrary datatype using the @id field on a datatype (see http://www.w3.org/TR/tabular-metadata/#datatypes).

@id If included, @id is a link property that identifies the datatype described by this datatype description. The value of this property becomes the id annotation for the described datatype. It must not start with _: and it must not be the URL of a built-in datatype.
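
As a rough sketch of what that could look like, the Python below builds a CSVW metadata document in which one column uses a derived datatype identified by a unit IRI. The file name, the column names, and the example.org IRI are assumptions made up for illustration, and whether a given converter emits that IRI as the literal's datatype would need checking against the CSV2RDF document.

```python
import json

# Hypothetical unit IRI; the example.org namespace is an assumption.
MEGAPASCAL = "http://example.org/unit/megaPascal"

metadata = {
    "@context": "http://www.w3.org/ns/csvw",
    "url": "strengths.csv",  # hypothetical CSV file
    "tableSchema": {
        "columns": [
            {"name": "material", "titles": "material"},
            {
                "name": "tensileStrength",
                "titles": "tensile strength",
                # Derived datatype: numeric base, identified by the unit IRI.
                # The @id must not be a blank node or a built-in datatype URL.
                "datatype": {"@id": MEGAPASCAL, "base": "decimal"},
            },
        ]
    },
}

# Serialize the metadata as it would appear in a .csv-metadata.json file.
print(json.dumps(metadata, indent=2))
```

This only works when every cell in the column shares the same unit, since the datatype is declared once for the whole column.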

* Use a structured value to represent the data, for example:

:AlMg3 prop:hasTensileStrength [ rdf:value 300 ; ex:units unit:MegaPascal ] .

This can be supported using the virtual columns feature, which allows additional relationships to be created and the values of different columns to be attached to them. This might also be useful when the units vary from row to row.
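
A minimal sketch of such a schema, again built in Python, follows. It assumes a CSV with material, strength and unit columns, and every example.org IRI is hypothetical. The intermediate measurement node is given an IRI through a URI template rather than being a blank node as in the Turtle above.

```python
import json

BASE = "http://example.org/"  # hypothetical namespace
RDF_VALUE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#value"

# URI template for the intermediate "measurement" node of each row.
measure = BASE + "measure/{material}"

columns = [
    # Only used to build IRIs; no triple is emitted for this cell itself.
    {"name": "material", "titles": "material", "suppressOutput": True},
    # <measure> rdf:value 300
    {"name": "strength", "titles": "strength", "datatype": "decimal",
     "aboutUrl": measure, "propertyUrl": RDF_VALUE},
    # <measure> ex:units <unit>, where the unit can differ per row
    {"name": "unit", "titles": "unit",
     "aboutUrl": measure, "propertyUrl": BASE + "units",
     "valueUrl": BASE + "unit/{unit}"},
    # Virtual column linking the material to its measurement node;
    # virtual columns have to come after all non-virtual ones.
    {"name": "link", "virtual": True,
     "aboutUrl": BASE + "material/{material}",
     "propertyUrl": BASE + "prop/hasTensileStrength",
     "valueUrl": measure},
]

metadata = {
    "@context": "http://www.w3.org/ns/csvw",
    "url": "strengths.csv",  # hypothetical CSV file
    "tableSchema": {"columns": columns},
}

print(json.dumps(metadata, indent=2))
```

Because the unit comes from a cell rather than from the column description, this form copes with tables that mix units, at the cost of a more verbose schema.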

I think describing a use case for this, and including it as an informative example in one of the documents or in a primer, would be a good way to approach this right now. As common practice emerges, this could be incorporated into a future version of these specs, but that should be done in harmony with a standard way of describing dimensional data in RDF and JSON.

Gregg

> Simon Cox
> CSIRO, co-convenor of RDA Data Types activity. 

Received on Monday, 21 September 2015 15:19:44 UTC