Re: [TMO] patient record normalization from M. Scott Marshall on 2010-09-10 (public-semweb-lifesci@w3.org from September 2010)

From: M. Scott Marshall <mscottmarshall@gmail.com>
Date: Fri, 10 Sep 2010 23:42:02 +0200
To: "Eric Prud'hommeaux" <eric@w3.org>
Cc: Chimezie Ogbuji <ogbujic@ccf.org>, "public-semweb-lifesci@w3.org" <public-semweb-lifesci@w3.org>, Michel_Dumontier <Michel_Dumontier@carleton.ca>
Message-ID: <AANLkTin=0nYfNuO6RBYq6b0eWm-t5bG5Vj2p4M8DUaZF@mail.gmail.com>

Hi Eric,

The business of standardizing units reminds me of:

http://science.nasa.gov/science-news/science-at-nasa/2007/08jan_metricmoon/
followed by:
http://news.bbc.co.uk/2/hi/science/nature/462264.stm

For me, the story of losing an orbiter because of an accidental clash
between imperial and metric units was a poster child for the Semantic
Web, and for the problem you describe. You see, the machines will never
know what the numbers mean unless we use a semantic layer as well as a
syntactic layer. The problem with units is that they seem to be
both semantic and syntactic, somewhere in between.

Hard as I try, I don't understand why you want to change the way that
you describe data to constrain the data that is being described. Well,
actually I do. You want to force anyone annotating or publishing data
in the TMO vocabulary to use a single set of units (right?). It could
be an effective way to achieve the goal but it seems rather
heavy-handed. Overloading a predicate and adding English parameters to
it might make it obvious to people that they're only supposed to use
your units (because you provide no others) when they use your
ontology, but it doesn't solve the problem. Yes,
normalization of units is necessary in order to integrate data. But
the problem of normalization won't go away if you glob two semantic
aspects together in the *description of the data* (i.e. blood pressure
measurement type and units). I see from your language that you think
it will force users to "inject" data into the data model with the
preferred units when publishing data in the TMO vocabulary, but doesn't
this just point to the processing that is unavoidable for
integrating/comparing data? We will always need to get data into the
same units in order to integrate it. I feel your pain as you try to
solve it in SPARQL (and I see that it can be a very real problem), but
I think there must be a better way than to overload a predicate and
thereby obfuscate the data model. If nothing else, let's depend on
consistency checks and good documentation, as already suggested. We
can't expect to accomplish *everything* in SPARQL.

Actually, isn't this a data publishing issue? If someone publishes
systolic blood pressure values as linked data using TMO, shouldn't
they refer to the TMO ontology and the units that they used in the
provenance of the named graph containing it? If we know from the
provenance about the named graph that it uses TMO [<graphURI>
void:usesVocabulary TMO] and mmHg [bloodPressureMeasurements hasUnits
mmHg] to describe blood pressure, then we can use that information to
pre-select the graph during federation (in a world of abundance and
sloppy units). In this way, we could automatically convert values as
needed, presumably based on conversions that derive from the unit
ontology (no?). Such a software feat might require coding or reasoning
outside SPARQL, but then data integration already does.
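
To make the idea concrete, here is a rough sketch in Python of
pre-selecting graphs by their provenance and converting values to a
common unit outside SPARQL. Everything in it is hypothetical: the graph
URIs, the provenance dictionary (standing in for the named-graph
metadata), the sample values, and the small conversion table (standing
in for whatever a unit ontology would provide).

```python
# Hypothetical sketch: graph URIs, values, and helper names are
# illustrative only and are not part of TMO or any unit ontology.

# Conversion factors to a common unit (mmHg). In practice these would
# presumably be derived from a unit ontology.
TO_MMHG = {
    "mmHg": 1.0,
    "kPa": 7.50062,  # 1 kPa is approximately 7.50062 mmHg
}

# Stand-in for provenance recorded about each named graph:
# which vocabulary it uses and which unit its blood pressures are in.
GRAPH_PROVENANCE = {
    "http://example.org/graph/a": {"vocabulary": "TMO", "unit": "mmHg"},
    "http://example.org/graph/b": {"vocabulary": "TMO", "unit": "kPa"},
    "http://example.org/graph/c": {"vocabulary": "other", "unit": "mmHg"},
}

# Stand-in for the systolic blood pressure values in each graph.
GRAPH_VALUES = {
    "http://example.org/graph/a": [120.0, 135.0],
    "http://example.org/graph/b": [16.0],
    "http://example.org/graph/c": [110.0],
}


def select_graphs(provenance, vocabulary):
    """Pre-select graphs that use the vocabulary and a convertible unit."""
    return [
        uri
        for uri, meta in provenance.items()
        if meta["vocabulary"] == vocabulary and meta["unit"] in TO_MMHG
    ]


def normalized_values(uri):
    """Return the graph's values converted to mmHg."""
    factor = TO_MMHG[GRAPH_PROVENANCE[uri]["unit"]]
    return [value * factor for value in GRAPH_VALUES[uri]]


if __name__ == "__main__":
    for uri in select_graphs(GRAPH_PROVENANCE, "TMO"):
        print(uri, normalized_values(uri))
```

The point of the sketch is only that the unit lives in the metadata
about the graph, so the selection and conversion steps can be automated
without touching the predicates used for the measurements themselves.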

Clear tagging of the data with units should be a best practice both
inside and outside the Semantic Web. I am in favor of a two-component
approach (measurement type and units kept separate), complemented by
good provenance practice.
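
As a rough illustration of the two-component idea together with the
consistency checks mentioned earlier, here is a small Python sketch.
The Measurement class, the check_units helper, and the table of
expected units are all hypothetical names made up for this example.

```python
# Hypothetical sketch: the class, helper, and expected-unit table are
# illustrative only and do not come from TMO.
from dataclasses import dataclass


@dataclass(frozen=True)
class Measurement:
    kind: str     # what was measured, e.g. "systolic blood pressure"
    value: float  # the number itself
    unit: str     # the unit, kept as a separate component


def check_units(measurements, expected_units):
    """Return the measurements whose unit is not the one expected
    for their kind (a simple consistency check)."""
    return [
        m for m in measurements
        if expected_units.get(m.kind) != m.unit
    ]


if __name__ == "__main__":
    expected = {"systolic blood pressure": "mmHg"}
    records = [
        Measurement("systolic blood pressure", 120.0, "mmHg"),
        Measurement("systolic blood pressure", 16.0, "kPa"),
    ]
    # Records flagged here would need conversion before integration.
    print(check_units(records, expected))
```

Because the unit is its own field, the same check works for any kind
of measurement, and the predicate describing the measurement type does
not need to encode the unit.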

-Scott

On Fri, Sep 10, 2010 at 10:30 PM, Michel_Dumontier
<Michel_Dumontier@carleton.ca> wrote:
>
>> But then anyone merging two TMO documents with different units has the
>> normalization burden. If we pick a unit and annotate the predicates,
>> then the folks who would have to do the work of merging with non-TMO
>> documents (who would have to introduce some rules/canonicalization
>> pipeline anyways) have the OWL hooks to automate that merging.
>
> Again, if we are considering TMO, then we can impose a restriction to specify the unit - we can also make this clear in documentation relating to the measurements with units.
>
>> > Also, having domain-independent predicates makes it easier to render
>> > a view of the data (for human consumption) that includes visual cues
>> > regarding the units of measures associated with values directly from
>> > the data since such tools will always expect the same set of terms to
>> > capture a value and its unit of measurement.
>>
>> If you've bought the argument for early normalization, isn't it
>> needlessly dangerous to offer the freedom to express BP in mmHg in an
>> ontology that's required to have BP in MPa? It does put more burden on
>> the use of generic data browsers (they'd have to read the OWL in order
>> to present the user with units), but I think that use case is small
>> compared to the cost of data consumption.
>
> I don't think we should tailor our data model to generic data browsers - they are far too simple for the complex knowledge that we have to represent.
>
> m.

Received on Friday, 10 September 2010 21:42:31 UTC