Re: [TMO] patient record normalization from Chimezie Ogbuji on 2010-09-11 (public-semweb-lifesci@w3.org from September 2010)

From: Chimezie Ogbuji <ogbujic@ccf.org>
Date: Sat, 11 Sep 2010 13:16:53 -0400
To: "Eric Prud'hommeaux" <eric@w3.org>
cc: Michel_Dumontier <Michel_Dumontier@carleton.ca>, "public-semweb-lifesci@w3.org" <public-semweb-lifesci@w3.org>
Message-ID: <C8B131C5.135FC%ogbujic@ccf.org>
On 9/10/10 4:08 PM, "Eric Prud'hommeaux" <eric@w3.org> wrote:
>> ..snip ..
>> I'm not so sure if the idea that databases with measurement data are likely
>> to have mixed units is very compelling in the realm of patient data.
>> Patient data is more often than not local to a particular institution and its
>> conventions, and so I would think that it is more likely that you will find a
>> more homogeneous combination of units (where BP, for instance, is primarily
>> measured in one unit or another depending on the institution).  Certainly,
>> if you have an integrated dataset this assumption is less likely to hold,
>> but even then 1) I don't think the range of units you are likely to see in
>> such combined datasets will be that diverse - international conventions
>> notwithstanding - and 2) Normalization into a canonical set (Vipul's
>> suggestion) seems a reasonable approach to adopt as part of the integration,
>> rather than to delay this normalization until the point when you query
>> the data (making the query and any reasoning involved more complex).
> 
> I agree that databases within an institution are likely to be
> homogeneous, at least for any given epoch, and even merging
> international sources, four blood pressure units is pretty small,
> 
> but this represents an enormous increase in complexity in a query
> (five-fold increase in query size for the sample monitoring query at
> the bottom).

The complexity is inevitable (in either case) if you don't know beforehand
which units are being used: you will need a disjunction for every
combination, as in your use of UNIONs below.  But, as was said later in the
thread, you can easily add appropriate constraints to the ontology so that
it is clear which units are being used.
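
As a rough sketch of how the disjunction grows, the Python below assembles a
SPARQL query with one UNION branch per unit-specific predicate.  Apart from
trans:systolicMmHg, the predicate names and the namespace URI are made up for
illustration, and the branches return raw values that would still need
converting before they could be compared.

```python
# Illustrative only: apart from trans:systolicMmHg, the predicate names and
# the namespace URI below are hypothetical.
PREFIX = "PREFIX trans: <http://example.org/trans#>"
UNIT_PREDICATES = [
    "trans:systolicMmHg",
    "trans:systolicKPa",
    "trans:systolicMPa",
    "trans:systolicPsi",
]


def build_query(predicates):
    """Return a SPARQL SELECT with one UNION branch per unit-specific predicate."""
    branches = ["{ ?obs %s ?value }" % p for p in predicates]
    body = "\n  UNION\n  ".join(branches)
    return "%s\nSELECT ?obs ?value WHERE {\n  %s\n}" % (PREFIX, body)


# Units unknown beforehand: one branch per candidate unit.
mixed_query = build_query(UNIT_PREDICATES)
# Units known (or normalized) beforehand: a single branch suffices.
single_query = build_query(UNIT_PREDICATES[:1])
```

With four candidate predicates the first query carries four branches joined
by three UNION keywords; the second has a single triple pattern.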

> I, and perhaps Vipul, am/are advocating canonicalization on the way
> into the standardized structure as opposed to at use time. The former
> makes querying and rules simpler and makes query federation tractable
> with conventional tooling. (Use-time normalization would require a set
> of rules run over the data, which I believe reduces the distinction
> between SemWeb data and conventional use of data dumps.)

Yes, I agree with this. *If* you do decide to normalize your units then you
certainly want to do it beforehand *and* you want to provide some
indication as to what those canonical units are.
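
A minimal sketch of what normalizing "on the way in" might look like, in
Python.  The unit labels, the function name and the choice of MPa as the
canonical unit (borrowed from the :systolicMPa example in this thread) are
assumptions for illustration, not part of any ontology discussed here.

```python
# Conversion factors to pascals.  The unit labels are illustrative strings,
# not identifiers from any ontology.
PASCALS_PER_UNIT = {
    "Pa": 1.0,
    "kPa": 1_000.0,
    "MPa": 1_000_000.0,
    "mmHg": 133.322387415,  # conventional millimetre of mercury
}

# Assumed canonical unit; stated once so consumers know what to expect.
CANONICAL_UNIT = "MPa"


def normalize(value, unit, target=CANONICAL_UNIT):
    """Convert a pressure reading to the target unit; reject unknown units."""
    if unit not in PASCALS_PER_UNIT:
        raise ValueError("unknown pressure unit: %r" % unit)
    pascals = value * PASCALS_PER_UNIT[unit]
    return pascals / PASCALS_PER_UNIT[target], target


# A reading captured in mmHg is converted once, at integration time.
value, unit = normalize(120, "mmHg")
```

Keeping the canonical unit in one named constant is one way of providing the
"indication" mentioned above; in RDF it would be stated in the data or the
ontology instead.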

>> Personally, the idea of 'embedding' the units into the predicate doesn't
>> appeal to me, mainly because the predicate (for example):
>> 
>> trans:systolicMmHg
>> 
>> is overloaded to capture the meaning associated with systolic pressure *and*
>> the particular unit in which it was captured or represented.  The former is
>> ontological and the latter is epistemic.  However, the more practical issue
>> is that the set of terms for measurement will grow (quite rapidly) with the
>> number of different units you want to be able to represent in your dataset.

> I don't see the growth, given that for the terms I'd like to sanitize,
> I specifically want only one unit. That is, there'd be only :systolicMPa.

If you are normalizing, then yes you will only have one such predicate in
each case, but then every dataset with a canonical combination of units for
a particular kind of measure will have its own single term that captures the
semantics of both what is measured and the units used and this is not very
interoperable from the perspective of a person who wants to ask the same
query from one dataset to the next.

> Yes, and my argument for representation is only relevant if we normalize.

Okay. My misunderstanding.  I thought they were being discussed separately.
 
>> I'm not sure I follow this rationale.  If you implement normalization in this
>> way and with such (overloaded) predicates, then determining the relationship
>> between the value and its unit is now a reasoning problem (i.e., you need to
>> 'interpret' the predicate WRT the ontology to determine the appropriate
>> units).  It just seems more straightforward to have generic,
>> domain-independent predicates that directly relate a 'quality value' with
>> its units and scalar value.  Transformations and normalizations can then
>> happen at the point when data is being integrated, and the semantics of the
>> measured value is still understood.
> 
> But then anyone merging two TMO documents with different units has the
> normalization burden.
> If we pick a unit and annotate the predicates,
> then the folks who would have to do the work of merging with non-TMO
> documents (who would have to introduce some rules/canonicalization
> pipeline anyways) have the OWL hooks to automate that merging.

Yes, this is true.  So, I guess the trade-off is ease of merging versus
interoperable querying.

>> Also, having domain-independent predicates makes it easier to render a view
>> of the data (for human consumption) that includes visual cues regarding the
>> units of measures associated with values directly from the data since such
>> tools will always expect the same set of terms to capture a value and its
>> unit of measurement.
> 
> If you've bought the argument for early normalization, isn't it
> needlessly dangerous to offer the freedom to express BP in mmHg in an
> ontology that's required to have BP in MPa?

There is no freedom if the ontology expresses this requirement and is used
to interpret the data:  the RDF graph wouldn't be satisfiable.
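
To illustrate (and only as a stand-in for what a reasoner would do with the
actual ontology), here is a plain Python check over triples.  The ex: class
and unit names are hypothetical; only muo:measuredIn comes from this thread.

```python
# Hypothetical requirement: every ex:SystolicBP must be measured in ex:MPa.
REQUIRED_UNIT = {"ex:SystolicBP": "ex:MPa"}


def violations(triples):
    """List (subject, found_unit, required_unit) for units that break the rule."""
    types = {s: o for s, p, o in triples if p == "rdf:type"}
    bad = []
    for s, p, o in triples:
        if p != "muo:measuredIn":
            continue
        required = REQUIRED_UNIT.get(types.get(s))
        if required is not None and o != required:
            bad.append((s, o, required))
    return bad


data = [
    ("ex:bp1", "rdf:type", "ex:SystolicBP"),
    ("ex:bp1", "muo:measuredIn", "ex:MPa"),
    ("ex:bp2", "rdf:type", "ex:SystolicBP"),
    ("ex:bp2", "muo:measuredIn", "ex:mmHg"),  # not the required unit
]

problems = violations(data)
```

Here ex:bp2 is the only offending observation.  A reasoner interpreting the
data against the ontology would flag the graph as a whole rather than
returning a list.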

> It does put more burden on
> the use of generic data browsers (they'd have to read the OWL in order
> to present the user with units), but I think that use case is small
> compared to the cost of data consumption.

So, this is probably not salient to the main thrust of this thread, but
hints at a greater problem I have with Linked Data and the tension between
the use of ontologies (and rules) to interpret data versus relying on HTTP
interactions for the same purpose.  If the data (and the domain it reflects)
is rich in meaning, then logical interpretation is the more appropriate tool
to 'understand the data'.  Otherwise, relying on URI lookup is probably more
appropriate for use with generic browsers.  Another way of thinking about it
is the difference between 'informed introspection' vs. 'follow your nose'.

Generic data browsers will always have this burden when faced with
ontologically rich data, since they are (by definition) generic.  By
generic, I take it you mean they don't have or use a reasoner.

>> Unless I have misunderstood you, it sounds like you think that the use of
>> muo:measuredIn and muo:numericalValue *requires* input normalization.  I
>> don't think this is the case.  These predicates say nothing about whether
>> the use of units is homogeneous.
> 
> I don't believe that assertions require normalization, but I think
> that practical querying and inference, particularly federated query,
> require normalization. And if that's the form we want people to
> "publish", why provide a structure for the non-normalized form?

The use of muo:measuredIn and muo:numericalValue is orthogonal to
normalization.  What I understood you and Vipul to mean by unit
normalization is to say: for a particular measurement (BP in this case), we
will always use this particular unit in our dataset.  The use of
muo:measuredIn and muo:numericalValue doesn't restrict whether or not you do
this.
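
A small Python sketch of that orthogonality: the same two predicates describe
a normalized dataset and a mixed one, and the same lookup code works on both.
The ex: identifiers and the sample values are invented for illustration.

```python
def readings(triples):
    """Pair each subject's muo:numericalValue with its muo:measuredIn unit."""
    values = {s: o for s, p, o in triples if p == "muo:numericalValue"}
    units = {s: o for s, p, o in triples if p == "muo:measuredIn"}
    return {s: (values[s], units[s]) for s in values if s in units}


def is_homogeneous(triples):
    """True when every reading in the dataset uses the same unit."""
    return len({unit for _, unit in readings(triples).values()}) <= 1


# Dataset that happens to be normalized to one unit (hypothetical ex: names).
normalized = [
    ("ex:bp1", "muo:numericalValue", 0.0160),
    ("ex:bp1", "muo:measuredIn", "ex:MPa"),
    ("ex:bp2", "muo:numericalValue", 0.0173),
    ("ex:bp2", "muo:measuredIn", "ex:MPa"),
]

# Dataset with mixed units, expressed with exactly the same predicates.
mixed = [
    ("ex:bp3", "muo:numericalValue", 120),
    ("ex:bp3", "muo:measuredIn", "ex:mmHg"),
    ("ex:bp4", "muo:numericalValue", 0.0173),
    ("ex:bp4", "muo:measuredIn", "ex:MPa"),
]
```

Whether a dataset is homogeneous is a property of the data (or of constraints
placed on it), not of the vocabulary used to state the value and its unit.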

-- Chime


Received on Saturday, 11 September 2010 17:18:19 UTC