Re: [TMO] patient record normalization

* Chimezie Ogbuji <ogbujic@ccf.org> [2010-09-11 13:16-0400]
> On 9/10/10 4:08 PM, "Eric Prud'hommeaux" <eric@w3.org> wrote:
> >> ..snip ..
> >> I'm not so sure if the idea that databases with measurement data are likely
> >> to have mixed units is very compelling in the realm of patient data.
> >> Patient data is more often than not local to a particular institution and its
> >> conventions, and so I would think that it is more likely that you will find a
> >> more homogenous combination of units (where BP, for instance is primarily
> >> measured in one unit or another depending on the institution).  Certainly,
> >> if you have an integrated dataset this assumption is less likely to hold,
> >> but even then 1) I don't think the range of units you are likely to see in
> >> such combined datasets will be that diverse - international conventions
> >> notwithstanding - and 2) normalization into a canonical set (Vipul's
> >> suggestion) seems a reasonable approach to adopt as part of the integration,
> >> rather than to delay this normalization until the point when you query
> >> the data (making the query and any reasoning involved more complex).
> > 
> > I agree that databases within an institution are likely to be
> > homogeneous, at least for any given epoch, and even merging
> > international sources, four blood pressure units is pretty small,
> > 
> > but this represents an enormous increase in complexity in a query
> > (five-fold increase in query size for the sample monitoring query at
> > the bottom).
> 
> The source of the complexity is inevitable (in either case) if you don't
> know what units are being used beforehand: you will need a disjunction
> for every combination, as in your use of UNIONs below. But as was said
> later in the thread, you can easily state appropriate constraints in the
> ontology so that it is clear which units are being used.
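
Right -- for concreteness, that disjunction looks something like the
following sketch (SPARQL 1.1; the property and unit URIs are
illustrative stand-ins, not the actual TMO/MUO terms):

  SELECT ?patient ?sysMPa WHERE {
    ?patient tmo:systolicBP ?m .
    { ?m muo:measuredIn unit:mmHg ; muo:numericalValue ?v .
      BIND (?v * 0.000133322 AS ?sysMPa) }   # 1 mmHg = 133.322 Pa
    UNION
    { ?m muo:measuredIn unit:kPa ; muo:numericalValue ?v .
      BIND (?v * 0.001 AS ?sysMPa) }
    UNION
    { ?m muo:measuredIn unit:MPa ; muo:numericalValue ?sysMPa }
  }

One more branch per admissible unit, and the same again for diastolic;
that's the blow-up I'm worried about.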
> 
> > I (and perhaps Vipul) am advocating canonicalization on the way
> > into the standardized structure as opposed to at use time. The former
> > makes querying and rules simpler and makes query federation tractable
> > with conventional tooling. (Use-time normalization would require a set
> > of rules run over the data, which I believe reduces the distinction
> > between SemWeb data and conventional use of data dumps.)
> 
> Yes, I agree with this. *If* you do decide to normalize your units then you
> certainly want to do it beforehand *and* you want to provide some
> indication as to what those canonical units are.
> 
> >> Personally, the idea of 'embedding' the units into the predicate doesn't
> >> appeal to me, mainly because the predicate (for example):
> >> 
> >> trans:systolicMmHg
> >> 
> >> is overloaded to capture the meaning associated with systolic pressure *and*
> >> the particular unit in which it was captured or represented.  The former is
> >> ontological and the latter is epistemic.  However, the more practical issue
> >> is that the set of terms for measurement will grow (quite rapidly) with the
> >> number of different units you want to be able to represent in your dataset.
> 
> > I don't see the growth, given that for the terms I'd like to sanitize,
> > I specifically want only one unit. That is, there'd be only :systolicMPa.
> 
> If you are normalizing, then yes you will only have one such predicate in
> each case, but then every dataset with a canonical combination of units for
> a particular kind of measure will have its own single term that captures the
> semantics of both what is measured and the units used, and this is not very
> interoperable from the perspective of a person who wants to ask the same
> query from one dataset to the next.

intra-vocabulary use case:

That's where the objective of standardization comes in. If we are
hoping that folks will use this ontology, then it makes sense to
minimize the cost of using it. The exchange and exploitation of data
expressed in TMO is the use case I'd put at highest priority.


inter-vocabulary use case:

Another use case is interfacing with folks who aren't using the
standard, but are using RDF to represent data in the same domain.
In the circumstances where their data is, or can be, expressed in
terms of muo, e.g.

  [ theirs:sys [ muo:measuredIn muo:dmHg ; muo:numericalValue 12 ] ;
    theirs:dia [ muo:measuredIn muo:dmHg ; muo:numericalValue 8 ] ] .

, the OWL description of tmo:systolicMPa will reveal that it also
implies the same structure, and they can adapt the units as needed.
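
For instance, the description might be as little as an annotation
pointing at the canonical unit (tmo:canonicalUnit and the MUO unit
individual here are hypothetical stand-ins):

  tmo:systolicMPa a owl:DatatypeProperty ;
      rdfs:label "systolic blood pressure (MPa)" ;
      tmo:canonicalUnit muo:megapascal .   # hypothetical annotation

A consumer holding the theirs:sys data above then has what they need
to scale dmHg into MPa.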


> > Yes, and my argument for representation is only relevant if we normalize.
> 
> Okay. My misunderstanding.  I thought they were being discussed separately.

Nah, if we're going to have heterogeneous units, I absolutely want
to use a popular convention to express that.

> >> I'm not sure I follow this rationale.  If you implement normalization in this
> >> way and with such (overloaded) predicates, then determining the relationship
> >> between the value and its unit is now a reasoning problem (i.e., you need to
> >> 'interpret' the predicate WRT the ontology to determine the appropriate
> >> units).  It just seems more straightforward to have generic,
> >> domain-independent predicates that directly relate a 'quality value' with
> >> its units and scalar value, transformations and normalizations can then
> >> happen at the point when data is being integrated, and the semantics of the
> >> measured value is still understood.
> > 
> > But then anyone merging two TMO documents with different units has the
> > normalization burden.
> > If we pick a unit and annotate the predicates,
> > then the folks who would have to do the work of merging with non-TMO
> > documents (who would have to introduce some rules/canonicalization
> > pipeline anyways) have the OWL hooks to automate that merging.
> 
> Yes, this is true.  So, I guess the trade-off is ease of merging versus
> interoperable querying.
> 
> >> Also, having domain-independent predicates makes it easier to render a view
> >> of the data (for human consumption) that includes visual cues regarding the
> >> units of measure associated with values directly from the data, since such
> >> tools will always expect the same set of terms to capture a value and its
> >> unit of measurement.
> > 
> > If you've bought the argument for early normalization, isn't it
> > needlessly dangerous to offer the freedom to express BP in mmHg in an
> > ontology that's required to have BP in MPa?
> 
> There is no freedom if the ontology expresses this requirement and is used
> to interpret the data:  the RDF graph wouldn't be satisfiable.

Right, but there is an implied freedom. If someone is writing
something to export, say, I2B2 into TMO and their data is in mmHg,
they'll see the units/value pairing and think their job is done.

Yes, an A-box consistency checker with the right OWL would tell them
that the class tmo:TMO_00000298 is unsatisfied in all of their data,
but they'd have to dig a bit, and we've led them down a
bit of a garden path. The odds that we either lose producers or that
they produce invalid data are substantially higher if there's a
suggestion of unit freedom.
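
That is, their export would look perfectly plausible on the wire; the
exact shape is a guess, but something like:

  [ a tmo:TMO_00000298 ;         # systolic measurement class
    muo:measuredIn unit:mmHg ;   # not the canonical MPa
    muo:numericalValue 120 ] .

Nothing jumps out until a reasoner folds in the MPa-only restriction.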


> > It does put more burden on
> > the use of generic data browsers (they'd have to read the OWL in order
> > to present the user with units), but I think that use case is small
> > compared to the cost of data consumption.
> 
> So, this is probably not salient to the main thrust of this thread, but
> hints at a greater problem I have with Linked Data and the tension between
> the use of ontologies (and rules) to interpret data versus relying on HTTP
> interactions for the same purpose.  If the data (and the domain it reflects)
> is rich in meaning, then logical interpretation is the more appropriate tool
> to 'understand the data'.  Otherwise, relying on URI lookup is probably more
> appropriate for use with generic browsers.  Another way of thinking about it
> is the difference between 'informed introspection' vs. 'follow your nose'.

> Generic data browsers will always have this burden when faced with
> ontologically rich data, since they are (by definition) generic and by
> generic, I take it you mean they don't have or use a reasoner.

Ahh no, by "generic data browser" I was seeking some use case
(beyond the heterogeneous-unit integration, which was already
explored) in which there was some use of the units vocabulary.
With generic-unit predicates, they can display the data directly;
in my proposal, they need to use the OWL closure.
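
Concretely, the browser's lookup in the two designs (a sketch;
tmo:canonicalUnit is the hypothetical annotation from above):

  # generic predicates: value and unit both sit in the instance data
  SELECT ?v ?unit WHERE {
    ?m muo:numericalValue ?v ; muo:measuredIn ?unit }

  # unit-in-predicate: the unit has to come out of the ontology
  SELECT ?v ?unit WHERE {
    ?patient ?measure ?v . ?measure tmo:canonicalUnit ?unit }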


> >> Unless I have misunderstood you, it sounds like you think that the use of
> >> muo:measuredIn and muo:numericalValue *requires* input normalization.  I
> >> don't think this is the case.  These predicates say nothing about whether
> >> the use of units is homogeneous.
> > 
> > I don't believe that assertions require normalization, but I think
> > that practical querying and inference, particularly federated query,
> > require normalization. And if that's the form we want people to
> > "publish", why provide a structure for the non-normalized form?
> 
> The use of muo:measuredIn and muo:numericalValue is orthogonal to
> normalization.  What I understood you and Vipul to mean by unit
> normalization is to say: for a particular measurement (BP in this case), we
> will always use this particular unit in our dataset.  The use of
> muo:measuredIn and muo:numericalValue doesn't restrict whether or not you do
> this.

agreed. I was trying to re-factor the discussion a bit by looking at
the non-normalized pipeline:
  publisher:
    p1. acquire BP in mmHg (from database, BP cuff, XML EPR...)
    p2. create TMO patient record (or non-materialized view)
  consumer:
    c1. normalize potentially relevant TMO patient records
    c2. integrate (concatenation, query federation, etc.)
A non-normalized TMO would describe the product of p2.

Examining the normalized pipeline:
  publisher:
    np1. acquire BP in mmHg (from database, BP cuff, XML EPR...)
    np2. create normalized TMO patient record (or non-materialized view)
  consumer:
    nc2. integrate (concatenation, query federation, etc.)
A normalized TMO would describe the product of np2, which would equal
the product of c1 in a non-normalized TMO. That is, it would define
the ontology used
when folks needed to integrate the data. The publisher might have some
non-normalized step between np1 and np2, but users of TMO don't need
to know about it.
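
np2, like c1, amounts to a one-shot rewrite; a sketch in SPARQL 1.1
terms, with illustrative property names and the mmHg-to-MPa factor:

  CONSTRUCT { ?patient tmo:systolicMPa ?mpa }
  WHERE {
    ?patient theirs:sys [ muo:measuredIn unit:mmHg ;
                          muo:numericalValue ?v ] .
    BIND (?v * 0.000133322 AS ?mpa)   # 1 mmHg = 133.322 Pa
  }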

The question I was trying to ask is, if the utility for our primary
use case comes from np2/c1, why standardize p2?


> -- Chime

-- 
-ericP
