- From: Eric Prud'hommeaux <eric@w3.org>
- Date: Fri, 10 Sep 2010 16:08:48 -0400
- To: Chimezie Ogbuji <ogbujic@ccf.org>
- Cc: Michel_Dumontier <Michel_Dumontier@carleton.ca>, "public-semweb-lifesci@w3.org" <public-semweb-lifesci@w3.org>
* Chimezie Ogbuji <ogbujic@ccf.org> [2010-09-10 13:45-0400]
> Hello.  Very interesting thread =).  My $0.02.  You say in your original
> email:
>
> >>> This greatly simplifies our life as we are otherwise likely to have a
> >>> variety of e.g. BP data in the database: 120/80 mmHg, 12/8 DmHg,
> >>> 16000/10667 Pa,
> >>> 16/11 MPa, 13 (PAM)
>
> I'm not so sure if the idea that databases with measurement data are likely
> to have mixed units is very compelling in the realm of patient data.
> Patient data is more often than not local to a particular institution and
> their conventions and so I would think that it is more likely that you will
> find a more homogeneous combination of units (where BP, for instance, is
> primarily measured in one unit or another depending on the institution).
> Certainly, if you have an integrated dataset this assumption is less likely
> to hold, but even then 1) I don't think the range of units you are likely to
> see in such combined datasets will be that diverse - international
> conventions notwithstanding - and 2) Normalization into a canonical set
> (Vipul's suggestion) seems a reasonable approach to adopt as part of the
> integration, rather than to delay this normalization until the point when
> you query the data (making the query and any reasoning involved more
> complex).

I agree that databases within an institution are likely to be
homogeneous, at least for any given epoch, and even when merging
international sources, four blood pressure units is a pretty small
number. Fielding them, though, represents an enormous increase in query
complexity (a five-fold increase in query size for the sample monitoring
query at the bottom). I, and perhaps Vipul, am advocating
canonicalization on the way into the standardized structure as opposed
to at use time. The former makes querying and rules simpler and makes
query federation tractable with conventional tooling.
(Use-time normalization would require a set of rules run over the data,
which I believe reduces the distinction between SemWeb data and
conventional use of data dumps.)

> Personally, the idea of 'embedding' the units into the predicate doesn't
> appeal to me, mainly because the predicate (for example):
>
> trans:systolicMmHg
>
> is overloaded to capture the meaning associated with systolic pressure *and*
> the particular unit in which it was captured or represented.  The former is
> ontological and the latter is epistemic.  However, the more practical issue
> is that the set of terms for measurement will grow (quite rapidly) with the
> number of different units you want to be able to represent in your dataset.

I don't see the growth, given that for the terms I'd like to sanitize,
I specifically want only one unit. That is, there'd be only :systolicMPa.

> On 9/10/10 12:53 PM, "Eric Prud'hommeaux" <eric@w3.org> wrote:
> > At W3, standardization includes detecting and eliminating redundant
> > flexibility. If someone says "<img src='X'/> == <img href='X'/>", we
> > say "pick exactly one or there will be bugs and inefficiency". To that
> > end, I'd like the TMO task force to have exactly one format for the tests
> > worth standardizing, e.g. blood pressure. Further, I'd like users of
> > the TMO to benefit from this stake in the ground; specifically, I don't
> > want them to query data that's half in MPa and half in mmHg. Voila my
> > desire for one inflexible representation.
>
> So, isn't this an argument for normalization not so much for how you
> represent measured values and their units?

Yes, and my argument for representation is only relevant if we normalize.

> > ..snip..
> > Normalization can also be enforced in the choice of
> > predicate; we can say that the object of cpr:systolicBpMPa¹ is in MPa.
> > We can write this down in the schema, and also as an OWL restriction.
> > This moves the burden of inference from users of the standard to those
> > who are mixing with data which has other units (a shrinking group when
> > standardization is successful).
>
> I'm not sure I follow this rationale.  If you implement normalization in this
> way and with such (overloaded) predicates, then determining the relationship
> between the value and its unit is now a reasoning problem (i.e., you need to
> 'interpret' the predicate WRT the ontology to determine the appropriate
> units).  It just seems more straightforward to have generic,
> domain-independent predicates that directly relate a 'quality value' with
> its units and scalar value; transformations and normalizations can then
> happen at the point when data is being integrated, and the semantics of the
> measured value is still understood.

But then anyone merging two TMO documents with different units has the
normalization burden. If we pick a unit and annotate the predicates,
then the folks who would have to do the work of merging with non-TMO
documents (who would have to introduce some rules/canonicalization
pipeline anyways) have the OWL hooks to automate that merging.

> Also, having domain-independent predicates makes it easier to render a view
> of the data (for human consumption) that includes visual cues regarding the
> units of measures associated with values directly from the data since such
> tools will always expect the same set of terms to capture a value and its
> unit of measurement.

If you've bought the argument for early normalization, isn't it
needlessly dangerous to offer the freedom to express BP in mmHg in an
ontology that's required to have BP in MPa? It does put more burden on
generic data browsers (they'd have to read the OWL in order to present
the user with units), but I think that use case is small compared to
the cost of data consumption.

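To make "canonicalization on the way in" concrete, here is a minimal
Python sketch of an ingest step that converts each reading once, so that
queries only ever see one unit. Everything named here (PASCALS_PER_UNIT,
canonicalize, ingest, the unit labels and the choice of mmHg as default
target) is hypothetical and not part of TMO or CPR; the factors are the
standard physical ones (1 mmHg is about 133.322 Pa).

```python
# Hypothetical ingest-time normalization sketch; names are illustrative only.

# Pascals per one unit (standard physical conversion factors).
PASCALS_PER_UNIT = {
    "Pa": 1.0,
    "kPa": 1000.0,
    "mmHg": 133.322,
}


def canonicalize(value, unit, target="mmHg"):
    """Convert one pressure value from `unit` to `target` via pascals."""
    try:
        return value * PASCALS_PER_UNIT[unit] / PASCALS_PER_UNIT[target]
    except KeyError as err:
        # Refuse unknown units instead of storing a value nobody can interpret.
        raise ValueError(f"unknown pressure unit: {err.args[0]}") from None


def ingest(readings, target="mmHg"):
    """Normalize (systolic, diastolic, unit) tuples once, at load time."""
    return [
        (round(canonicalize(systolic, unit, target), 1),
         round(canonicalize(diastolic, unit, target), 1))
        for systolic, diastolic, unit in readings
    ]


if __name__ == "__main__":
    # Two sources reporting the same pressure in different units.
    print(ingest([(120, 80, "mmHg"), (16000, 10667, "Pa")]))
```

With something like this in the load path, the monitoring query stays in
its short form; the unit handling lives in one place rather than in
every query and rule.
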
> > I believe the principal counterargument to normalization is that this
> > would be an obstacle to adoption; that e.g. clinics or pharmas who
> > would otherwise be tempted to express their clinical data in CPR would
> > be discouraged by the requirement of input normalization.
>
> Unless I have misunderstood you, it sounds like you think that the use of
> muo:measuredIn and muo:numericalValue *requires* input normalization.  I
> don't think this is the case.  These predicates say nothing about whether or
> not the use of units is homogeneous.

I don't believe that assertions require normalization, but I think that
practical querying and inference, particularly federated query, require
normalization. And if that's the form we want people to "publish", why
provide a structure for the non-normalized form?

> > I think that
> > group is vanishingly small, especially if they face heterogeneous data
> > and have to normalize anyways. It's possible that the arguments for
> > homogeneous data (no query/inference-time normalization, trivial
> > federation, etc.) are too subtle to persuade the above group, but I
> > think the clinical web will be much better off if we can eliminate
> > redundant flexibility.
> >
> > ¹ Chimezie, what do you think of this imposition on CPR?
>
> I don't think there is any imposition at all, especially if you use the
> convention where there are separate predicates that relate the unit and the
> value.  If anything, the approach using overloaded predicates discourages
> heterogeneous use of units, because people who query such datasets and
> compose ontologies using these terms will then be faced with a proliferation
> of terms.  Whereas, even in a dataset with a heterogeneous set of units (for
> the same kinds of measures), the way you write queries involving measured
> data and the inferences involved are the same.

We may both like normalization; I just want to make it required so we
have the easiest time merging data from e.g.
different clinics and labs, where that heterogeneity wouldn't be in
the custodians' faces until merge time.

> -- Chime

# Find patients with a BP increase of more than 20mmHg over one month:
SELECT ?fn ?ln ?date2 ?sy2 ?di2 {
  ?p :givenName ?fn ; :familyName ?ln .
  ?vis :patient ?p ; :date ?date1 ;
       :screening [ :systolicBP ?sys ; :diastolicBP ?dia ] .
  ?vi2 :patient ?p ; :date ?date2 ;
       :screening [ :systolicBP ?sy2 ; :diastolicBP ?di2 ] .
  FILTER ((?sy2 - ?sys > 20 || ?di2 - ?dia > 20) &&
          ?date2 - ?date1 > "P30D"^^xsd:dayTimeDuration )
}

# Same, while fielding 4 units. The factors map the sample readings quoted
# at the top of this message onto mmHg: 12/8 x 10, 16/11 x 7.5 and
# 16000/10667 x 0.0075 all come to roughly 120/80.
SELECT ?fn ?ln ?date2 ?sy2 ?di2 {
  ?p :givenName ?fn ; :familyName ?ln .
  ?vis :patient ?p ; :date ?date1 ; :screening ?pair .
  {
    ?pair :systolicBP  [ :value ?sys ; :units ?usys ] ;
          :diastolicBP [ :value ?dia ; :units ?udia ] .
    FILTER (?usys=u:mmHg && ?udia=u:mmHg )
  } UNION {
    SELECT ?pair (?s*10 AS ?sys) (?d*10 AS ?dia) {
      ?pair :systolicBP  [ :value ?s ; :units ?usys ] ;
            :diastolicBP [ :value ?d ; :units ?udia ] .
      FILTER (?usys=u:dmHg && ?udia=u:dmHg )
    }
  } UNION {
    SELECT ?pair (?s*7.5 AS ?sys) (?d*7.5 AS ?dia) {
      ?pair :systolicBP  [ :value ?s ; :units ?usys ] ;
            :diastolicBP [ :value ?d ; :units ?udia ] .
      FILTER (?usys=u:MPa && ?udia=u:MPa )
    }
  } UNION {
    SELECT ?pair (?s*0.0075 AS ?sys) (?d*0.0075 AS ?dia) {
      ?pair :systolicBP  [ :value ?s ; :units ?usys ] ;
            :diastolicBP [ :value ?d ; :units ?udia ] .
      FILTER (?usys=u:Pa && ?udia=u:Pa )
    }
  }
  ?vi2 :patient ?p ; :date ?date2 ; :screening ?par2 .
  { # Report units are mmHg
    ?par2 :systolicBP  [ :value ?sy2 ; :units ?usy2 ] ;
          :diastolicBP [ :value ?di2 ; :units ?udi2 ] .
    FILTER (?usy2=u:mmHg && ?udi2=u:mmHg )
  } UNION {
    SELECT ?par2 (?s2*10 AS ?sy2) (?d2*10 AS ?di2) {
      ?par2 :systolicBP  [ :value ?s2 ; :units ?usy2 ] ;
            :diastolicBP [ :value ?d2 ; :units ?udi2 ] .
      FILTER (?usy2=u:dmHg && ?udi2=u:dmHg )
    }
  } UNION {
    SELECT ?par2 (?s2*7.5 AS ?sy2) (?d2*7.5 AS ?di2) {
      ?par2 :systolicBP  [ :value ?s2 ; :units ?usy2 ] ;
            :diastolicBP [ :value ?d2 ; :units ?udi2 ] .
      FILTER (?usy2=u:MPa && ?udi2=u:MPa )
    }
  } UNION {
    SELECT ?par2 (?s2*0.0075 AS ?sy2) (?d2*0.0075 AS ?di2) {
      ?par2 :systolicBP  [ :value ?s2 ; :units ?usy2 ] ;
            :diastolicBP [ :value ?d2 ; :units ?udi2 ] .
      FILTER (?usy2=u:Pa && ?udi2=u:Pa )
    }
  }
  FILTER ((?sy2 - ?sys > 20 || ?di2 - ?dia > 20) &&
          ?date2 - ?date1 > "P30D"^^xsd:dayTimeDuration )
}

--
-ericP
Received on Friday, 10 September 2010 20:09:24 UTC