Re: What should we call RDF's ability to allow multiple models to peacefully coexist, interconnected? from David Booth on 2014-03-08 (semantic-web@w3.org from March 2014)

From: David Booth <david@dbooth.org>
Date: Sat, 08 Mar 2014 18:30:59 -0500
To: "Timothy W. Cook" <tim@mlhim.org>
CC: Alan Ruttenberg <alanruttenberg@gmail.com>, semantic-web <semantic-web@w3.org>
Message-ID: <531BA833.5070801@dbooth.org>
Hi Tim,

On 03/08/2014 04:36 PM, Timothy W. Cook wrote:
> A very interesting and I think, foundational discussion.  David, thanks
> for bringing it up.
> Below is a discussion of why I believe that RDF should be considered a
> layer over data models or maybe as 'semantic glue'.
>
>   David, we are working on the same type of problem but from slightly
> different perspectives.  The presentation that you linked to re:KnowMED,
> is very important and I recall seeing it before.  I'll take this
> opportunity to comment on it since it is in the context of this
> discussion.  It indicates that you propose RDF as a language to be used
> in the exchange of healthcare data.  Then on slide #5 you say it isn't
> enough to 'get us there'.  So I am not sure how much of this is
> marketing swagger and how much is hard fact.

No marketing swagger intended on that slide.  The reason I said that RDF 
is not enough to get us there (i.e., to semantic interoperability) is 
because the universal use of RDF *alone* is not enough to ensure 
semantic interoperability, for multiple reasons.  The basic reason is 
that semantic alignment is needed for semantic interoperability: the 
receiver of some RDF will not be able to make effective use of that RDF 
unless it is semantically aligned with the models and vocabularies that 
it understands.  RDF alone does not give you that semantic alignment, 
but it does provide a common substrate language that *facilitates* 
semantic alignment, both because different models and vocabularies can 
be intermixed without conflict, and because it is amenable to inference, 
which can allow data in one model/vocabulary to be used as though it 
were data in another model/vocabulary.
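
A minimal sketch of that last point, in plain Python with triples as tuples rather than a real RDF library.  It reuses the v1/v2 blood pressure vocabularies from the illustration quoted at the bottom of this message.  The rule itself, the function names, and the "ex2:derived_" node naming are assumptions made up for illustration, not part of any standard:

```python
# A graph is just a set of (subject, predicate, object) tuples.
V1_DATA = {
    ("ex:patient319", "foaf:name", "John Doe"),
    ("ex:patient319", "v1:bps", "ex1:bp_023"),
    ("ex1:bp_023", "rdf:type", "v1:SystolicBPSitting_mmHg"),
    ("ex1:bp_023", "v1:value", 120),
}


def infer_v2(graph):
    """Hypothetical rule: derive v2-style triples from v1 sitting systolic readings."""
    inferred = set()
    for s, p, o in graph:
        if p == "rdf:type" and o == "v1:SystolicBPSitting_mmHg":
            # Name for the derived v2 node (this naming scheme is an assumption).
            node = s.replace("ex1:", "ex2:derived_")
            inferred.add((node, "rdf:type", "v2:SystolicBP"))
            # The v1 class name bakes in units and position; v2 states them explicitly.
            inferred.add((node, "v2:units", "v2:mmHg"))
            inferred.add((node, "v2:bodyPosition", "v2:sitting"))
            for s2, p2, o2 in graph:
                if s2 == s and p2 == "v1:value":
                    inferred.add((node, "v2:pressure", o2))
                if o2 == s and p2 == "v1:bps":
                    inferred.add((s2, "v2:bps", node))
    return inferred


def v2_pressures(graph):
    """What a v2-only application would ask for: (patient, pressure) pairs."""
    return sorted(
        (patient, value)
        for patient, p, node in graph if p == "v2:bps"
        for n, p2, value in graph if n == node and p2 == "v2:pressure"
    )


# Both models now sit side by side in one merged graph.
merged = V1_DATA | infer_v2(V1_DATA)
print(v2_pressures(merged))
```

The v1 triples are untouched by the merge, so a v1-only application keeps working against the same data while a v2-only application sees the derived view.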

>   On slide #8 item #2 we are 100% in agreement.  But then on slide #9
> you are mixing apples and oranges.  XML and RDF have two different
> purposes that work well together.

I certainly agree that XML and RDF can work well together, and I 
partially agree that comparing them is like mixing apples and oranges. 
I also agree that XML and RDF *should* have different purposes.  But a 
common erroneous assumption that I have seen from many who come from an 
XML background -- before they really understand RDF -- is to assume that 
XML and RDF have the *same* (or similar) purposes.  They are accustomed 
to using their XML hammers, so when they run into an information 
integration problem, they attempt to use XML for that too, and that is 
*not* a good use of XML, whereas it *is* a great use of RDF.  So the 
point of that comparison with XML was to highlight this difference in 
the way XML and RDF are normally used: XML being schema-centric, and RDF 
allowing multiple models to coexist.  Obviously XML *can* be used 
differently -- the RDF/XML syntax is proof enough of that.  But in terms 
of the way XML is typically conceived and used, I have found that it is 
typically very schema-centric: the designer decides on *the* data model, 
documents it in an XML Schema definition, etc.  And that's qualitatively 
different from the RDF world, in which multiple models routinely coexist.

>   On further slides, your Blue, Green and Red customers exactly indicate
> what I mean by RDF being an essential layer on top of multiple models.
>   What happens further in the presentation is where we disagree.  You
> assert that RDF should be the language used to actually 'exchange' data.

Yes, I think it is fine for exchanging parties to use whatever they want 
internally, but the information should be *exchanged* in RDF.

HOWEVER, I do not mean that healthcare information needs to be exchanged 
in a native RDF *format* like Turtle, N-Triples, etc.  It is fine if it 
uses any standard data format (HL7, XML, whatever), provided that there 
is a standard *mapping* from that format to the RDF model, so that the 
**information content** is still RDF (even if it doesn't *look* like RDF).
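
As a rough sketch of what such a mapping could look like, here is a plain Python example that turns a small XML message into v2-style triples.  The XML element and attribute names, the blank-node naming, and the mapping function are all hypothetical, invented for this illustration.  A real mapping would be defined by whoever standardizes the format:

```python
import xml.etree.ElementTree as ET

# Hypothetical wire format: these element/attribute names are invented.
MESSAGE = """
<observation patient="patient319">
  <systolic units="mmHg" position="sitting">120</systolic>
</observation>
"""


def xml_to_triples(xml_text):
    """Apply a fixed (assumed) mapping from the XML message to RDF-style triples."""
    root = ET.fromstring(xml_text.strip())
    patient = "ex:" + root.get("patient")
    triples = set()
    for i, reading in enumerate(root.findall("systolic")):
        node = f"_:bp{i}"  # blank-node style label for each reading
        triples.add((patient, "v2:bps", node))
        triples.add((node, "rdf:type", "v2:SystolicBP"))
        triples.add((node, "v2:pressure", int(reading.text)))
        triples.add((node, "v2:units", "v2:" + reading.get("units")))
        triples.add((node, "v2:bodyPosition", "v2:" + reading.get("position")))
    return triples


for triple in sorted(xml_to_triples(MESSAGE), key=str):
    print(triple)
```

The sender never has to emit Turtle or N-Triples.  Because the mapping is fixed and shared, the receiver can treat the XML message as if it were RDF.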

> This is where RDF and the tools around it (AFAIK) are not mature enough to
> perform.  Several times you have mentioned 'semantics and not syntax'.
> This is a huge mistake.  You must have both in order to ensure data
> quality and meaning.

Yes, you cannot get to the semantics without the syntax.  But the point 
is that, historically, only the syntax was addressed.  The semantics 
were ignored and left to chaos.   So yes, I agree that we need both, but 
I think it's important to place the emphasis on the semantics.  The 
syntax is just a necessary detail.

> Secondly we know from history that top-down
> consensus in healthcare concept modelling is an impossibility.[1]

Agreed.  And I like your Cavalini-Cook theory!  I don't know if the 
formula is exactly right, but the basic idea is spot on, regarding the 
difficulty in creating standards as the committee size grows.

This is a key reason why RDF can help, because it accommodates both 
standards *and* innovation, plus it supports inference, which can be 
used to achieve semantic alignment post facto.

>   In your post describing the BP screenshot you said:
>   "Thus, although ex1:bp_023 and ex2:bp_409 capture the same blood
> pressure information, they represent that information differently.
>   Nonetheless, both representations can peacefully coexist in the same
> merged RDF data without conflict, which might happen, for example, if
> one is derived from the other through inference."
> I take this to mean that you are representing the exact same BP
> measurement data in two different ways?

Yes, in that illustration.

> Your use case, 'by inference'
> is a little fuzzy for me.  If it is derivation by inference, it will
> just be an in memory representation and not persisted; correct?

Inferences can be persisted or not, as desired.

> Regardless, the existence of the same data instance, in the same
> application is in complete contradiction to good data quality
> management.  As you go on to explain, now you must add application
> intelligence to analyze whether two data instances are the same,
> to avoid counting them as two separate instances.  This
> approach is very dangerous, in addition to adding complexity and cost to
> the applications.

Yes, I totally agree.  The example was not intended to illustrate good 
data management techniques.  It was only intended to illustrate the idea 
that two different data models can exist in the same dataset, 
semantically interrelated.

> However, having the ability to determine if two
> different data instances exactly match the same concept is essential.
>   Minor differences such as the position of the patient (sitting or
> prone) or the type of instrument used to perform the measurement or the
> location on the body (left upper arm or right thigh, etc.) that the
> measurement was taken are all important.  They may or may not rule in or
> out specific measurements, based on the intended use of the query
> results.  This is where RDF is essential: do these two instances point
> to exactly the same code in a controlled vocabulary, etc.?    These
> questions are essential to having the ability to perform machine based
> reasoning over the data repository; whether at the point of care or for
> research purposes.

Entirely agree.  And inference can allow sitting BP measurements to be 
returned when a query does not care whether the patient was sitting or 
standing during the measurement.
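
One way this can work is RDFS-style subclass inference: if the position-specific classes are declared subclasses of a general class, a query for the general class picks up all of them.  The sketch below uses plain Python tuples rather than an RDF library, and the "v:" class names and "ex:bp_N" instances are hypothetical, chosen only for illustration:

```python
# Hypothetical schema: position-specific classes under one general class.
SCHEMA = {
    ("v:SittingSystolicBP", "rdfs:subClassOf", "v:SystolicBP"),
    ("v:StandingSystolicBP", "rdfs:subClassOf", "v:SystolicBP"),
}

DATA = {
    ("ex:bp_1", "rdf:type", "v:SittingSystolicBP"),
    ("ex:bp_2", "rdf:type", "v:StandingSystolicBP"),
    ("ex:bp_3", "rdf:type", "v:HeartRate"),
}


def subclass_closure(graph):
    """Apply until nothing changes: (x type C), (C subClassOf D) => (x type D)."""
    result = set(graph)
    changed = True
    while changed:
        changed = False
        supers = {(s, o) for s, p, o in result if p == "rdfs:subClassOf"}
        for x, p, c in list(result):
            if p != "rdf:type":
                continue
            for sub, sup in supers:
                if sub == c and (x, "rdf:type", sup) not in result:
                    result.add((x, "rdf:type", sup))
                    changed = True
    return result


def instances_of(graph, cls):
    """Return the sorted subjects explicitly typed as cls in the given graph."""
    return sorted(s for s, p, o in graph if p == "rdf:type" and o == cls)


raw = SCHEMA | DATA
closed = subclass_closure(raw)
print(instances_of(raw, "v:SystolicBP"))     # before inference
print(instances_of(closed, "v:SystolicBP"))  # after inference
```

A query that does care about position can still ask for v:SittingSystolicBP directly, since the original type triples are preserved.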

> Referring back for a moment, to 'the same data instance' situation.  It
> is essential to have additional information (meta-data) to determine if
> two instances are exactly the same.

Agreed.  The example I gave did not illustrate that, but I completely agree.

> This can legitimately occur
> during aggregation for research or systemic quality analysis.  Unique
> patient identifiers along with datetime stamps are ideal.  However, the
> patient identifier issue is an ongoing problem that is actually
> implementation context and application specific.  It is outside of the
> context of data quality and management.
>   Slide #22 clearly indicates that there is an expectation that RDF is
> used as a common format.

No, that's not the intended reading of that slide.  Maybe I need to 
clarify it.  As explained above, *any* exchange format is okay -- 
including native RDF formats -- provided that it has a standard 
mapping to the RDF model, so that the exchanged **information content** 
is understandable in RDF.

> However, as I said earlier, the current
> implementation of RDF is not robust enough to perform this function,
> UNLESS, there is a global expert consensus on all healthcare concepts so
> that models may be created and distributed from a central authority.

That is the role of standards organizations, producing standards like 
SNOMED, LOINC, RXNORM, etc.  And one of the great things about RDF is 
that it can leverage standards when they are available (by mapping to 
RDF), but still allow innovation in areas where standardization has not 
been achieved.  In essence, it de-couples the substrate language (RDF) 
from the data models and vocabularies.

>   This is simply unrealistic as history has shown and is formalized in
> the Cavalini-Cook theory [1].
> The reason that I state that RDF is not capable, at this point of
> maturity, is that it doesn't support the ability to represent syntactic
> structures in a multi-level model environment.  IOW: There is no ability
> (AFAIK) to express a common reference model and then derive concept
> models that impose further constraints.  A multi-level model approach is
> essential in order to abstract the syntax and semantics of each concept
> out of the application source code and repository schemas so that they
> can be shared between disparate applications.  This is what provides for
> full syntactic and semantic interoperability.

A multi-level model approach may indeed help.  And a nice thing about 
RDF is that it lets the market decide what models and vocabularies to use.

> A multi-level model approach may or may not be useful in many domains.
>   Specifically, human engineered domains that we fully understand can be
> modeled as one level representations.  However, biological domains that
> involve evolutionary complexity are quite different.  Primarily because
> we do not fully understand them so our science and understanding is
> constantly changing.  Additionally, it appears that the data has a much
> longer lifetime of significance than other domains.  Therefore the data
> should be initially captured and represented in a manner that makes it
> as future proof and reusable as possible.

Agreed.  And RDF is great for that.  However, I think the argument for 
use of RDF for *exchange* is much stronger.  That's why I'm pushing for 
that first.

> In healthcare, the most
> semantically rich point of any information is at the point of care.
>   Every point of transition/translation after that will most assuredly
> lose context.  As a brief example; reference ranges for conditions
> change over time.  It is essential that data captured today be expressed
> in the context of today's knowledge, even 20 or more years from now.
>   The concept model around high blood pressure is different than it was
> 10 years ago.
>
> Where RDF shines is that in a syntactic model of a concept designed to
> capture reference ranges and other metadata, it can be used to provide
> external semantic context to that model.  Whether that context exists in
> a controlled vocabulary or even free text documents such as clinical
> guidelines.

Yes, it is good for that too.

>
> In the Multi-Level Healthcare Information Modelling (MLHIM) approach we
> developed a conceptual reference model to provide a basis for software
> implementations. While the MLHIM model doesn't preclude other
> serializations, we found that XML Schema 1.1 does provide the
> prerequisites for implementing both a reference model and concept
> models.

Yes, schema validation is one strength of XML.  FYI, the W3C is 
considering doing work on RDF validation, and last fall ran a workshop 
on the topic, soliciting input:
http://www.w3.org/2012/12/rdf-val/
You might want to track that work if you later decide to consider a more 
RDF-oriented path.

Thanks very much for your comments.

David

> This means that we can have full validation of instance data
> back to the W3C specifications.  The concept models are marked up (via XML
> Schema 1.1 annotations) with RDF, which provides the computable semantic
> links for each model as defined by the modeller.  These models can now be
> created by domain experts (with additional knowledge modelling training)
> so that software developers do not have to interpret the meanings.
> The concept models are now fully detached from any specific
> implementation and can be shared to use for validating instance data in
> the context in which it was recorded.  I believe that this is the
> closest we have to semantic interoperability, to date.  I am of course
> open for discussion and debate on the issue.  I used the acronym 'AFAIK'
> a few times above.  I used this because my last serious attempt to use
> RDF for this purpose was in 2010/2011.  I know that there is a
> continuous maturing process going on.  I believe that there may come a
> day when RDF and OWL can be used exclusively for syntactic and semantic
> representation and reasoning.  But AFAIK, not today.
>   We have a significant number of peer-reviewed publications about MLHIM
> and academic as well as other implementations. I am happy to share those
> with the group or you may peruse the links in my signature line as well
> as www.mlhim.org <http://www.mlhim.org> and the specs are openly
> downloadable from here[2] as a package and as source from here [3].
> We also have  almost 2000 datatypes converted from other modeling
> approaches (such as the NIH CDE browser and HL7 FHIR) into reusable
> complexTypes to be used in concept models.  You can review those as well
> as download some example concept models from here[4].  Free registration
> is required to download the models.
>   Kind Regards,
>   Tim
>   [1]
> https://github.com/mlhim/specs/blob/2_4_3/graphics/cavalini_cook_theory.png
>   [2]
> https://launchpad.net/mlhim-specs/2.0/2.4.3/+download/mlhim-specs-2013-10-15-2.4.3-Release.zip
>   [3] https://github.com/mlhim/
>   [4] http://www.ccdgen.com
>
>
> On Fri, Mar 7, 2014 at 5:00 PM, David Booth <david@dbooth.org> wrote:
>
>     Hi Alan,
>
>
>     On 03/07/2014 12:44 PM, Alan Ruttenberg wrote:
>
>         Can you explain what you mean by "RDF's ability to allow multiple data
>         models to peacefully coexist, interconnected, in the same data" ?
>
>
>     Yes.  Here is an imprecise illustration, on slides 10-17:
>     http://dbooth.org/2013/semtech/slides/03-DavidBooth-rdf-as-universal.pdf
>     (I took some artistic liberties blurring class/instance distinctions
>     in that diagram.)
>
>     And here is a more precise example that cleanly distinguishes
>     classes from instances:
>     http://tinyurl.com/pzsgf7f
>     (I've also attached the same illustration, for offline readers.)
>
>     In this latter example (of a hypothetical systolic blood pressure
>     measurement), the same information is represented according to two
>     different models/schemas/vocabularies/ontologies, v1 (green) and
>     v2 (red).  (I am using the terms model, schema, vocabulary and
>     ontology loosely and somewhat interchangeably here.)
>
>     In the v1 model, the systolic blood pressure is indicated in RDF
>     like this:
>
>        ex:patient319 foaf:name "John Doe" ;
>          v1:bps ex1:bp_023 .
>
>        ex1:bp_023 a v1:SystolicBPSitting_mmHg ;
>          v1:value 120 .
>
>     Whereas in the v2 model, the same information is represented
>     differently, in RDF like this:
>
>        ex:patient319 foaf:name "John Doe" ;
>          v2:bps ex2:bp_409 .
>
>        ex2:bp_409 a v2:SystolicBP ;
>          v2:pressure 120 ;
>          v2:units v2:mmHg ;
>          v2:bodyPosition v2:sitting .
>
>     Thus, although ex1:bp_023 and ex2:bp_409 capture the same blood
>     pressure information, they represent that information differently.
>       Nonetheless, both representations can peacefully coexist in the
>     same merged RDF data without conflict, which might happen, for
>     example, if one is derived from the other through inference.
>
>     Furthermore, the relationship between these classes,
>     v1:SystolicBPSitting_mmHg and v2:SystolicBP, and hence the
>     relationship between the corresponding v1 and v2 instance data, can
>     also be explicitly captured in RDF, as the v1v2:SystolicBP_Transform
>     (yellow) relationship:
>
>        v1:SystolicBPSitting_mmHg v1v2:SystolicBP_Transform v2:SystolicBP .
>
>     Inference rules for v1v2:SystolicBP_Transform could therefore
>     convert a v1:SystolicBPSitting_mmHg measurement to a v2:SystolicBP
>     measurement or vice versa.
>
>     This example only illustrated the case where the transformation from
>     one model to the other is lossless and thus reversible.  Usually
>     that isn't the case.  Relating models and transforming between them
>     is *not* easy, but at least RDF makes it possible to explicitly
>     indicate these relationships.
>
>     Obviously some intelligence must be exercised to avoid, for example,
>     accidentally thinking that ex1:bp_023 and ex2:bp_409 represent two
>     distinct blood pressure measurements, and thereby double counting
>     them, but that's easy enough to do.
>
>     Also, there isn't always a desire to relate or transform between
>     models.  Sometimes some data is related and other data is not, and
>     it is all still merged into the same RDF graph.  In fact, the point
>     may be to connect that part of the data that *is* related and let
>     the rest coexist without being connected (or at least not *directly*
>     connected).
>
>     The point is that these data models can peacefully coexist in RDF
>     data without conflict: applications using the v1 model against the
>     merged data might only see v1 instance data, whereas applications
>     using the v2 model might only see the v2 data.  That's qualitatively
>     different than in the world of XML, for example, where one schema
>     generally wants to be "on top", and when you merge XML of different
>     schemas, you need to create a new "top" schema.  That is the
>     difference that I have so often tried to explain to people outside
>     the RDF community, and what I am trying to capture succinctly in a
>     term or phrase.   It isn't an easy idea to convey to those who are
>     accustomed to a schema-centric approach.  I think a catchy but
>     descriptive term or phrase could help.
>
>     Thanks,
>     David
>
>
>         -Alan
>
>
>         On Fri, Mar 7, 2014 at 11:20 AM, David Booth <david@dbooth.org> wrote:
>
>              I -- and I'm sure many others -- have struggled for years trying to
>              succinctly describe RDF's ability to allow multiple data models to
>              peacefully coexist, interconnected, in the same data.  For data
>              integration, this is a key strength of RDF that distinguishes it
>              from other information representation languages such as XML.   I
>              have tried various terms over the years -- most recently "schema
>              promiscuous" -- but have not yet found one that I think really nails
>              it, so I would love to get other people's thoughts.
>
>              This google doc lists several candidate terms, some pros and cons,
>              and allows you to indicate which ones you like best:
>              http://goo.gl/zrXQgj
>
>              Please have a look and indicate your favorite(s).  You may also add
>              more ideas and comments to it.  The document can be edited by anyone
>              with the URL.
>
>              Thanks!
>              David Booth
>
>
>
>
>
> --
> MLHIM VIP Signup: http://goo.gl/22B0U
> ============================================
> Timothy Cook, MSc           +55 21 994711995
> MLHIM http://www.mlhim.org
> Like Us on FB: https://www.facebook.com/mlhim2
> Circle us on G+: http://goo.gl/44EV5
> Google Scholar: http://goo.gl/MMZ1o
> LinkedIn Profile:http://www.linkedin.com/in/timothywaynecook
Received on Saturday, 8 March 2014 23:31:30 UTC