Re: [COI] Clinical data RDF examples? from Conor Dowling on 2013-02-20 (public-semweb-lifesci@w3.org from February 2013)

From: Conor Dowling <conor-dowling@caregraf.com>
Date: Wed, 20 Feb 2013 09:45:13 -0800
To: "M. Scott Marshall" <mscottmarshall@gmail.com>
Cc: "Pathak, Jyotishman, Ph.D." <Pathak.Jyotishman@mayo.edu>, Alan Ruttenberg <alanruttenberg@gmail.com>, Kerstin Forsberg <kerstin.l.forsberg@gmail.com>, "Eric Prud'hommeaux" <eric@w3.org>, Guoqian Jiang <jiang.guoqian@mayo.edu>, Charlie Mead <charliem22@yahoo.com>, Sajjad Hussain <sajjad.hussain@crc.jussieu.fr>, HCLS <public-semweb-lifesci@w3.org>
Message-ID: <CALfFB19JGf9n6c5c-OZ5ySRMQu4jgUHviMEpQncjdSuvx7+S_Q@mail.gmail.com>
Hello again Scott,

The requirement of the VistA work (http://vista.caregraf.info) is to expose
ALL VistA data as an RDF-described graph, not just carefully curated slivers;
to be fully automated, so no special cleanup; and only then to normalize
and refine the result, either through entailment or a pipeline, to produce
easier-to-digest representations. The native model is namespaced as "*VS*"
== vista schema, the refined model as "*VSN*", "vista schema normalized".

This report (http://vista.caregraf.info/analytics/knowThyVistAData.html)
describes what's exposed in VS and the need to refine that for VSN.

One key aspect of *VS* data: unlike relational data, VistA data is *NOT
normalized at all*. There is no one "Address Table" or "Diagnosis file" and
fields (predicates) are scoped to a file, not globally. The "same"
information is recorded in multiple files (RDF resources) each with their
own fields (predicates).
For example, 243 fields (predicates) record a zip code and 111 record the
national provider identifier (CMS's national U.S. doctor id). In the first
pass/native/VS model, NONE of this is normalized. The "many ways to express"
are all exposed as is, with each field leading to a unique predicate.

The other key point is *how do you broadly distinguish types of data*,
system data (a log of traffic) from institution data (ward in a hospital)
from knowhow (definition of a drug) from what you really want, patient
data? We termed this split *PIKS* (Patient-Institution-Knowhow-System).
Here *crawling the graph* comes into play. As you might imagine, system data
does NOT link to a patient record, directly or indirectly. Nor does a drug
definition or a ward description. *A patient record crawler only takes data
that directly or indirectly refers to the patient record*.
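
A minimal sketch of such a crawl in Python, assuming the graph is held as
plain (subject, predicate, object) tuples. The identifiers and predicate
names below are made up for illustration (they are not actual VS names), and
the real crawler may well work differently. The idea shown is only this:
start from the patient record and repeatedly collect every resource that
refers to something already collected.

```python
from collections import deque


def crawl_patient_record(triples, patient):
    """Return the patient plus every resource that refers to it,
    directly or through a chain of references."""
    # Reverse index: object -> set of subjects that reference it.
    referrers = {}
    for s, _p, o in triples:
        referrers.setdefault(o, set()).add(s)

    seen = {patient}
    queue = deque([patient])
    while queue:
        node = queue.popleft()
        for subject in referrers.get(node, ()):
            if subject not in seen:
                seen.add(subject)
                queue.append(subject)
    return seen


# Hypothetical data: a prescription and its refill reach the patient;
# the drug definition (knowhow), ward (institution) and log (system) do not.
TRIPLES = [
    ("rx:52-1", "vs:patient", "patient:2-24"),
    ("rx:52-1", "vs:drug", "drug:50-1812"),
    ("refill:7", "vs:prescription", "rx:52-1"),
    ("drug:50-1812", "rdfs:label", "WARFARIN (C0UMADIN) NA 5MG TAB"),
    ("ward:42-3", "vs:name", "3 NORTH"),
    ("log:9", "vs:device", "TTY1"),
]

if __name__ == "__main__":
    print(sorted(crawl_patient_record(TRIPLES, "patient:2-24")))
    # ['patient:2-24', 'refill:7', 'rx:52-1']
```

Note that the drug is referenced BY the prescription but does not itself
refer to the patient, so it stays out of the crawl, as do the ward and the
log entry.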

With these two approaches, you get the likes of this:
http://vista.caregraf.info/patients/graphs/vsLuluPatient_24.vdg (*VistA
Data Graph of a Patient* (VDG) from a real system, one with test data,
obfuscated.)

One other point on the VDG. Does it link to anything or is it full of dead
ends? You mentioned SNOMED, but SNOMED's not in VistA. ICD9, however,
obviously is. Effectively, so is RXNORM for drugs, because the VA's national
drug file is in VistA and that scheme links to RXNORM. The way the graph
represents this is to distinguish two types of knowhow (the "K" in PIKS):
stuff that is local to the system, and so doesn't link out to knowhow
elsewhere, and stuff that is a clone of national or standard data. VA
National Drug ids (called VUIDs) and ICD9 are examples of the latter.

In the VDG, you see this as ...

# a local drug definition
<http://vista.lulu.com/50-1812> a <http://datasets.caregraf.org/vs/50>;
    rdfs:label "WARFARIN (C0UMADIN) NA 5MG TAB";
    # links out to VA national drug definition (it in turn links to /rxnorm/...)
    owl:sameAs <http://schemes.caregraf.info/va/4013990> .


Let me emphasize that there is no special coding here. FMQL (see the writeup
at http://vista.caregraf.info/fmql), a lightweight, SPARQL-like projection of
VistA's noSQL store, has a table telling it what is national/standard and
what is local "knowhow", and its export form can then produce the likes of
the above.
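
To make the table-driven idea concrete, here is a small Python sketch. It is
NOT FMQL's actual code or table format; the table layout, the function name
and its parameters are all assumptions for illustration. It only shows how a
lookup from file type to national scheme could decide whether a record gets
an owl:sameAs link or stays local.

```python
# Hypothetical table: VistA file number -> base URI of the national or
# standard scheme its records clone. Files not listed are local knowhow.
NATIONAL_SCHEMES = {
    "50": "http://schemes.caregraf.info/va/",
}


def same_as_triple(local_uri, file_number, national_id):
    """Return a Turtle owl:sameAs statement for a record that clones
    national data, or None if the record is purely local."""
    base = NATIONAL_SCHEMES.get(file_number)
    if base is None or national_id is None:
        return None
    return f"<{local_uri}> owl:sameAs <{base}{national_id}> ."


if __name__ == "__main__":
    print(same_as_triple("http://vista.lulu.com/50-1812", "50", "4013990"))
```

With the values from the drug example, this yields the same owl:sameAs
statement as in the VDG excerpt; a record in a file that is not in the table
simply gets no link.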

So the VS-based VDG gives you every piece of data, fully and unambiguously
defined, and those graphs link out wherever the system's knowhow links out.
But fully == messy. We want neater, VSN graphs.

Two quick normalizations to produce that ...
1) in the VS OWL definition, we declare that things like vs:zip4-2 (the zip
predicate of a VS resource type) are subproperties (rdfs:subPropertyOf) of
vsn:zip.
2) we collapse owl:sameAs indirection. So in the above example of a drug, if
resource X refers to <http://vista.lulu.com/50-1812> we would replace that
reference in VSN with <http://datasets.caregraf.org/va/4013990>.
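
As a rough illustration of the pipeline flavor of these two steps, here is a
Python sketch over plain (subject, predicate, object) tuples. The function
name, the sample triples and the vs:drug predicate are assumptions for
illustration, and the national URI is the one from the owl:sameAs line in
the drug example earlier in this message. Note that real entailment would
ADD the vsn:zip triple alongside the vs one; this sketch replaces it, which
is closer to the pipeline option.

```python
def normalize(triples, subproperty_of, same_as):
    """Rewrite VS triples toward a VSN-style form."""
    normalized = []
    for s, p, o in triples:
        p = subproperty_of.get(p, p)  # 1) lift to the shared super-property
        o = same_as.get(o, o)         # 2) collapse owl:sameAs indirection
        normalized.append((s, p, o))
    return normalized


# Mappings that would come from the VS OWL definition and the VDG itself.
SUBPROPERTY_OF = {"vs:zip4-2": "vsn:zip"}
SAME_AS = {
    "http://vista.lulu.com/50-1812": "http://schemes.caregraf.info/va/4013990",
}

# Hypothetical VS triples (subjects, zip value and vs:drug are made up).
VS_TRIPLES = [
    ("http://vista.lulu.com/2-24", "vs:zip4-2", "94110"),
    ("http://vista.lulu.com/52-1", "vs:drug", "http://vista.lulu.com/50-1812"),
]

if __name__ == "__main__":
    for triple in normalize(VS_TRIPLES, SUBPROPERTY_OF, SAME_AS):
        print(triple)
```

Predicates and objects without an entry in either mapping pass through
unchanged, so the sketch can be run over a whole VDG without special cases.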

With this, we end up with a much smaller set of predicates and remove local
knowhow. This is still a VistA model, but the work now is *semantic
analysis, which is what you want: real questions for analysts*.

One example of such analysis is VistA's Prescription model. A Prescription
in VistA implicitly embeds one dispensation, so you won't see dispense
descriptions if the Prescription is only filled once. That's different from
most of the ontologies out there. You can't divine this. You have to know it.

Finally, let me say that this approach is dramatically quicker than the
existing approach, which involves hand-crafting custom data extractors for
slivers of VistA data. BUT let me add: *that old, laborious way still
dominates inside the VA!*

I hope this isn't too much detail,
Conor

On Wed, Feb 20, 2013 at 5:52 AM, M. Scott Marshall <mscottmarshall@gmail.com> wrote:

> Hello Conor, All,
>
> [I started to send this request to Jyoti, Conor, and Alan but realized
> that this is probably interesting to others in HCLS such as Eric and
> Charlie so now CCing the mailing list. In a sense, this relates to
> Kerstin's efforts to create OWL/RDF versions of CDISC standards, which
> should also involve representations of patient information, but I
> suppose that my request is more oriented at the EHR level. Ideally,
> however, there should be identical triples for certain data, e.g.
> [patient hasAge Age] that can be used to match a patient with the
> eligibility criteria of a trial.  ]
>
> Just a quick question: I am hoping that you could share some of your
> OWL/RDF choices (schema) with me that you made with your OWLization of
> XXX (e.g. OpenVISTA).  So, I am looking for some choices of
> namespaces, coding systems (NCI vs. SNOMED etc.), predicates (patient
> hasDisease <disease>), etc. Even just a paste of some RDF for one
> patient record that contains (non-identifying) demographics, type of
> cancer, TNM status, etc. would be helpful, especially of an example
> cancer patient. Could you send something like that? I would appreciate
> any form of input, on or 'off list'.
>
> SWAT4LS this year fed in to a few interesting developments/connections
> related to clinical care and clinical research, among them:
>
> 1) SWAT4LS Paris Nov 2012:
> http://www.w3.org/wiki/HCLS/SWAT4LS2012/Hackathon
>
>
> http://www.w3.org/wiki/HCLS/SWAT4LS2012/Hackathon#CDA_and_Clinical_Trial_Protocols_to_RDF
>
> 2) EHR4CR hosted a meeting in Paris on Jan. 22:
>
> http://www.w3.org/wiki/HCLS/ClinicalObservationsInteroperability/Convergence
>
> The EHR4CR convergence meeting (2 above) will be followed up with
> discussions in COI teleconferences which I hope that you can also
> attend (Eric has kindly agreed to take us under his wing in COI).
> Charlie, who was present at 2, also enthusiastically supports
> continued discussion in this area. There will be an even larger
> convergence meeting held in March, organized by a EURECA colleague
> from EuroRec. Eventually (soon?), a merged summary report of the
> Convergence meeting will be shared on the wiki.
>
> Kind regards,
> Scott
>
> --
> M. Scott Marshall, PhD
> MAASTRO clinic, http://www.maastro.nl/en/1/
> http://eurecaproject.eu/
> https://plus.google.com/u/0/114642613065018821852/posts
> http://www.linkedin.com/pub/m-scott-marshall/5/464/a22
>
Received on Wednesday, 20 February 2013 17:45:48 UTC