Re: Project with D2R

* John Madden <madden.jf@gmail.com> [2008-06-23 23:16-0400]
> Hi Eric et al,
>
> I'm pleased you asked about the project with D2R. I'll describe it. I  
> have no idea whether it will get funded. It's not nearly as ambitious a 
> project as the one COI is proposing, but it may be of interest.

COI doesn't strike me as too ambitious. I think that completing the
end-to-end test would demonstrate the capacity of RDF-related tools to
meet the use case. It shows the versatility of RDF graph mappings, and
thereby some utility in treating your data as RDF.

The use cases don't specifically require node mapping, just arcs. I
find node mapping harder to do reversibly and am playing with
different expressions, e.g. an onto regexp mapping in SPARQL:

map:subst-onto(?implPatient, ?ifacePatient, 
  "http://impl.example/billing/((:PAT:subject)|(:PHY:physician))s/(:NO:[0-9]+)",
  "http://mgh.org/AllIds/((:PAT:patient)|(:PHY:doctor))-(:NO:[0-9]+)#info")

(The association of :NO: is easily performed, but mapping a :PHY:
 physician to doctor and back again requires enumeration of the
 disjunction options).
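
To spell out that enumeration, here's a rough sketch of the same
mapping using richer SPARQL builtins (VALUES/BIND/REPLACE/IRI) than
the hypothetical map:subst-onto above assumes; the VALUES on
?implPatient just stands in for whatever the surrounding query would
bind:

  SELECT ?ifacePatient
  WHERE {
    # stand-in for the binding the rest of the query would provide
    VALUES ?implPatient { <http://impl.example/billing/physicians/42> }

    # enumerate the role-name correspondences the regexp can't infer
    VALUES (?implRole ?ifaceRole) {
      ("subject"   "patient")
      ("physician" "doctor")
    }

    # pull the role and the number out of the implementation-side URI
    BIND (REPLACE(STR(?implPatient),
          "^http://impl.example/billing/([a-z]+)s/([0-9]+)$", "$1") AS ?role)
    BIND (REPLACE(STR(?implPatient),
          "^http://impl.example/billing/([a-z]+)s/([0-9]+)$", "$2") AS ?no)
    FILTER (?role = ?implRole)

    # reassemble the interface-side URI
    BIND (IRI(CONCAT("http://mgh.org/AllIds/", ?ifaceRole, "-", ?no, "#info"))
          AS ?ifacePatient)
  }

The same VALUES table serves the reverse direction; you just
deconstruct the interface URI instead.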

D2R has ways to do this; I'd like to pick your brain about it.

> Our use case is also Clinical Trials. The scenario we're targeting is,  
> you have one or more Clinical Trials Management Organizations (CTMOs),  
> each managing one or more clinical trials, enrolling patients at  
> multiple clinical sites. At any given site, multiple trials and multiple 
> CTMOs may be active simultaneously or sequentially. A given patient may 
> participate simultaneously or sequentially in more than one trial.
>
> Each trial has its own data requirements, which consist of about 80%
> participant data that will be prospectively collected in the course of
> the trial using dedicated, trial-specific data entry instruments  
> (trial-specific forms, screens, etc.). But about 20% of the data is  
> retrospective patient data needed to establish baseline demographics and 
> "pre-existing conditions" (which is clinical trials lingo for "what other 
> diseases/conditions does the patient have?") etc. For a given patient, 
> this data would be pretty much the same from trial to trial.

I guess there's always the slim possibility that some treatment will
kill or cure the patient...

> This retrospective data "lives" in existing electronic medical records  
> stored in existing clinical data stores (almost always relational DBs  
> that are the backends to the clinic/hospital electronic medical record  
> (EMR) application/system). Mostly, these have proprietary table designs 
> that are specific to the EMR vendor. Although these designs may sometimes 
> exploit standard coding systems as table keys, just as often they use 
> entirely vendor/system-local keys.

Sorry, not following this. Is a table key just a regular relational
key? If so, how is a vendor-local key different? Is it invisible
through the API?

> Traditionally, when patients have been enrolled into a trial the CTMO  
> sends a data management specialist to each site, who works with the  
> local DB administrator to develop an extract-transform-load (ETL)  
> scenario to get the required data out of the local DB into the trials  
> DB.
>
> But wouldn't it be much nicer if a trial administrator for a given CTMO 
> could treat all the site-specific databases as if they were a single 
> large database with a  common "virtual schema", and then he/she could 
> formulate the ETL scenario just once (it wouldn't be ETL anymore then— it 

What sort of T (transform) happens here? How much of it depends on manual coding?

> would just be a query). Even better, if all the CTMO's could form a 
> consortium and agree to treat all the site-specific databases according 
> to a single, shared "virtual schema", they all could benefit.

Need to convince them that others will do it and leave them behind.
Industry leaders are seldom inspired to reduce vendor lock-in.

OTOH, RDF is a good choice for extensible shared schemas. May be
easier for them to swallow exchange using a data model that is more
atomic than SQL, and has clearly defined extension/partial
interpretation semantics.
Do you know anyone to try this pitch on?
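
For instance (namespaces and codes invented), two sites can describe
the same patient independently, and the shared view is just the union
of their graphs; nobody has to migrate a schema:

  @prefix demog:  <http://example.org/niche/demographics#> .
  @prefix cond:   <http://example.org/niche/conditions#> .
  @prefix snomed: <http://example.org/snomed/> .

  # site A's contribution
  <http://mgh.org/AllIds/patient-123#info>
      demog:birthYear "1950" ;
      demog:sex       demog:Male .

  # site B's contribution, published later with its own terms
  <http://mgh.org/AllIds/patient-123#info>
      cond:preExisting snomed:DiabetesMellitusType2 .

  # the merge is simply both sets of triples; neither publisher
  # constrains or breaks the other's extensions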

> Important: This schema would cover only items of specific interest in  
> this application, namely (a) demographic information and (b) pre- 
> existing conditions. Thus, it would be what Vipul called a "niche  
> ontology".

How about proposing it as a niche profile, meaning that it uses (or
> happens to use, if you prefer to understate this point) terms from a
larger, more integrated ontology?
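
Something like this (made-up namespaces): the profile declares its
handful of terms but anchors each one to the bigger ontology, so it
stays niche without being an island:

  @prefix rdfs:  <http://www.w3.org/2000/01/rdf-schema#> .
  @prefix niche: <http://example.org/trials/niche#> .
  @prefix big:   <http://example.org/big-clinical-ontology#> .

  # each niche-profile term is defined by reference to the larger ontology
  niche:preExistingCondition
      rdfs:subPropertyOf big:hasComorbidCondition ;
      rdfs:label "pre-existing condition (trial-baseline sense)" .

  niche:birthDate rdfs:subPropertyOf big:dateOfBirth .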

> The consortium structure is ideal for a web-based implementation,  
> because then there's no overhead in setting up your infrastructure. You 
> could, of course, formulate a virtual RDB schema to support queries, but 
> why not take advantage of the web-friendliness and extensibility of 
> RDF/OWL to specify your shared vocabulary, and formulate your queries in 
> SPARQL.

A major strike against SQL for this is that it has no global
identifiers; you end up encoding URLs in SQL name-friendly characters:
  http_c_s_sexample_dorg_ssome_spath_hsome__fragment

> A participating CTMO would transmit its SPARQL to participating sites  
> (In a variant--and cooler--scenario, CTMO's don't even need to know who 
> the participating sites are; the sites control their own participation by 
> subscribing through a proxy to a feed that carries the query stream).

How would they advertise their available info?
                                          # expressivity  graph description
  <siteX> :supports :CTMO_level_1 .       # turtle        by identifier
vs.
  <siteX> :supports {
      ?patient :participatesIn ?study .   # n3            by graph pattern
      ?patient :chronic ?snomed-disease }.
vs.
  <siteX> :supportsSPARQL " ...
      ?patient :participatesIn ?study .   # sparql        by graph pattern
      ?patient :chronic ?snomed-disease ".

Because I have a SPARQL parser already, I'm using the 3rd form to
transform data from the implementation schema to the interface
schema. Sharing that same graph pattern for the interface schema with
the world tells them what data they can ask me about.
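
Concretely (impl-side vocabulary invented), the advertised pattern can
double as the head of the mapping rule:

  PREFIX :     <http://example.org/trials/niche#>
  PREFIX impl: <http://impl.example/billing/schema#>

  # the advertised interface pattern, reused as the head of a rule
  CONSTRUCT {
    ?patient :participatesIn ?study .
    ?patient :chronic        ?snomedDisease .
  }
  WHERE {
    # implementation-schema patterns, whatever the local tables expose
    ?enrollment impl:enrollee ?patient ;
                impl:trial    ?study .
    ?patient    impl:problemListCode ?snomedDisease .
  }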

> But what to do when the SPARQL hits the clinical sites? You install at  
> each participating site a D2R server that listens for queries. For each 
> site you have created a D2R mapping file covering RDF-to-SQL translation 
> over the fairly limited shared vocabulary of interest (demographics, 
> maybe 50 concepts; pre-existing conditions, maybe 100 concepts). Creation 
> of the mapping file is manual, labor-intensive and unique to each site, 
> but only needs to be done once.

SPASQL has effectively the same cost, but (so far) uses graph pattern
⇔ graph pattern rules instead of SQL ⇔ graph pattern rules (the final
mapping to relational comes by prefixing the relations/attributes with
a stem URI).
  impl graph is a function of the RDB and a stem URI
  interface graph = impl2interface rules(impl graph)
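
To make the stem-URI step concrete (table and column names invented):
a row of a Patients table under the stem <http://site.example/db/>
surfaces as something like

  # one node per row, one predicate per column, both minted by
  # prefixing relation/attribute names with the stem URI
  <http://site.example/db/Patients/5678>
      <http://site.example/db/Patients#familyName>  "Doe" ;
      <http://site.example/db/Patients#birthDate>   "1950-02-03" ;
      <http://site.example/db/Patients#physicianID> <http://site.example/db/Physicians/42> .

and the impl2interface rules then rewrite those stem-prefixed
predicates into the shared interface vocabulary.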

How does this expressivity compare with D2R? Will I have to abandon my
scheme and use something less general? Dunno. I'd like your help in
figuring that out.

> The participating site agrees to let the D2R engine be a database user  
> with appropriate privileges (but what is "appropriate"? -- this is very 
> tricky. Since Mr. D2R will be transmitting the data it pulls on to 
> third parties (the CTMO's), this gets to difficult policy concerns 
> regarding specificity of consent, and is a separate part of our 
> proposal).

I've been toying with associating various interface maps with
principals. The query engine can fire the rules authorized by your
request credentials, including rules with caveats in them ( FILTER
(?user = "bob") ). Again, I'd like to geek out over real use cases and
see if this works.
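
Here's the kind of caveat I mean (same invented vocabulary as above;
?user is assumed to be pre-bound by the engine from the credentials
presented with the request):

  PREFIX :     <http://example.org/trials/niche#>
  PREFIX impl: <http://impl.example/billing/schema#>

  # this rule only contributes triples when the credential check passes
  CONSTRUCT {
    ?patient :chronic ?snomedDisease .
  }
  WHERE {
    ?patient impl:problemListCode ?snomedDisease .
    FILTER (?user = "bob")
  }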

> Another problem: Some of the pre-existing conditions data is not  
> actually going to be represented in granular form in most clinical DBs; 
> it will more often be embedded someplace in text blobs (e.g. problem 
> lists, discharge summaries, etc.). So you may need to perform some tricks 
> in formulating your SQL queries by incorporating some text searching, 
> pipelining in some natural language processing steps, or worst case 
> maintaining auxiliary full-text indexes on the object database.

Weak! I guess this is the cost of trying to automate new problems:
the original data is not already tailored to extract the info you need.

> We never thought about rule-based mappings, as in the COI proposal;  
> that's something I'd have to think about. It seems like a good idea. It 
> would involve heavier-weight inferencing than I had anticipated. In our 
> scenario, we would have limited ourselves to mappings of the type  
> natively accommodated in the D2R Map language, which are rather simple  
> correspondences. We assumed the results would be "approximate" and  
> require some degree of manual post-processing. With rules on board, in  
> addition to the server and a basic query translation engine, a policy  
> enforcement module, and a text processor, you'd also have to be running a 
> rules engine. Still, it might not matter much; system load might be a few 
> dozen queries a day, but definitely not a few dozen a second!!!!

Actually, I'm hoping that the parse time and the query mapping time
will still be insignificant. I have an implementation of SPASQL that
has basically no performance overhead (some queries parse faster in
SQL, others in SPARQL), but that impl parses directly to an execution
plan (a set of binary objects representing joins and constraints that
the RDB executes like any other compiled query). The current version
produces an intermediate SPARQL compile tree and then fires each rule
once. This is still very cheap (the cost of a couple of ECA rules
instead of real chained inference), but breaking up and ordering the
rules to ensure completeness may prove frustratingly tedious.

> Anyway, that's the basic idea.

Tx a zillion for sharing this. I wonder if I can get a SPASQL
deployment in the works to add to the chaos.

> John
>
>
>

-- 
-eric

office: +1.617.258.5741 32-G528, MIT, Cambridge, MA 02144 USA
mobile: +1.617.599.3509

(eric@w3.org)
Feel free to forward this message to any list for any purpose other than
email address distribution.
