Project with D2R

Hi Eric et al,

I'm pleased you asked about the project with D2R. I'll describe it. I  
have no idea whether it will get funded. It's not nearly as ambitious  
a project as the one COI is proposing, but it may be of interest.

Our use case is also Clinical Trials. The scenario we're targeting is,  
you have one or more Clinical Trials Management Organizations (CTMOs),  
each managing one or more clinical trials, enrolling patients at  
multiple clinical sites. At any given site, multiple trials and  
multiple CTMOs may be active simultaneously or sequentially. A given  
patient may participate simultaneously or sequentially in more than  
one trial.

Each trial has its own data requirements. About 80% consists of
participant data that will be collected prospectively in the course
of the trial using dedicated, trial-specific data entry instruments  
(trial-specific forms, screens, etc.). But about 20% of the data is  
retrospective patient data needed to establish baseline demographics  
and "pre-existing conditions" (which is clinical trials lingo for  
"what other diseases/conditions does the patient have?") etc. For a  
given patient, this data would be pretty much the same from trial to
trial.

This retrospective data "lives" in existing electronic medical records  
stored in existing clinical data stores (almost always relational DBs  
that are the backends to the clinic/hospital electronic medical record  
(EMR) application/system). Mostly, these have proprietary table  
designs that are specific to the EMR vendor. Although these designs  
may sometimes exploit standard coding systems as table keys, just as  
often they use entirely vendor/system-local keys.

Traditionally, when patients have been enrolled into a trial the CTMO  
sends a data management specialist to each site, who works with the  
local DB administrator to develop an extract-transform-load (ETL)  
scenario to get the required data out of the local DB into the trials  

But wouldn't it be much nicer if a trial administrator for a given
CTMO could treat all the site-specific databases as if they were a
single large database with a common "virtual schema"? He or she could
then formulate the ETL scenario just once (it wouldn't be ETL anymore
at that point; it would just be a query). Even better, if all the
CTMOs could form a consortium and agree to treat all the site-specific
databases according to a single, shared "virtual schema", they could
all benefit.

Important: This schema would cover only items of specific interest in  
this application, namely (a) demographic information and (b) pre- 
existing conditions. Thus, it would be what Vipul called a "niche  

The consortium structure is ideal for a web-based implementation,  
because then there's no overhead in setting up your infrastructure.  
You could, of course, formulate a virtual RDB schema to support  
queries, but why not take advantage of the web-friendliness and  
extensibility of RDF/OWL to specify your shared vocabulary, and  
formulate your queries in SPARQL?
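
To make that concrete, here is a minimal sketch of what a query over
the shared vocabulary might look like, built from Python. The namespace
URI and the term names (Patient, birthDate, sex, preExistingCondition)
are invented for illustration; the real vocabulary would be whatever
the consortium agrees on.

```python
# Hypothetical shared vocabulary namespace; not a real published ontology.
CT = "http://example.org/ct-vocab#"


def build_query(properties):
    """Build a SPARQL SELECT asking for the given vocabulary properties per patient."""
    variables = " ".join(f"?{p}" for p in properties)
    lines = [
        f"PREFIX ct: <{CT}>",
        "",
        f"SELECT ?patient {variables}",
        "WHERE {",
        "  ?patient a ct:Patient .",
    ]
    # One triple pattern per requested property (all hypothetical term names).
    for p in properties:
        lines.append(f"  ?patient ct:{p} ?{p} .")
    lines.append("}")
    return "\n".join(lines)


query = build_query(["birthDate", "sex", "preExistingCondition"])
print(query)
```

The point is only that the CTMO writes this once, against the shared
terms, and never sees a site's table names.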

A participating CTMO would transmit its SPARQL to participating sites.
(In a variant, and cooler, scenario, CTMOs don't even need to know
who the participating sites are; the sites control their own
participation by subscribing through a proxy to a feed that carries
the query stream.)
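
The feed variant can be sketched as a simple publish/subscribe
arrangement. This is an in-memory stand-in only: the QueryFeed class,
the site names, and the handler shape are all assumptions, and a real
proxy would of course sit on the network.

```python
class QueryFeed:
    """In-memory stand-in for the proxy feed that carries the query stream."""

    def __init__(self):
        self._subscribers = []

    def subscribe(self, handler):
        """A site opts in by registering a callable that answers queries."""
        self._subscribers.append(handler)

    def publish(self, sparql):
        """Send one query to every subscriber and collect their answers."""
        # The publisher never inspects the subscriber list itself.
        return [handler(sparql) for handler in self._subscribers]


feed = QueryFeed()
# Hypothetical sites; each handler just echoes its name and the query length.
feed.subscribe(lambda q: ("site-A", len(q)))
feed.subscribe(lambda q: ("site-B", len(q)))

results = feed.publish("SELECT ?patient WHERE { ?patient a ?type }")
print(results)
```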

But what to do when the SPARQL hits the clinical sites? You install at  
each participating site a D2R server that listens for queries. For  
each site you have created a D2R mapping file covering RDF-to-SQL  
translation over the fairly limited shared vocabulary of interest  
(demographics, maybe 50 concepts; pre-existing conditions, maybe 100  
concepts). Creation of the mapping file is manual, labor-intensive and  
unique to each site, but only needs to be done once.
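
As a rough illustration of what a per-site mapping has to accomplish,
here is a sketch using an in-memory SQLite database. It is a Python
stand-in, not D2R's actual mapping file format, and the table name,
column names, and gender codes are invented to look like a
vendor-specific EMR schema.

```python
import sqlite3

# Hypothetical site-local EMR table with vendor-specific names and codes.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE PT_MASTER (PT_ID INTEGER, DOB TEXT, GNDR_CD TEXT)")
conn.executemany(
    "INSERT INTO PT_MASTER VALUES (?, ?, ?)",
    [(1, "1950-02-11", "2"), (2, "1963-07-30", "1")],
)

# Site-specific mapping: shared concept -> (table, column, local-code translation).
SITE_MAPPING = {
    "birthDate": ("PT_MASTER", "DOB", None),
    "sex": ("PT_MASTER", "GNDR_CD", {"1": "male", "2": "female"}),
}


def fetch_concept(conn, concept):
    """Return (patient_id, value) pairs for one shared-vocabulary concept."""
    table, column, codes = SITE_MAPPING[concept]
    # Identifiers come from the trusted mapping above, never from the query text.
    sql = f"SELECT PT_ID, {column} FROM {table} ORDER BY PT_ID"
    rows = conn.execute(sql).fetchall()
    if codes is None:
        return rows
    # Translate vendor-local codes into the shared vocabulary's values.
    return [(pid, codes.get(value, value)) for pid, value in rows]


print(fetch_concept(conn, "sex"))  # [(1, 'female'), (2, 'male')]
```

Each site would have its own version of SITE_MAPPING; the query side
stays the same everywhere.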

The participating site agrees to let the D2R engine be a database user
with appropriate privileges. But what is "appropriate"? This is
very tricky. Since "Mr. D2R" will be transmitting the data it pulls
on to third parties (the CTMOs), this raises difficult policy
concerns regarding specificity of consent, and is a separate part of
our proposal.

Another problem: Some of the pre-existing conditions data is not  
actually going to be represented in granular form in most clinical  
DBs; it will more often be embedded someplace in text blobs (e.g.  
problem lists, discharge summaries, etc.). So you may need to perform  
some tricks in formulating your SQL queries by incorporating some text  
searching, pipelining in some natural language processing steps, or  
worst case maintaining auxiliary full-text indexes on the object  
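
A very naive version of the text-searching trick might look like the
sketch below. The synonym patterns and the sample note are made up, and
plain pattern matching like this is only a placeholder for real NLP.

```python
import re

# Hypothetical synonym patterns for two shared-vocabulary conditions.
CONDITION_PATTERNS = {
    "diabetes": re.compile(r"\b(diabetes|diabetic|t2dm|niddm)\b", re.IGNORECASE),
    "hypertension": re.compile(
        r"\b(hypertension|htn|high blood pressure)\b", re.IGNORECASE
    ),
}


def conditions_in_text(note):
    """Return the sorted condition names whose patterns appear in a free-text note."""
    return sorted(
        name for name, pattern in CONDITION_PATTERNS.items() if pattern.search(note)
    )


note = "Problem list: HTN, well controlled. T2DM dx 2001. No known drug allergies."
print(conditions_in_text(note))  # ['diabetes', 'hypertension']
```

Note that this ignores negation ("no history of diabetes" would still
match), which is one reason the results would need manual review.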

We never thought about rule-based mappings, as in the COI proposal;  
that's something I'd have to think about. It seems like a good idea.  
It would involve heavier-weight inferencing than I had anticipated. In  
our scenario, we would have limited ourselves to mappings of the type  
natively accommodated in the D2R Map language, which are rather simple  
correspondences. We assumed the results would be "approximate" and  
require some degree of manual post-processing. With rules on board, in  
addition to the server and a basic query translation engine, a policy  
enforcement module, and a text processor, you'd also have to be  
running a rules engine. Still, it might not matter much; system load  
might be a few dozen queries a day, but definitely not a few dozen a

Anyway, that's the basic idea.


Received on Tuesday, 24 June 2008 03:16:59 UTC