- From: John Madden <madden.jf@gmail.com>
- Date: Mon, 23 Jun 2008 23:16:09 -0400
- To: Eric Prud'hommeaux <eric@w3.org>
- Cc: public-hcls-coi@w3.org
Hi Eric et al,

I'm pleased you asked about the project with D2R. I'll describe it, though I have no idea whether it will get funded. It's not nearly as ambitious a project as the one COI is proposing, but it may be of interest.

Our use case is also clinical trials. The scenario we're targeting is this: you have one or more Clinical Trials Management Organizations (CTMOs), each managing one or more clinical trials and enrolling patients at multiple clinical sites. At any given site, multiple trials and multiple CTMOs may be active simultaneously or sequentially, and a given patient may participate simultaneously or sequentially in more than one trial.

Each trial has its own data requirements. Roughly 80% is participant data that will be prospectively collected in the course of the trial using dedicated, trial-specific data entry instruments (trial-specific forms, screens, etc.). But about 20% is retrospective patient data needed to establish baseline demographics and "pre-existing conditions" (clinical trials lingo for "what other diseases/conditions does the patient have?"). For a given patient, this data would be pretty much the same from trial to trial.

This retrospective data "lives" in existing electronic medical records stored in existing clinical data stores, almost always relational DBs that are the backends to the clinic/hospital electronic medical record (EMR) application/system. Mostly, these have proprietary table designs that are specific to the EMR vendor. Although these designs may sometimes exploit standard coding systems as table keys, just as often they use entirely vendor/system-local keys.

Traditionally, when patients are enrolled in a trial, the CTMO sends a data management specialist to each site, who works with the local DB administrator to develop an extract-transform-load (ETL) scenario to get the required data out of the local DB and into the trials DB. But wouldn't it be much nicer if a trial administrator for a given CTMO could treat all the site-specific databases as if they were a single large database with a common "virtual schema", and formulate the ETL scenario just once? (It wouldn't be ETL anymore then; it would just be a query.) Even better, if all the CTMOs could form a consortium and agree to treat all the site-specific databases according to a single, shared "virtual schema", they could all benefit.

Important: this schema would cover only items of specific interest in this application, namely (a) demographic information and (b) pre-existing conditions. Thus, it would be what Vipul called a "niche ontology".

The consortium structure is ideal for a web-based implementation, because then there's no overhead in setting up your infrastructure. You could, of course, formulate a virtual RDB schema to support queries, but why not take advantage of the web-friendliness and extensibility of RDF/OWL to specify your shared vocabulary, and formulate your queries in SPARQL? A participating CTMO would transmit its SPARQL to participating sites. (In a variant--and cooler--scenario, CTMOs don't even need to know who the participating sites are; the sites control their own participation by subscribing through a proxy to a feed that carries the query stream.)

But what to do when the SPARQL hits the clinical sites? You install at each participating site a D2R server that listens for queries.
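To make the query side concrete, here is a rough sketch of the kind of SPARQL a CTMO might send. The ct: namespace and the class/property names are invented for illustration; they stand in for whatever shared "niche ontology" the consortium actually agrees on:

    PREFIX ct: <http://example.org/trials-niche-vocab#>

    # Pull baseline demographics plus any recorded pre-existing conditions
    SELECT ?patient ?birthDate ?sex ?conditionCode ?conditionLabel
    WHERE {
      ?patient a ct:TrialParticipant ;
               ct:birthDate ?birthDate ;
               ct:administrativeSex ?sex .
      OPTIONAL {
        ?patient ct:hasPreExistingCondition ?condition .
        ?condition ct:code ?conditionCode ;
                   ct:label ?conditionLabel .
      }
    }

Each site's D2R server would rewrite something like this into SQL against its own vendor-specific tables.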
For each site, you have created a D2R mapping file covering RDF-to-SQL translation over the fairly limited shared vocabulary of interest (demographics, maybe 50 concepts; pre-existing conditions, maybe 100 concepts). Creation of the mapping file is manual, labor-intensive, and unique to each site, but it only needs to be done once. (A toy sketch of what such a mapping might look like is below my signature.) The participating site agrees to let the D2R engine be a database user with appropriate privileges. But what is "appropriate"? That is very tricky: since the D2R server will be transmitting the data it pulls on to third parties (the CTMOs), this gets into difficult policy concerns regarding specificity of consent, and is a separate part of our proposal.

Another problem: some of the pre-existing conditions data is not actually going to be represented in granular form in most clinical DBs; more often it will be embedded somewhere in text blobs (e.g. problem lists, discharge summaries, etc.). So you may need to perform some tricks in formulating your SQL queries: incorporating some text searching, pipelining in some natural language processing steps, or, worst case, maintaining auxiliary full-text indexes on the underlying database.

We never thought about rule-based mappings, as in the COI proposal; that's something I'd have to think about. It seems like a good idea, but it would involve heavier-weight inferencing than I had anticipated. In our scenario, we would have limited ourselves to mappings of the type natively accommodated in the D2R Map language, which are rather simple correspondences, and we assumed the results would be "approximate" and require some degree of manual post-processing. With rules on board, in addition to the server, a basic query translation engine, a policy enforcement module, and a text processor, you'd also have to be running a rules engine. Still, it might not matter much; system load might be a few dozen queries a day -- but definitely not a few dozen a second!

Anyway, that's the basic idea.

John
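P.S. As promised, here is a toy sketch of a site mapping in the style of the D2RQ mapping language used by D2R Server, just to show the flavor of the simple correspondences I mean. The table/column names, JDBC details, and the ct: vocabulary are all invented for illustration:

    @prefix map:  <#> .
    @prefix d2rq: <http://www.wiwiss.fu-berlin.de/suhl/bizer/D2RQ/0.1#> .
    @prefix ct:   <http://example.org/trials-niche-vocab#> .

    # Site-local connection details (hypothetical)
    map:LocalEMR a d2rq:Database ;
        d2rq:jdbcDSN "jdbc:oracle:thin:@emr-host:1521:EMR" ;
        d2rq:jdbcDriver "oracle.jdbc.OracleDriver" ;
        d2rq:username "d2r_readonly" .

    # One row in the (hypothetical) PT_MASTER table -> one ct:TrialParticipant
    map:Patient a d2rq:ClassMap ;
        d2rq:dataStorage map:LocalEMR ;
        d2rq:uriPattern "patient/@@PT_MASTER.PT_ID@@" ;
        d2rq:class ct:TrialParticipant .

    # Vendor-specific date-of-birth column -> shared vocabulary property
    map:PatientBirthDate a d2rq:PropertyBridge ;
        d2rq:belongsToClassMap map:Patient ;
        d2rq:property ct:birthDate ;
        d2rq:column "PT_MASTER.DOB" .

Multiply that by a hundred or so properties per site and you have the mapping file; the labor-intensive part is deciding, column by column, what the local data actually means.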
Received on Tuesday, 24 June 2008 03:16:59 UTC