A survey of work done within HCLS

There's a one-day CSAIL (MIT's Computer Science and AI Lab, which hosts W3C)
workshop where all the grad students and professors get together and
talk about their work. Oshani (Cc'd) is organizing this event and I
told her I'd stand up for 20 or 30 mins (I forget) to talk about HCLS
with the goal of enticing students to work with us. Following is a
brief outline (expected to be fleshed out to two pages by the end of the
day) of what we can write in the proceedings. Any help or prepared
material greatly appreciated. Likewise, guidance from Oshani on what
would be useful to have in the proceedings. @@ indicates that I don't
just want input, I neeeed it.

Intro:
W3C is an international industrial standards organization. Where IETF
standardizes internet wire protocols, we standardize web payloads. You
may have heard of HTML, XML, Semantic Web...

We cover work in a broad set of domains: Interaction, Ubiquitous Web,
Accessibility, and the catch-all, Technology and Society. Following is
an introduction to the work of one group, the Semantic Web in Health
Care and Life Sciences Interest Group (HCLS IG).

As the name suggests, the folks in the HCLS group are focused on the
application of Semantic Web technologies to the challenges in their
domains. The participants come from:
  life sciences: proteomics, neurology, genetics
  health care: hospitals, clinics, insurance companies
  and everything in between: pharmaceuticals, clinical research organizations

Each of these concentrations incurs large costs when classifying and
sharing their copious knowledge, and when integrating data with
conceptually adjacent concentrations. This leads to losses in money,
productivity and satisfaction (no one enjoys using post-its to do work
that a computer should do). One of the principal obstacles to sharing
knowledge is agreeing on a coding system. If I say
  \1People\04\0ID\0FN\0LN\0ADR\0\212817\0Eric\0Prud'hommeaux\023\0
you might, with sufficient interest and patience, guess that I'm
dumping some binary database. A form like
  <People>
    <ID>12817</ID><FN>Eric</FN><LN>Prud'hommeaux</LN><ADR href="#a23">
  ...
is more human-parsable, but doesn't tell you if this is the same Eric
as any other Eric on the web ("12817" is still ambiguous). The
Semantic Web combines simple declarations using URIs for
disambiguation with a culture of schema re-use and extension for
maximum interoperability.
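
For instance, here is a minimal sketch of that record restated with
global names (Python with the rdflib library; the example.org URIs are
hypothetical stand-ins, while FOAF is a real, widely re-used schema):

  # The person gets a URI anyone on the web can reference; the
  # properties come from a shared vocabulary, not private column names.
  from rdflib import Graph, Namespace, Literal
  from rdflib.namespace import FOAF, RDF

  EX = Namespace("http://example.org/people/")  # hypothetical namespace
  g = Graph()
  g.bind("foaf", FOAF)

  eric = EX["12817"]  # a global name, not a table-local key
  g.add((eric, RDF.type, FOAF.Person))
  g.add((eric, FOAF.givenName, Literal("Eric")))
  g.add((eric, FOAF.familyName, Literal("Prud'hommeaux")))
  print(g.serialize(format="turtle"))

Here is how the HCLS IG is applying these ideas: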

Terminology:
The foundation of unambiguous statements is unambiguous
terms. Consistent, sharable identifiers connect the vertices of
conceptually adjacent graphs, providing a spanning schema with very
little coding effort. That's the ideal world. In reality, any coding
system takes effort, and the real engineering comes in making a system
where, by carrot or stick, people or systems are incented to find the
correct term for e.g. a protein receptor or increased pulmonary edema
due to a failing atrioventricular valve. Because we want to use
the existing infrastructure and corpus of data, we need to re-use
existing term sets and extend them to give us unambiguous semantics
which machines can use.
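
As a sketch of what such an extension looks like (rdflib again; the
URIs and class names below are invented for illustration, not real
SNOMED CT identifiers):

  from rdflib import Graph, Namespace
  from rdflib.namespace import RDFS

  SNOMED = Namespace("http://example.org/snomed/")     # hypothetical
  CLINIC = Namespace("http://example.org/ourclinic/")  # hypothetical

  g = Graph()
  # Declaring the local refinement as a specialization of the shared
  # term means queries written against the shared term still match
  # records coded with the local one.
  g.add((CLINIC.AcutePulmonaryEdema, RDFS.subClassOf,
         SNOMED.PulmonaryEdema))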

There are about 20 @@JohnM? medical and anatomical term sets in
popular use in clinics today. They've largely grown organically, with
insufficient mechanism to prevent either duplication or ambiguous
definitions. Given different use cases, they've captured different
levels of formal relationships between the terms. For instance, many
SNOMED terms are related by an |isa| relationship, but that stands for
both |type| and |subclass| (as well as a few other relations).

Different use cases motivate different intimacies of models. For
instance, SNOMED can be expressed in the Semantic Web by simply
quoting this noncommittal |isa|, or we can express *some* of these isa
relationships as inherently transitive subclass relationships. SNOMED
has been expressed in very general non-transitive languages and in
intimate description logic languages which can help you debug your
model by discovering inconsistencies and unsatisfiable classes.
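
A sketch of the two renderings, with invented URIs standing in for
real SNOMED terms and rdflib's transitive walk standing in for a
description logic reasoner:

  from rdflib import Graph, Namespace
  from rdflib.namespace import RDFS

  SNOMED = Namespace("http://example.org/snomed/")  # hypothetical
  g = Graph()

  # Noncommittal: quote the source's isa link as an opaque property.
  g.add((SNOMED.PulmonaryEdema, SNOMED.isa, SNOMED.LungDisorder))

  # Committed: render *some* isa links as subclass links...
  g.add((SNOMED.PulmonaryEdema, RDFS.subClassOf, SNOMED.LungDisorder))
  g.add((SNOMED.LungDisorder, RDFS.subClassOf, SNOMED.Disorder))

  # ...so a transitive walk can infer PulmonaryEdema is also a Disorder.
  for parent in g.transitive_objects(SNOMED.PulmonaryEdema,
                                     RDFS.subClassOf):
      print(parent)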


BioRDF:
This task force started by contributing neurology and micro-anatomical
data to a large data warehouse with the goal of answering drug
discovery queries. This work involved modeling existing databases as
RDF and the mechanics of converting and dumping that data into this
materialized view of the Semantic Web.

The group is continuing with the modeling aspect, though now the
conversion is done by having Semantic Web query wrappers around
existing databases, reducing storage and latency. This work has
inspired extensions to the SPARQL query language, which should be
incorporated into the standard within a year.
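
From the client side, such a wrapper simply looks like a SPARQL
endpoint. A hedged sketch (Python with the SPARQLWrapper library; both
endpoints and the ex: vocabulary are hypothetical), using SERVICE, the
SPARQL 1.1 federation keyword that typifies the extensions in question:

  from SPARQLWrapper import SPARQLWrapper, JSON

  endpoint = SPARQLWrapper("http://example.org/neuro/sparql")
  endpoint.setQuery("""
      PREFIX ex: <http://example.org/neuro#>
      SELECT ?receptor ?region WHERE {
          ?receptor ex:expressedIn ?region .
          # SERVICE dispatches part of the query to a second wrapped
          # database, with no central warehouse in the middle:
          SERVICE <http://example.org/genes/sparql> {
              ?receptor ex:encodedBy ?gene .
          }
      }
  """)
  endpoint.setReturnFormat(JSON)
  for row in endpoint.query().convert()["results"]["bindings"]:
      print(row["receptor"]["value"], row["region"]["value"])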

Linking Open Drug Data:
Where the BioRDF Task Force focuses on neuroscience queries, the LODD
Task Force focuses on expressing the masses of publicly available
(pharmaceutical) drug data in the Semantic Web. While the FDA has
collected this data as part of the drug approval process, the data has
never been collated in a consistent form and has had very little use
beyond providing a paper trail for liability cases.

Building on the Linked Open Data (LOD) project, the LODD Task Force
extends this large crystal of interlinked data to enable use cases
like longitudinal studies of drug safety
and efficacy. The data comes from public sources like
|clinicaltrials.gov| as well as private contributions like |@@lilly's
data@@| and depends on the LOD cloud for terms for e.g. drugs, drug
classes, chemical compounds, etc.
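
A minimal sketch of the linking pattern (rdflib; every URI below is a
hypothetical stand-in for identifiers in sources like those above):

  from rdflib import Graph, Namespace
  from rdflib.namespace import OWL

  TRIALS = Namespace("http://example.org/trials/")  # hypothetical
  DRUGS  = Namespace("http://example.org/drugs/")   # hypothetical
  LODD   = Namespace("http://example.org/lodd#")    # hypothetical

  g = Graph()
  # A trial record points at a drug term kept in another dataset...
  g.add((TRIALS.NCT0000000, LODD.intervention, DRUGS.exampleStatin))
  # ...and owl:sameAs links records describing the same compound, which
  # is what lets longitudinal queries cross dataset boundaries.
  g.add((DRUGS.exampleStatin, OWL.sameAs, LODD.exampleStatin))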

Clinical Observations Interoperability:
Selecting the correct patients for a clinical study is critical to
measuring the safety and efficacy of the drug. While hospitals and
clinics have most of the data needed to find candidates, conventional
approaches are hampered by the diversity of schemas and insufficiently
intimate security models (often no access, full access, or access to
expensive anonymized dumps). This task force has used a simple
language to translate real hospital data to SemWeb-friendly views,
mapping from the relational database to a shared ontology based on the
HL7 RIM standards.

The mapping language enables query rewriting, which creates a virtual
view of the database, available on the Semantic Web in a number of
popular shared schemas. This task force produced a pipeline in which
a researcher was able to compose a query in researcher-speak, a rule
translated that to hospital-speak, another rule translated it to the
schema for an individual hospital, and finally the query was expressed
and executed as SQL.
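
A toy sketch of that pipeline in plain Python, with invented
vocabulary names at each stage and naive textual substitution standing
in for the real rule language:

  researcher_to_hospital = {
      "research:onMedication": "hl7:substanceAdministration"}
  hospital_to_local = {
      "hl7:substanceAdministration": "hospA:med_orders"}

  def rewrite(query: str, mapping: dict) -> str:
      # Apply one stage of the translation as string substitution.
      for old, new in mapping.items():
          query = query.replace(old, new)
      return query

  q = "SELECT ?p WHERE { ?p research:onMedication ?drug }"
  q = rewrite(q, researcher_to_hospital)  # researcher- to hospital-speak
  q = rewrite(q, hospital_to_local)       # to one hospital's schema
  print(q)  # a final stage would compile this down to SQL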

The group is now developing a security model using the same mapping
language, providing a correspondence between security levels in the
virtual views and those mandated by law and conventionally enforced in
XACML.

Translational Medicine Ontology:
Translational medicine is an area of pharmacology which incorporates
data from a wide set of sources. The goal of "getting the right
medication to the right person at the right time" requires access to
many aspects of the patient's health, physiology and behavior,
possible chemical and biological reactions associated with the
candidate medication, the patient's diet and metabolism, and the history
of data gathered during drug studies and post-market data
acquisition. Translational medicine is perhaps the ultimate data
integration use case.


This task force is drawing on expertise from several pharmaceutical
companies to create a network of ontologies. Starting from a set of
health care roles and the questions they would ask, the group is
creating the infrastructure, both conceptual and programmatic, to
answer these life-saving questions.

Scientific Discourse:
The Alzforum <http://www.alzforum.org/> has provided researchers with
a gathering and dissemination point which has become the focal point
(@@too strong?) of Alzheimer research. The Drupal plugin Science
Collaboration Framework provides the core functionality for that
system, as well as providing a testbed for how increased coding can
improve the utility and user experience.

At the core is a scientific discourse ontology which describes
theories, citations, hypotheses and evidence. This is tied to a
popular Semantic Web schema for associating personas with publications,
including modern variants like blogs and wiki articles. The product is
a representation of supporting and conflicting theories, chains of
evidence, etc. Use cases of course include finding scientists with
certain areas of interest/expertise, as well as surprising ones like
finding necessary research areas based on conflicting theories.
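
A minimal sketch of the pattern (rdflib; the disc: vocabulary is
invented here to show the shape, not the group's actual ontology):

  from rdflib import Graph, Namespace, Literal
  from rdflib.namespace import FOAF, RDF

  DISC = Namespace("http://example.org/discourse#")  # hypothetical
  ALZ  = Namespace("http://example.org/alz/")        # hypothetical

  g = Graph()
  g.add((ALZ.amyloidHypothesis, RDF.type, DISC.Hypothesis))
  g.add((ALZ.paper42, DISC.supports, ALZ.amyloidHypothesis))
  g.add((ALZ.paper57, DISC.conflictsWith, ALZ.amyloidHypothesis))

  # Tie the discourse to people via a popular schema (FOAF):
  g.add((ALZ.paper42, DISC.author, ALZ.aResearcher))
  g.add((ALZ.aResearcher, RDF.type, FOAF.Person))
  g.add((ALZ.aResearcher, FOAF.name, Literal("A. Researcher")))

  # Finding evidence that conflicts with any hypothesis is one query:
  q = """PREFIX disc: <http://example.org/discourse#>
         SELECT ?paper WHERE { ?paper disc:conflictsWith ?h }"""
  for (paper,) in g.query(q):
      print(paper)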

Invitation:
You've seen a taste of one group at W3C, and perhaps have a sense of
what else we do. We have no shortage of interesting research ideas. I
invite any of you to come contribute your own expertise and
insights. You can take formal steps by filling in
<http://www.w3.org/2002/09/wbs/1/ieapp/> and joining a working group,
or just come to the W3C ghetto to talk with us and get an idea about
what we do. We look forward to working with you.
-- 
-eric

office: +1.617.258.5741 32-G528, MIT, Cambridge, MA 02144 USA
mobile: +1.617.599.3509

(eric@w3.org)
Feel free to forward this message to any list for any purpose other than
email address distribution.

There are subtle nuances encoded in font variation and clever layout
which can only be seen by printing this message on high-clay paper.
