[HCLS] Challenges and goals for the HCLS Semantic Web community in the next years from Matthias Samwald on 2007-11-07 (public-semweb-lifesci@w3.org from November 2007)

From: Matthias Samwald <samwald@gmx.at>
Date: Wed, 7 Nov 2007 18:12:30 +0100
To: <public-semweb-lifesci@w3.org>
Cc: "Holger Stenzhorn" <holger.stenzhorn@ifomis.uni-saarland.de>, "Giovanni Tummarello" <giovanni.tummarello@deri.org>, "Handschuh, Siegfried" <siegfried.handschuh@deri.org>, "Klaus-Peter Adlassnig" <klaus-peter.adlassnig@meduniwien.ac.at>, "Andreas Blumauer \(Semantic Web Company\)" <a.blumauer@semantic-web.at>, "Fox, Ronan" <ronan.fox@deri.org>
Message-ID: <526CD84AE3F14084ADF82047D28F5C10@tessellate>
As I will not be able to attend the F2F meeting in Boston this week, and the 
teleconference connection will probably not work (as is usual with 
teleconferences), I have collected my thoughts on several aspects of our 
Semantic Web developments in the following text.

Topics:
GENERAL APPROACH AND DESIGN PHILOSOPHY
WEB USER INTERFACES
ONTOLOGICAL FOUNDATIONS
DOMAIN ONTOLOGIES
LIFE SCIENCE AND HEALTH CARE SPECIFIC TOPICS
MAKING THE SEMANTIC WEB GROW TOGETHER: IDENTIFIERS, FINDING RESOURCES, 
TRUST
COMMERCIALISATION STRATEGIES

--------------

GENERAL APPROACH AND DESIGN PHILOSOPHY

** Small incremental steps and legacy support VS. radically new 
approaches **
I think the community should become less reluctant to apply Semantic Web 
technologies in radically new ways. For example, instead of describing 
digital resources which themselves describe entities of interest (such as 
database records in Uniprot describing proteins), we should focus on 
describing those entities of interest directly -- without taking a detour 
through describing database entries and other artifacts of the pre-Semantic 
Web era.
Of course, there are cases where such 'legacy support' is needed for 
pragmatic reasons, but I think that in the majority of cases there is no 
practical advantage at all.
RDF/OWL is not only a syntactically more flexible alternative to current 
database systems; it enables a whole new philosophy of how information can 
be organized. If we want to demonstrate the real advantages of the Semantic 
Web, we need to be bold enough to break with current habits.

** Focus solely on technical aspects VS. focus also on institutional / 
sociological / legal context **
Many of the ideas inside our community cannot be realized when we solely 
focus on the technical aspects of Semantic Web technologies. We want to make 
significant change in the HCLS community happen, e.g., widespread use of 
structured digital abstracts or better communication between bench and 
bedside. Some of this work actually has nothing to do with Semantic Web 
technologies and is therefore outside the scope of the W3C HCLS interest 
group, so we might need to find other platforms to organize these things. I 
guess Science Commons (http://sciencecommons.org/) might become even more 
important for our work than it already is.

--------------

WEB USER INTERFACES

** Flexible but unergonomic VS. inflexible but user friendly **
This is a choice we are facing with any kind of user interface for a 
Semantic Web application. RDF/OWL is so flexible that it is very hard to 
create user interfaces to display arbitrary information in an appealing way. 
Many of the current RDF browsers produce lists of entities and relations 
that look raw and uninviting. This can be remedied by creating user 
interfaces that are specialized for certain domains, as we did with 'Entrez 
Neuron' (current prototype at http://gouda.med.yale.edu:8087/).
Striking a balance between user friendliness and flexibility will be one of 
the most difficult problems we are facing in the development of GUIs.

** User interface ideas that should receive more attention **
- Autocomplete fields / interfaces that motivate re-use of existing 
entities. Example: the newly started Okkam project (http://www.okkam.org/) 
is building an extension for Protégé to allow users to find existing entities 
that they can re-use. The Sindice project (http://sindice.com) provides a 
fast and scalable index of Semantic Web resources through a simple web API.
- Open query builders with a social component. The Leipzig DBpedia query 
interface (http://wikipedia.aksw.org/) is a nice prototype. More 
knowledgeable users can create queries from scratch and share them with 
others; less knowledgeable users can pick existing queries and make some 
minor modifications for their needs. Such a system could make use of social 
dynamics, e.g., rating of queries to rank the most useful ones first; 
profiling of user interests to suggest those queries that cater to the needs 
of specific user groups. The Leipzig query interface also demonstrates the 
usefulness of the auto-complete feature.
- Semantic Web Pipes / modularized RDF/OWL data flows. Such systems could be 
inspired by Yahoo Pipes, and could also have a social component. A prototype 
of such a system is http://pipes.deri.org/
- Interfaces resembling text editors. Such interfaces could enable a much 
faster way of creating and querying RDF/OWL compared to current ontology 
editors. Of course, they need to offer the user assistance in the form of 
auto-completion, type checking, text formatting etc. I made a prototype of 
such an interface at: http://neuroscientific.net/leeet/
- Semantic Wikis based on RDF/OWL triple stores. The best example at the 
moment is OntoWiki (http://ontowiki.net/). Such dedicated Wiki systems 
should be distinguished from systems that merely add a thin layer of RDF on 
top of a normal, text-based Wiki system (like Semantic MediaWiki). The 
latter are not suitable to support the creation of large, consistent RDF/OWL 
knowledge bases, in my opinion.
- Spreadsheets. Spreadsheets are very common tools for data entry / 
organization in science. Making an elegant and meaningful mapping between 
spreadsheets and biomedical domain ontologies possible would be an important 
goal. Again, the goal should not be to describe the structure and content of 
the spreadsheet in RDF, but rather to describe biomedical reality as 
directly as possible. http://rdf123.umbc.edu/ seems to be an interesting 
project in this area.
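
As a rough illustration of the spreadsheet point above, the following Python 
sketch turns the rows of a small spreadsheet into triples about the things 
the rows describe, rather than triples about rows and cells. The column 
names, the 'ex:' prefix and the property names are invented for this example 
and are not taken from RDF123 or from any existing ontology.

```python
import csv
import io

# Hypothetical spreadsheet content; the column names are invented.
SHEET = """neuron,neurotransmitter,brain_region
Purkinje cell,GABA,cerebellum
granule cell,glutamate,cerebellum
"""

def iri(label):
    """Mint an illustrative identifier in a made-up 'ex:' namespace."""
    return "ex:" + label.strip().replace(" ", "_")

def rows_to_triples(text):
    """Describe the entities named in each row, not the row itself."""
    triples = []
    for row in csv.DictReader(io.StringIO(text)):
        neuron = iri(row["neuron"])
        # Each column is mapped to a (hypothetical) domain relation.
        triples.append((neuron, "ex:releases", iri(row["neurotransmitter"])))
        triples.append((neuron, "ex:located_in", iri(row["brain_region"])))
    return triples

triples = rows_to_triples(SHEET)
for triple in triples:
    print(triple)
```

Note that nothing in the output mentions rows, columns or cells; the 
spreadsheet is only the data entry surface.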

** User interface ideas that turned out to be impractical and should receive 
less attention **
- Graphs in almost any form and size.
- Emulating the interface of an ontology editor like Protégé inside the web 
browser.

--------------

ONTOLOGICAL FOUNDATIONS

** Heterogeneity reduction: unrestricted but heterogeneous VS. restricted 
but homogeneous (using foundational ontology) **
If we look at the RDF/OWL datasets that are currently part of the 'HCLS 
demo', we can see that their structures are quite heterogeneous. Each data 
source is structured in its own way, so that someone writing a query spanning 
several data sources needs a deep understanding of each data source to make 
it work.

** Granularity dependent VS. granularity independent **
Granularity-dependent ontologies (such as BFO) force us to index each 
ontology to a certain granularity (like 'atom', 'molecule', 'cell', 
'organism'). Things that are classified as an 'object' in one granularity 
are classified as an 'aggregate' in another granularity, placing them in 
disjoint class hierarchies and thereby making the integration across scales 
more difficult. Since such an integration across scales is probably one of 
our major targets, we may want to explore the advantages and disadvantages 
of ontologies that are granularity independent.

** Dealing with time: 3D VS. widespread reification of relations VS. 4D **
The representation of time (or rather, the change of relations between 
entities during time) has received relatively little attention so far. Many 
ontologies we are currently using -- including those based on BFO -- are 
based on the '3D' perspective: Physical objects (e.g. proteins, persons) 
persist in time and do not have temporal parts. This causes problems when we 
are dealing with change over time, e.g., when we want to make the statement 
'Eve - has hair colour - brown' at one point in time and 'Eve - has hair 
colour - grey' at another time. I can give examples from the HCLS domain if 
required.
The only way to deal with this in many of our current ontologies would be to 
index each ontology to a certain time. Eve would have brown hair in one 
ontology and grey hair in another ontology. However, at the moment it is 
still quite undecided how such indexing would be practically implemented in 
RDF/OWL, and how many problems such indexing would cause for our goal of 
easy and widespread information integration. It is possible that our current 
ontologies lead us down a road where we will encounter a lot of trouble when 
we finally need to deal with time.
Therefore, we should explore how temporal changes can be represented without 
some obscure mechanism of ontology indexing.
One possibility would be to reify most of the relations between entities and 
to attach a temporal index to each relation. However, this would add a lot 
of unnecessary complexity in cases where we actually do not care about 
temporal aspects.
Another possibility (favored by me at the moment) would be to build 4D 
ontologies where physical objects can have temporal parts. For example, we 
could say that 'Eve at age 20' and 'Eve at age 60' are two temporal parts of 
Eve. The great advantage of this approach is that it keeps our ontologies 
simple when we do not want to care about temporal aspects. For example, we 
can simply say 'Eve has hair colour brown' now. When 40 years have passed, 
and we discover that Eve's hair has turned grey, we can refine our 
description of Eve by saying that the first Eve we described was merely one 
temporal part of her, and that there is another temporal part of Eve with 
grey hair.
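
The difference between the last two options can be made concrete with a small 
Python sketch that stores triples as plain tuples. The 'rdf:subject', 
'rdf:predicate' and 'rdf:object' terms are the standard RDF reification 
vocabulary; everything in the 'ex:' namespace ('ex:holds_at', 
'ex:temporal_part_of', 'ex:Eve_at_20' and so on) is made up for illustration 
and does not come from BFO or any other ontology.

```python
# Untimed description: a single triple, no temporal machinery at all.
simple = [
    ("ex:Eve", "ex:has_hair_colour", "ex:brown"),
]

# Reified relations: every time-indexed fact needs its own statement node
# plus four triples (an rdf:type rdf:Statement triple is omitted here).
reified = [
    ("ex:stmt1", "rdf:subject", "ex:Eve"),
    ("ex:stmt1", "rdf:predicate", "ex:has_hair_colour"),
    ("ex:stmt1", "rdf:object", "ex:brown"),
    ("ex:stmt1", "ex:holds_at", "age 20"),
    ("ex:stmt2", "rdf:subject", "ex:Eve"),
    ("ex:stmt2", "rdf:predicate", "ex:has_hair_colour"),
    ("ex:stmt2", "rdf:object", "ex:grey"),
    ("ex:stmt2", "ex:holds_at", "age 60"),
]

# 4D: temporal parts are ordinary entities with ordinary relations.
four_d = [
    ("ex:Eve_at_20", "ex:temporal_part_of", "ex:Eve"),
    ("ex:Eve_at_20", "ex:has_hair_colour", "ex:brown"),
    ("ex:Eve_at_60", "ex:temporal_part_of", "ex:Eve"),
    ("ex:Eve_at_60", "ex:has_hair_colour", "ex:grey"),
]

def hair_colours(triples, person):
    """Hair colours asserted for a person or for any of their temporal parts."""
    parts = {s for s, p, o in triples
             if p == "ex:temporal_part_of" and o == person}
    parts.add(person)
    return sorted(o for s, p, o in triples
                  if p == "ex:has_hair_colour" and s in parts)

print(hair_colours(simple, "ex:Eve"))
print(hair_colours(four_d, "ex:Eve"))
```

The same query function works on the untimed and on the 4D description, 
whereas the reified form would need a different query that walks through the 
statement nodes.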

--------------

DOMAIN ONTOLOGIES

** The role of human readable text inside datatype properties **
Google demonstrates that querying unstructured documents might not be 
perfect, but it can often provide a very quick and intuitive mechanism for 
finding information. I have the feeling that the Semantic Web community is 
sometimes so focused on providing structured data/metadata that we forget 
that unstructured information kept inside datatype properties is a useful 
target for mining/querying as well.
Finding the right balance between explicit information in RDF triples and 
implicit information inside the values of datatype properties could turn out 
to be quite important.

** Class vs. instance/individual **
One should be aware that the distinction between class and individual is not 
an arbitrary syntactic choice. It should also not be confused with the use 
of 'class' and 'instance' in object oriented programming, or the distinction 
between 'schema' and 'data' in database systems.
In almost all ontologies, individuals are things that are located at a 
certain space and time. In most of our projects, we do not want to make 
statements about a certain serotonin receptor protein we saw swimming in our 
Petri dish; rather, we want to be able to make general statements about 
certain classes of serotonin receptor proteins, which can be shared with and 
further refined by other participants of the HCLS community.
One problem we have encountered with the extensive use of classes in some 
ontologies of the HCLS demo was that the underlying RDF graphs became very 
complicated. This is caused by the representation of OWL class property 
restrictions in RDF. We should explore ways to lessen this problem, e.g., by 
creating simpler RDF representations for some OWL constructs.
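
To show where the complexity comes from, the following Python sketch writes 
out the triples behind one class-level statement next to the single triple 
needed for the corresponding statement about individuals. The 'rdf:', 'rdfs:' 
and 'owl:' terms follow the standard RDF serialisation of an OWL existential 
restriction; all 'ex:' names, and the collapsed view produced by 
class_relations(), are hypothetical and only meant as one possible reading of 
'simpler RDF representations'.

```python
# Individual level: "this receptor is located in this membrane" is one triple.
instance_level = [
    ("ex:receptor_42", "ex:located_in", "ex:membrane_7"),
]

# Class level: "every serotonin receptor is located in some plasma membrane"
# needs a blank node (_:r1) and four triples.
class_level = [
    ("ex:SerotoninReceptor", "rdfs:subClassOf", "_:r1"),
    ("_:r1", "rdf:type", "owl:Restriction"),
    ("_:r1", "owl:onProperty", "ex:located_in"),
    ("_:r1", "owl:someValuesFrom", "ex:PlasmaMembrane"),
]

def class_relations(triples):
    """Collapse someValuesFrom restrictions into (class, property, filler).

    This is a simplified view for querying and display, not an OWL construct.
    """
    on_property = {s: o for s, p, o in triples if p == "owl:onProperty"}
    filler = {s: o for s, p, o in triples if p == "owl:someValuesFrom"}
    return [
        (s, on_property[o], filler[o])
        for s, p, o in triples
        if p == "rdfs:subClassOf" and o in on_property and o in filler
    ]

print(class_relations(class_level))
```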

** Domain ontologies we need in the near future **
- An ontologically consistent, OBO Foundry-compliant ontology for molecular 
interactions and pathways. BioPAX-OBO is a new development in that area. 
Personally I have also made some first developments in that area (e.g., the 
'OBO Essentials' ontology).
- An ontologically consistent, OBO Foundry-compliant ontology for microarray 
experiments
- An ontology of proteins and protein structures (e.g. 
http://proteinontology.info/ ?)

--------------

LIFE SCIENCE AND HEALTH CARE SPECIFIC TOPICS

** Focus on description of experimental procedures, interventions and 
results VS. focus on description of nature **
Some projects in the HCLS community focus on describing the process of 
scientific investigation, experimental procedures and their results (e.g. 
OBI, http://obi.sourceforge.net/), while others focus on describing the 
objects of these investigations directly.
To give a concrete example, we can describe protein expression either 
through describing a microarray assay ("a cell from organism X was 
extracted, pixels on the microarray corresponding to gene Y had value Z"), 
or by describing physiology ("organism X has part cell, gene Y mRNA has 
location cell, gene Y mRNA has concentration Z").
In my opinion, the consistent description of nature should have a higher priority 
than the description of experimental procedures. After all, our resources to 
generate structured data are limited, and we should focus our energies on 
describing our objects of investigation rather than every detail of our work 
in the lab.

--------------

MAKING THE SEMANTIC WEB GROW TOGETHER: IDENTIFIERS, FINDING RESOURCES, 
TRUST

** Trust: coarse, location based VS. fine-grained, statement based. **
I think that rather than implementing the complicated trust metrics 
described in academic publications over the recent years (fine-grained 
networks of trust, based on RDF), we will probably implement much simpler 
mechanisms to determine whether a piece of RDF/OWL we encounter on the web 
is trustworthy or not. Just like on the current web, trust will be mostly 
based on the location of the RDF/OWL resource, i.e. on the server. Some 
central websites will bundle some resources in central indices; users will 
choose between those central websites and different 'perspectives' on the 
resources of the global Semantic Web.
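
A coarse, location-based check of this kind needs very little machinery. The 
following Python sketch is only an assumption about how it might look: the 
host names are placeholders, and a real client would of course let the user 
manage the list of trusted servers.

```python
from urllib.parse import urlparse

# Hypothetical set of servers the user has chosen to trust.
TRUSTED_HOSTS = {"index.example.org", "ontologies.example.net"}

def is_trusted(resource_url, trusted_hosts=TRUSTED_HOSTS):
    """Trust a resource purely on the basis of the server it is loaded from."""
    # hostname is None for identifiers without a network location
    # (e.g. urn:uuid:...), which are therefore never trusted by this rule.
    host = urlparse(resource_url).hostname or ""
    return host in trusted_hosts

print(is_trusted("http://index.example.org/hcls/receptors.owl"))
print(is_trusted("http://unknown.example.com/receptors.owl"))
```

No individual statement is inspected; the decision is made once per server, 
which is what makes the approach cheap compared to statement-based trust 
networks.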

** Identifiers: huge sameAs services VS. strict enforcement of reuse of 
existing entities **
We are currently steering towards a Semantic Web with a high degree of 
redundancy in terms of identifiers / URIs. URIs for things that are 
essentially the same are being generated at a breathtaking pace, and a 
mapping between these entities is often not technically feasible (who wants 
to load a mapping file between Uniprot record URIs minted by Science Commons 
and those minted by Uniprot itself?).
This problem has two causes:
- technically, it is often quite hard to find existing resources. This needs 
to be addressed by the creation of services that allow for the quick 
retrieval of existing resources during ontology creation (this is the goal 
of http://sindice.com or the OKKAM project).
- socially, many people are very reluctant to re-use entities that have a 
URI with a foreign namespace. This problem is still underestimated and has 
already done a lot of damage to the development of the Semantic Web. The use 
of PURLs eases this problem a bit, as they are perceived as a more neutral 
ground. Personally, I still believe that the use of completely opaque URIs 
(like 'urn:uuid:c2f41010-65b3-11d1-a29f-00aa00c14882') might be an 
interesting option, although this would be against the principles of the 
'open linked data' initiative.
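
Both sides of this trade-off can be sketched in a few lines of Python. The 
first function mints an opaque identifier of the 'urn:uuid:...' form using 
the standard uuid module. The second shows what a sameAs service has to do 
at minimum: merge pairwise sameAs statements into groups and pick one 
representative per group (here with a small union-find). The URIs in the 
example are placeholders, and choosing the lexicographically smallest URI as 
representative is an arbitrary assumption.

```python
import uuid

def mint_opaque_uri():
    """Mint an opaque identifier that carries no namespace of any party."""
    return uuid.uuid4().urn

# Hypothetical sameAs statements between URIs minted by different parties.
same_as = [
    ("http://a.example.org/protein/X1", "http://b.example.org/record/X1"),
    ("http://b.example.org/record/X1", "http://c.example.org/entry/X1"),
]

def canonical_map(pairs):
    """Map every URI to one representative of its sameAs group."""
    parent = {}

    def find(uri):
        parent.setdefault(uri, uri)
        while parent[uri] != uri:
            parent[uri] = parent[parent[uri]]  # path halving
            uri = parent[uri]
        return uri

    for left, right in pairs:
        root_left, root_right = find(left), find(right)
        if root_left != root_right:
            # Keep the lexicographically smaller URI as representative.
            keep, drop = sorted((root_left, root_right))
            parent[drop] = keep
    return {uri: find(uri) for uri in list(parent)}

print(mint_opaque_uri())
for uri, representative in canonical_map(same_as).items():
    print(uri, "->", representative)
```

Every additional party that mints its own URIs adds to the mapping that has 
to be loaded and maintained, which is the cost that re-using existing 
identifiers avoids.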

Many of these and other questions are addressed in Jonathan's texts about 
URIs (http://sw.neurocommons.org/2007/uri-note/).

--------------

COMMERCIALISATION STRATEGIES

Of course, Semantic Web technologies can be used both on the public 
internet and on the intranets of organizations (laboratories, 
pharmaceutical industry, healthcare providers). However, I am interested in 
the possibility of making the public, global Semantic Web commercially 
useful. We should think about scenarios where the value of the Semantic Web 
for commercial enterprises is not solely based on using the technologies 
locally, but also on becoming part of the global HCLS Semantic Web 
community; donating Semantic Web resources where possible and, at the same 
time, profiting from the donations of others. An open source data and 
knowledge economy.
The role of a Semantic Web company in such a scenario would not only be to 
tailor software applications to the specific needs of customers (i.e. HCLS 
institutions), but also to help customers become 'good citizens' of the 
global Semantic Web - for their own benefit.

** Revenue from advertisements **
Revenue through targeted advertisements on websites is financing large 
portions of the current public web. Non-governmental institutions that plan 
to offer information resources on the Semantic Web need to be able to get 
some revenue from placing advertisements. In some cases, e.g., when the 
information is not offered through some HTML page but through a SPARQL 
endpoint, it is currently difficult to place targeted advertisements. It is 
important for the sustained growth of the public Semantic Web to explore 
strategies for placing advertisements in such scenarios.
Because information and context are much more explicit in Semantic Web 
resources than on normal web pages, the potential for targeted advertising 
(similar to Google's AdSense) is huge.

------------------------------
------------------------------


Many of the items are presented as choices between A and B, as I have the 
impression that such bold distinctions encourage feedback. Of course, most 
of them are not either-or choices but are rather continua where the best 
solution lies somewhere in between (but not necessarily in the middle).
If there is interest in some of these topics, please reply so they can be 
discussed in more detail. If anyone is interested in extending this 
unorganized note into a publishable review or some W3C document, I would be 
happy to participate.

Cheers,
Matthias Samwald

---
About me: http://neuroscientific.net/curriculum
Received on Wednesday, 7 November 2007 17:13:13 UTC