Re: [HCLS] Challenges and goals for the HCLS Semantic Web community in the next years

On 08/11/2007, Matthias Samwald <samwald@gmx.at> wrote:
>
>
> As I will not be able to attend the F2F meeting in Boston this week, and the
> teleconference connection will probably not work (as teleconferences usually
> do), I have collected my thoughts on several aspects of our Semantic Web
> developments in the following text.
>
> Topics:
> GENERAL APPROACH AND DESIGN PHILOSOPHY
> WEB USER INTERFACES
> ONTOLOGICAL FOUNDATIONS
> DOMAIN ONTOLOGIES
> LIFE SCIENCE AND HEALTH CARE SPECIFIC TOPICS
> MAKING THE SEMANTIC WEB GROW TOGETHER: IDENTIFIERS, FINDING RESOURCES,
> TRUST
> COMMERCIALISATION STRATEGIES
>
> --------------
>
> GENERAL APPROACH AND DESIGN PHILOSOPHY
>
> ** Small incremental steps and legacy support VS. radically new approaches
> **
> I think the community should become less reluctant to apply Semantic Web
> technologies in radically new ways. For example, instead of describing
> digital resources which themselves describe entities of interest (such as
> database records in Uniprot describing proteins), we should focus on
> describing those entities of interest directly -- without taking a detour
> through describing database entries and other artifacts of the pre-Semantic
> Web era.
> Of course, there are cases where such 'legacy support' is needed for
> pragmatic reasons, but I think that in the majority of cases there is no
> practical advantage at all.
> RDF/OWL is not only a syntactically more flexible alternative to current
> database systems; it enables a whole new philosophy of how information can
> be organized. If we want to demonstrate the real advantages of the Semantic
> Web, we need to be bold enough to break with current habits.

You risk breaking interoperability *within* the Semantic Web if you
insist on removing commonly accepted identifiers from real objects.
Attaching them to more realistic identifiers is an option, but it
needs to be an accepted strategy for each identifier, such as through
the HUGO project, or it will become just another confusing aspect for
someone trying to come to terms with the whole issue.

> ** Focus solely on technical aspects VS. focus also on institutional /
> sociological / legal context **
> Many of the ideas inside our community cannot be realized when we solely
> focus on the technical aspects of Semantic Web technologies. We want to make
> significant change in the HCLS community happen, e.g., widespread use of
> structured digital abstracts or better communication between bench and
> bedside. Some of this work actually has nothing to do with Semantic Web
> technologies and is therefore outside the scope of the W3C HCLS interest
> group, so we might need to find other platforms to organize these things. I
> guess Science Commons (http://sciencecommons.org/) might become even more
> important for our work than it already is.

One issue I see with some of this work is that until something is
published, it is essentially unreviewed data, and sharing it with the
community does not always further the end goal of the Semantic Web,
i.e., to integrate quality data from multiple sources using standard
machine-readable formats.

Other issues concern whether you are allowed to release data gained
through government or private funding without it going through
standard scientific review channels.

> --------------
>
> WEB USER INTERFACES
>
>  ** Flexible but unergonomic VS. inflexible but user friendly **
> This is a choice we are facing with any kind of user interface for a
> Semantic Web application. RDF/OWL is so flexible that it is very hard to
> create user interfaces to display arbitrary information in an appealing way.
> Many of the current RDF browsers produce lists of entities and relations
> that look raw and uninviting. This can be remedied by creating user
> interfaces that are specialized for certain domains, as we did with 'Entrez
> Neuron' (current prototype at http://gouda.med.yale.edu:8087/).
> Striking a balance between user friendliness and flexibility will be one of
> the most difficult problems we are facing in the development of GUIs.

Some of the current Semantic Web interfaces are the most spartan UIs
that I have seen in many years. 'Inflexible' probably means that the
user is restricted to doing only things deemed meaningful by the
author; if they want to do more, they are required to use a more
flexible but less intuitive interface to extend the possibilities of
the original user-friendly approach. This may mean seeing dreaded RDF
statements or URIs...

> ** User interface ideas that should receive more attention **
> - Autocomplete fields / interfaces that motivate re-use of existing
> entities. Example: the newly started Okkam project (http://www.okkam.org/)
> is building an extension for Protégé to allow users to find existing entities
> that they can re-use. The Sindice project (http://sindice.com) provides a
> fast and scalable index of Semantic Web resources through a simple web API.

I see Protégé as just a developer tool, although it has the necessary
backend componentry to drive a reasonable user application if a nice
UI were designed for it. Autocomplete hopefully does not mean
completing a literal RDF triple. Hopefully user interfaces for the
Semantic Web can move past using RDF triples as their basic element
for displaying information to users. Not everything is a triple,
amazingly enough...

> - Open query builders with a social component. The Leipzig DBpedia query
> interface (http://wikipedia.aksw.org/) is a nice prototype. More
> knowledgeable users can create queries from scratch and share them with
> others, less knowledgeable users can pick existing queries and make some
> minor modifications for their needs. Such a system could make use of social
> dynamics, e.g., rating of queries to rank the most useful ones first;
> profiling of user interests to suggest those queries that cater to the needs
> of specific user groups. The Leipzig query interface also demonstrates the
> usefulness of the auto-complete feature.

Wikipedia is a sketchy backend, particularly as some community groups
within it are refusing to use the infoboxes that DBpedia seems to rely
on. For example, one of their featured queries
(http://wikipedia.aksw.org/index.php?qid=28) relies on the community
of classical music enthusiasts on Wikipedia restoring infoboxes to all
the composer-related pages. The queries also look very spartan at the
moment. Feeding the RDF output of one of their queries into another
interface would be an interesting idea, though.
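
To make that idea concrete, here is a minimal sketch of pulling query
results from a SPARQL endpoint into another application. It uses the
Python SPARQLWrapper library; the DBpedia endpoint URL is the public
one, and the query is deliberately generic rather than one of the
shared queries from the Leipzig interface:

    from SPARQLWrapper import SPARQLWrapper, JSON

    # Point at a public SPARQL endpoint (DBpedia here; any endpoint works).
    endpoint = SPARQLWrapper("http://dbpedia.org/sparql")

    # A deliberately generic query: fetch a few English labels.
    endpoint.setQuery("""
        PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
        SELECT ?thing ?label
        WHERE {
            ?thing rdfs:label ?label .
            FILTER ( lang(?label) = "en" )
        }
        LIMIT 10
    """)
    endpoint.setReturnFormat(JSON)

    # The decoded JSON bindings could then be fed into any other interface.
    for binding in endpoint.query().convert()["results"]["bindings"]:
        print(binding["thing"]["value"], "-", binding["label"]["value"])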

> - Semantic Web Pipes / modularized RDF/OWL data flows. Such systems could be
> inspired by Yahoo Pipes, and could also have a social component. A prototype
> of such a system is http://pipes.deri.org/

Yahoo Pipes (http://pipes.yahoo.com) is a reasonably nice interface
for mashing up Semantic Web content, and if combined with a good set
of RDFisers for the rest of the web it could be the direction for
linking the web together. Extensions using their framework could be
built for any number of subject areas, as long as they haven't
presumed RSS to be the only usable format at too deep a level.

Pipes.deri.org looks reasonably confusing... If it had a graphical
interface it would be more obvious whether there were pipes in it; as
it stands it just looks like a web (text) editor for low-level XML
statements. Everything has to develop, though.

The concept of "filtering" knowledge, as opposed to searching for it,
is an all-too-common element of Semantic Web interfaces that I don't
think is particularly intuitive.

> - Interfaces resembling text editors. Such interfaces could enable a much
> faster way of creating and querying RDF/OWL compared to current ontology
> editors. Of course, they need to offer the user assistance in the form of
> auto-completion, type checking, text formatting etc. I made a prototype of
> such an interface at: http://neuroscientific.net/leeet/

Creating OWL documents is a reasonably complex business. I would
prefer the Protégé method over a text editor, which I would consider a
backwards step in general, even if it makes some things easier for a
few knowledgeable people. But to each their own, I guess.

> - Semantic Wikis based on RDF/OWL triple stores. The best example at the
> moment is OntoWiki (http://ontowiki.net/).  Such dedicated Wiki systems
> should be distinguished from systems  that merely add a thin layer of RDF on
> top of a normal, text based Wiki system (like Semantic MediaWiki). The
> latter are not suitable to support the creation of large, consistent RDF/OWL
> knowledge bases, in my opinion.

I don't like the method Semantic MediaWiki uses for inserting semantic
statements into text documents, but on the other hand I see the result
as more generally useful in Web 2.0 terms than OntoWiki, which is too
focused on structured knowledge at the expense of prose. OntoWiki-type
applications have their place where people just want to
collaboratively design a knowledge base that does not need extensive
prose sections or organisational elements.

> - Spreadsheets. Spreadsheets are very common tools for data entry /
> organization in science. Making an elegant and meaningful mapping between
> spreadsheets and biomedical domain ontologies possible would be an important
> goal. Again, the goal should not be to describe the structure and content of
> the spreadsheet in RDF, but rather to describe biomedical reality as
> directly as possible. http://rdf123.umbc.edu/ seems to be an interesting
> project in this area.

This would be very interesting for dealing easily with the MAGE-TAB
file format in particular, as they recently decided to depart from the
easier-to-process XML because it was harder for end users to work
with.
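
As a rough sketch of how spreadsheet rows could be mapped directly to
statements about biomedical reality rather than about the spreadsheet
itself, here is a minimal example using the Python rdflib library. All
URIs and property names are hypothetical, not taken from any published
ontology:

    import csv
    import io

    from rdflib import Graph, Literal, Namespace

    EX = Namespace("http://example.org/bio/")  # hypothetical vocabulary

    # A one-row stand-in for a spreadsheet of expression measurements.
    sheet = io.StringIO("gene,tissue,concentration\ngeneY,liver,0.8\n")

    g = Graph()
    for i, row in enumerate(csv.DictReader(sheet)):
        mrna = EX["mRNA_%s_%d" % (row["gene"], i)]
        # Describe the biology, not the spreadsheet cells:
        g.add((mrna, EX.transcribedFrom, EX[row["gene"]]))
        g.add((mrna, EX.locatedIn, EX[row["tissue"]]))
        g.add((mrna, EX.hasConcentration, Literal(float(row["concentration"]))))

    print(g.serialize(format="turtle"))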

> ** User interface ideas that turned out to be impractical and should receive
> less attention **
> - Graphs in almost any form and size.

I disagree... As I said before though, each to their own.

> - Emulating the interface of an ontology editor like Protégé inside the web
> browser.

Definitely agree there. People use the web to develop knowledge, not
to develop structures for knowledge.

>
> --------------
>
> ONTOLOGICAL FOUNDATIONS
>
> ** Heterogeneity reduction: unrestricted but heterogeneous VS. restricted
> but homogeneous (using foundational ontology) **
> If we look at the RDF/OWL datasets that are currently part of the 'HCLS demo'
> we can see that their structures are quite heterogeneous. Every data source
> is structured in a very unique way, so that someone writing a query spanning
> several data sources needs a deep understanding of each data source to make
> it work.

This is where RDFisers come into play, particularly their use within
interfaces where people are not "creating queries" but rather
exploring their subject area in a slightly more generic way. I like
SIMILE Exhibit in this respect, but I am not sure how intuitive its
interface would be for an uninformed user. I guess it comes down to
knowledge about an area being necessary before you can retrieve
meaning from the related data.
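
To make the heterogeneity point concrete, here is a minimal sketch
(with hypothetical namespaces and property names) of the same
gene-protein fact shaped differently by two sources. Anyone writing a
spanning query has to know, and enumerate, both shapes:

    from rdflib import Graph

    # The same fact, structured differently by two hypothetical sources.
    source_a = """
    @prefix a: <http://example.org/sourceA/> .
    a:proteinX a:encodedBy a:geneY .
    """
    source_b = """
    @prefix b: <http://example.org/sourceB/> .
    b:geneY b:encodes b:proteinX .
    """

    g = Graph()
    g.parse(data=source_a, format="turtle")
    g.parse(data=source_b, format="turtle")

    # A spanning query has to enumerate every source-specific shape.
    results = g.query("""
        SELECT ?gene ?protein
        WHERE {
            { ?protein <http://example.org/sourceA/encodedBy> ?gene }
            UNION
            { ?gene <http://example.org/sourceB/encodes> ?protein }
        }
    """)
    for row in results:
        print(row.gene, row.protein)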

> ** Granularity dependent VS. granularity independent **
> Granularity-dependent ontologies (such as BFO) force us to index each
> ontology to a certain granularity (like 'atom', 'molecule', 'cell',
> 'organism'). Things that are classified as an 'object' in one granularity
> are classified as an 'aggregate' in another granularity, placing them in
> disjoint class hierarchies and thereby making the integration across scales
> more difficult. Since such an integration across scales is probably one of
> our major targets, we may want to explore the advantages and disadvantages
> of ontologies that are granularity independent.

At the base level, OWL-DL at least may not be designed to handle the
number of possibilities that real knowledge has and that humans
understand intuitively, because we think about things in a different
way. For instance, we have no trouble thinking about my cat and cats
in the same sentence, and we can even reason with them, with possibly
questionable logic, but this is something a semantic reasoner
struggles to understand.

While that was not directly relevant to what you said, it shows the
realities of trying to design a useful structured format for
representing real-world data. I disagree that something is useless
just because reasoners cannot interpret it, or that we should
fine-tune information until it is no longer useful to humans. The
concept of emergence is dealt with very poorly in the current set of
ontologies, as they assume that everything can be reduced to its
components, i.e., the conceptual object...

> ** Dealing with time: 3D VS. widespread reification of relations VS. 4D **
> The representation of time (or rather, the change of relations between
> entities during time) has received relatively little attention so far. Many
> ontologies we are currently using -- including those based on BFO -- are
> based on the '3D' perspective: Physical objects (e.g. proteins, persons)
> persist in time and do not have temporal parts. This causes problems when we
> are dealing with change over time, e.g., when we want to make the statement
> 'Eve - has hair colour - brown' at one point in time and 'Eve -has hair
> colour - grey' at another time. I can give examples from the HCLS domain if
> required.

You have to understand where each statement came from, i.e., what
"version" it came from, before you can properly resolve these issues.
Currently it is presumed that databases can be updated as iterations
on an essentially constant universe.

> The only way to deal with this in many of our current ontologies would be to
> index each ontology to a certain time. Eve would have brown hair in one
> ontology and grey hair in another ontology. However, at the moment it is
> still quite undecided how such indexing would be practically implemented in
> RDF/OWL, and how many problems such indexing would cause for our goal of
> easy and widespread information integration. It is possible that our current
> ontologies lead us down a road where we will encounter a lot of trouble when
> we finally need to take care about time.

Easy, widespread information integration requires one to acknowledge
which entities are time-dependent. I would like to see a knowledge
base that didn't need to consider itself consistent before it gave out
information about items. If you could build such a system and make it
useful, you would be most of the way to avoiding the difficulties we
currently have with objects and changes in their properties over time.

> Therefore, we should explore how temporal changes can be represented without
> some obscure mechanism of ontology indexing.

Make it an explicit part of the production process. Computers already
have this with the last-modified properties on files. Version control
systems already have it with their realisation of the concept of
versions, and of differences between versions.

> One possibility would be to reify most of the relations between entities and
> to attach a temporal index to each relation. However, this would add a lot
> of unnecessary complexity in cases where we actually do not care about
> temporal aspects.
> Another possibility (favored by me at the moment) would be to build 4D
> ontologies where physical objects can have temporal parts. For example, we
> could say that 'Eve at age 20' and 'Eve at age 60' are two temporal parts of
> Eve. The great advantage of this approach is that it keeps our ontologies
> simple when we do not want to care about temporal aspects. For example, we
> can simply say 'Eve has hair colour brown' now. When 40 years have passed,
> and we discover that Eve's hair has turned gray, we can refine our
> description of Eve by saying that the first Eve we described was merely one
> temporal part of her, and that there is another temporal part of Eve with
> gray hair.

How would you represent this formally?
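
A rough guess at how it might look, purely as a sketch using the
Python rdflib library (the temporalPartOf and hasHairColour terms are
hypothetical, not taken from any published ontology):

    from rdflib import Graph, Namespace

    EX = Namespace("http://example.org/4d/")  # hypothetical vocabulary
    g = Graph()

    # 4D style: Eve is a four-dimensional whole, and each description
    # applies to one of her temporal parts.
    g.add((EX.Eve_age20, EX.temporalPartOf, EX.Eve))
    g.add((EX.Eve_age20, EX.hasHairColour, EX.Brown))

    # Forty years later a second temporal part is added, without
    # revising anything said about the first one.
    g.add((EX.Eve_age60, EX.temporalPartOf, EX.Eve))
    g.add((EX.Eve_age60, EX.hasHairColour, EX.Grey))

    print(g.serialize(format="turtle"))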

> --------------
>
> DOMAIN ONTOLOGIES
>
> ** The role of human readable text inside datatype properties **
> Google demonstrates that querying unstructured documents might not be
> perfect, but it can often provide a very quick and intuitive mechanism for
> finding information. I have the feeling that the Semantic Web community is
> sometimes so focused on providing structured data/metadata that we forget
> about that unstructured information kept inside datatype properties is a
> useful target for mining/querying as well.
> Finding the right balance between explicit information in RDF triples and
> implicit information inside the values of datatype properties could turn out
> to be quite important.

I doubt that there is a "right balance" in the full sense of the word.
I like the current situation on Wikipedia, with semi-structured
elements in "infoboxes" alongside free-text prose. However, it would
depend on my use of the data whether that would be more informative
and useful to me than, for instance, an extensive structured document
which breaks each piece of free text down into its components...

> ** Class vs. instance/individual **
> One should be aware that the distinction between class and individual is not
> an arbitrary syntactic choice. It should also not be confused with the use
> of 'class' and 'instance' in object oriented programming, or the distinction
> between 'schema' and 'data' in database systems.
> In almost all ontologies, individuals are things that are located at a
> certain space and time. In most of our projects, we do not want to make
> statements about a certain serotonin receptor protein we saw swimming in our
> Petri dish; rather, we want to be able to make general statements about
> certain classes of serotonin receptor proteins, which can be shared with and
> further refined by other participants of the HCLS community.
> One problem we have encountered with the extensive use of classes in some
> ontologies of the HCLS demo was that the underlying RDF graphs became very
> complicated.  This is caused by the representation of OWL class property
> restrictions in RDF. We should explore ways to lessen this problem, e.g., by
> creating simpler RDF representations for some OWL constructs.

Do you have an example of a document where it would have been equally
informative to represent information using a more compact format?
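
For reference, the verbosity being described is easy to reproduce: a
single OWL axiom such as "every serotonin receptor binds some
serotonin" expands to several RDF triples plus a blank node. A minimal
sketch with the Python rdflib library (the class and property names
are hypothetical):

    from rdflib import BNode, Graph, Namespace
    from rdflib.namespace import OWL, RDF, RDFS

    EX = Namespace("http://example.org/bio/")  # hypothetical vocabulary
    g = Graph()

    # One axiom: SerotoninReceptor subClassOf (bindsLigand some Serotonin).
    restriction = BNode()
    g.add((restriction, RDF.type, OWL.Restriction))
    g.add((restriction, OWL.onProperty, EX.bindsLigand))
    g.add((restriction, OWL.someValuesFrom, EX.Serotonin))
    g.add((EX.SerotoninReceptor, RDFS.subClassOf, restriction))

    # Four triples and an anonymous node for what reads as one statement.
    print(g.serialize(format="turtle"))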

> ** Domain ontologies we need in the near future **
> - An ontologically consistent, OBO Foundry-compliant ontology for molecular
> interactions and pathways. BioPAX-OBO is a new development in that area.
> Personally I have also made some first developments in that area (e.g., the
> 'OBO Essentials' ontology).
> - An ontologically consistent, OBO Foundry-compliant ontology for microarray
> experiments

Have you spoken to the MGED people about this issue?

> - An ontology of proteins and protein structures (e.g.
> http://proteinontology.info/ ?)
>
> --------------
>
> LIFE SCIENCE AND HEALTH CARE SPECIFIC TOPICS
>
> ** Focus on description of experimental procedures, interventions and
> results VS. focus on description of nature **
> Some projects in the HCLS community focus on describing the process of
> scientific investigation, experimental procedures and their results (e.g.
> OBI, http://obi.sourceforge.net/), while others focus on describing the
> objects of these investigations directly.
> To give a concrete example, we can describe protein expression either
> through describing a microarray assay ("a cell from organism X was
> extracted, pixels on the microarray corresponding to gene Y had value Z"),
> or by describing physiology ("organism X has part cell, gene Y mRNA has
> location cell, gene Y mRNA has concentration Z").
> In my opinion, the consistent description of nature should have a higher priority
> than the description of experimental procedures. After all, our resources to
> generate structured data are limited, and we should focus our energies on
> describing our objects of investigation rather than every detail of our work
> in the lab.

If you omit details about how you reached your conclusion, people
cannot verify it. I would imagine this would be akin to people not
showing how they arrived at a result and expecting others to just
agree with them about it. If the process of generating structured data
were simpler, it would not need an ontology expert to perform the
action, and the resource problem might be eliminated.

> --------------
>
> MAKING THE SEMANTIC WEB GROW TOGETHER: IDENTIFIERS, FINDING RESOURCES,
> TRUST
>
> ** Trust: coarse, location based VS. fine-grained, statement based. **
> I think that rather than implementing the complicated trust metrics
> described in academic publications over the recent years (fine-grained
> networks of trust, based on RDF), we will probably implement much simpler
> mechanisms to determine whether a piece of RDF/OWL we encounter on the web
> is trustworthy or not. Just like on the current web, trust will be mostly
> based on the location of the RDF/OWL resource, i.e. on the server. Some
> central websites will bundle some resources in central indices, users will
> choose between those central websites and different 'perspectives' of the
> resources on the global Semantic Web.
>
> ** Identifiers: huge sameAs services VS. strict enforcement of reuse of
> existing entities **
> We are currently steering towards a Semantic Web with a high degree of
> redundancy in terms of identifiers / URIs. URIs for things that are
> essentially the same are being generated with a breathtaking pace, and a
> mapping between these entities is often not technically feasible (who wants
> to load a mapping file between Uniprot record URIs minted by Science Commons
> and those minted by Uniprot itself?).
> This problem has two causes:
> - technically, it is often quite hard to find existing resources. This needs
> to be addressed by the creation of services that allow for the quick
> retrieval of existing resources during ontology creation (this is the goal
> of http://sindice.com or the OKKAM project).
> - socially, many people are very reluctant to re-use entities that have a
> URI with a foreign namespace. This problem is still underestimated and has
> already done a lot of damage to the development of the Semantic Web. The use
> of PURLs eases this problem a bit, as they are perceived as a more neutral
> ground. Personally, I still believe that the use of completely opaque URIs
> (like 'urn:uuid:c2f41010-65b3-11d1-a29f-00aa00c14882') might be an
> interesting option, although this would be against the principles of the
> 'open linked data' initiative.

It kind of depends on how you conceptualise a URI. When Semantic Web
designers finally move past ontology design into large-scale use, some
of these issues may become irrelevant. I personally do not like long
arbitrary strings (arbitrary for the sake of being arbitrary) because
they are very focused on a specific purpose. It is rather like how
binary document data can sometimes be read by only one application; if
you upgrade the application or no longer have access to it, you can no
longer get any use at all out of the document. At least with
http://blah.com/gi:543556 you can conceptualise gi as a namespace and
attempt to search for it in other ways.
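
The mapping problem mentioned above is easy to picture: every
redundant pair of identifiers needs its own owl:sameAs triple. A
minimal sketch with the Python rdflib library, where both URIs are
hypothetical stand-ins for independently minted identifiers:

    from rdflib import Graph, URIRef
    from rdflib.namespace import OWL

    g = Graph()

    # Two hypothetical URIs independently minted for the same protein record.
    g.add((URIRef("http://example.org/mirror/uniprot/P12345"),
           OWL.sameAs,
           URIRef("http://example.org/uniprot/P12345")))

    # Multiply this by millions of records and the mapping graph becomes
    # something nobody wants to load, which is the complaint above.
    print(g.serialize(format="turtle"))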

I wouldn't say that purl.org is the answer to the namespace issue. If
you are worried about these issues and look into it, each
purl.org/blah/ domain is still operated by a specific entity, which is
free to redirect you to the final website using its own rules. What if
purl.org goes down, for instance... And should purl.org addresses for
the Semantic Web return HTML, or RDF? Who decides...

> Many of these and other questions are addressed in Jonathan's texts about
> URIs (http://sw.neurocommons.org/2007/uri-note/).
>
> --------------
>
> COMMERCIALISATION STRATEGIES
>
> ** Of course, Semantic Web technologies can be used both on the public
> internet as well as the intranet of organizations (laboratory,
> pharmaceutical industry, healthcare providers). However, I am interested in
> the possibility of making the public, global Semantic Web commercially
> useful. We should think about scenarios where the value of the Semantic Web
> for commercial enterprises is not solely based on using the technologies
> locally, but also on becoming part of the global HCLS Semantic Web
> community; donating Semantic Web resources where possible and, at the same
> time, profiting from the donations of others. An open source data and
> knowledge economy.

An "open source data and knowledge economy" essentially doesn't make
sense. Open source emphasises the fact that no one owns the data,
while an economy encourages people to trade their resources for value.

Drug companies are already using the extensive public knowledge bases,
but they also keep their own knowledge bases hidden because of the
threat an open data environment poses to their investments.

> The role of a Semantic Web company in such a scenario would not only be to
> tailor software applications to the specific needs of customers (i.e. HCLS
> institutions), but also to help customers become a 'good citizen' of the
> global Semantic Web - for their own benefit.

Having all companies operate only as service providers is an
interesting challenge, but one which doesn't necessarily pay for the
extensive research and development you need before you can offer
services, and it doesn't pay the cost of keeping open data sources
open. How many open source companies make money without offering
commercial product licenses to their customers?

> ** Revenue from advertisements **
> Revenue through targeted advertisements on websites is financing large
> portions of the current public web. Non-governmental institutions that plan
> to offer information resources on the Semantic Web need to be able to get
> some revenue from placing advertisements. In some cases, e.g., when the
> information is not offered through some HTML page but through a SPARQL
> endpoint, it is currently difficult to place targeted advertisements. It is
> important for the sustained growth of the public Semantic Web to explore
> strategies for placing advertisements in such scenarios.

SPARQL endpoints are not traditionally thought of as places where
arbitrary information such as advertising is given out. Do customers
want RDF graphs full of advertisements? How do you plan on filtering
such information while honouring agreements not to filter it in return
for getting the content for free?

> Because information and context is much more explicit in Semantic Web
> resources than on normal web pages, the potential for targeted advertising
> (similar to Google's AdSense) is huge.

Potential is high for user-readable information, but at the machine
level, where advertising is irrelevant anyway, it would be very
confusing.

> ------------------------------
> ------------------------------
>
>
> Many of the items are presented as choices between A or B, as I have the
> impression that such bold distinctions encourage feedback. Of course, most
> of them are not either-or choices but are rather continua where the best
> solution lies somewhere in between (but not necessarily in the middle).
> If there is interest in some of these topics, please reply so they can be
> discussed in more detail. If anyone is interested in extending this
> unorganized note to a publishable review or some W3C document, I would be
> happy to participate.

Overall, not a bad document. It suffers because the current quality of
applications and development on the Semantic Web does not offer enough
concrete examples to provide useful scenarios, but that is the overall
issue, I guess... Which came first, the chicken or the egg? :)

> Cheers,
> Matthias Samwald
>
> ---
> About me: http://neuroscientific.net/curriculum
>

Received on Thursday, 8 November 2007 03:05:18 UTC