- From: Matthias Samwald <samwald@gmx.at>
- Date: Wed, 7 Nov 2007 18:12:30 +0100
- To: <public-semweb-lifesci@w3.org>
- Cc: "Holger Stenzhorn" <holger.stenzhorn@ifomis.uni-saarland.de>, "Giovanni Tummarello" <giovanni.tummarello@deri.org>, "Handschuh, Siegfried" <siegfried.handschuh@deri.org>, "Klaus-Peter Adlassnig" <klaus-peter.adlassnig@meduniwien.ac.at>, "Andreas Blumauer \(Semantic Web Company\)" <a.blumauer@semantic-web.at>, "Fox, Ronan" <ronan.fox@deri.org>
As I will not be able to attend the F2F meeting in Boston this week, and the teleconference connection will probably not work (as teleconferences usually do), I have collected my thoughts on several aspects of our Semantic Web developments in the following text.

Topics:
- GENERAL APPROACH AND DESIGN PHILOSOPHY
- WEB USER INTERFACES
- ONTOLOGICAL FOUNDATIONS
- DOMAIN ONTOLOGIES
- LIFE SCIENCE AND HEALTH CARE SPECIFIC TOPICS
- MAKING THE SEMANTIC WEB GROW TOGETHER: IDENTIFIERS, FINDING RESOURCES, TRUST
- COMMERCIALISATION STRATEGIES

--------------

GENERAL APPROACH AND DESIGN PHILOSOPHY

** Small incremental steps and legacy support VS. radically new approaches **

I think the community should become less reluctant to apply Semantic Web technologies in radically new ways. For example, instead of describing digital resources which themselves describe entities of interest (such as database records in Uniprot describing proteins), we should focus on describing those entities of interest directly -- without taking a detour through describing database entries and other artifacts of the pre-Semantic Web era. Of course, there are cases where such 'legacy support' is needed for pragmatic reasons, but I think that in the majority of cases there is no practical advantage at all. RDF/OWL is not only a syntactically more flexible alternative to current database systems; it enables a whole new philosophy of how information can be organized. If we want to demonstrate the real advantages of the Semantic Web, we need to be bold enough to break with current habits.
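To make the contrast concrete, here is a rough Turtle sketch of the two styles. All URIs, prefixes and property names below are invented for illustration only; they are not taken from any real Uniprot or Science Commons vocabulary:

    @prefix ex:   <http://example.org/> .
    @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

    # Record-centric: we describe the database entry about the protein.
    ex:uniprot-record-12345 a ex:DatabaseRecord ;
        ex:accession "P12345" ;    # placeholder accession
        ex:recordDescribes "a serotonin receptor protein" .

    # Entity-centric: we describe the protein itself; the record is kept
    # at most as a pointer for provenance.
    ex:serotonin-receptor-1A a ex:Protein ;
        rdfs:label "serotonin receptor 1A" ;
        ex:bindsTo ex:serotonin ;
        rdfs:seeAlso ex:uniprot-record-12345 .

In the second form, queries can be phrased in terms of the biology (proteins, binding) rather than in terms of the artifacts of a particular database.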
** Focus solely on technical aspects VS. focus also on institutional / sociological / legal context **

Many of the ideas inside our community cannot be realized when we solely focus on the technical aspects of Semantic Web technologies. We want to make significant change in the HCLS community happen, e.g., widespread use of structured digital abstracts or better communication between bench and bedside. Some of this work actually has nothing to do with Semantic Web technologies and is therefore outside the scope of the W3C HCLS interest group, so we might need to find other platforms to organize these things. I guess Science Commons (http://sciencecommons.org/) might become even more important for our work than it already is.

--------------

WEB USER INTERFACES

** Flexible but unergonomic VS. inflexible but user friendly **

This is a choice we are facing with any kind of user interface for a Semantic Web application. RDF/OWL is so flexible that it is very hard to create user interfaces that display arbitrary information in an appealing way. Many of the current RDF browsers produce lists of entities and relations that look raw and uninviting. This can be remedied by creating user interfaces that are specialized for certain domains, as we did with 'Entrez Neuron' (current prototype at http://gouda.med.yale.edu:8087/). Striking a balance between user friendliness and flexibility will be one of the most difficult problems we are facing in the development of GUIs.

** User interface ideas that should receive more attention **

- Autocomplete fields / interfaces that motivate re-use of existing entities. Example: the newly started Okkam project (http://www.okkam.org/) is building an extension for Protégé to allow users to find existing entities that they can re-use. The Sindice project (http://sindice.com) provides a fast and scalable index of Semantic Web resources through a simple web API.

- Open query builders with a social component. The Leipzig DBpedia query interface (http://wikipedia.aksw.org/) is a nice prototype. More knowledgeable users can create queries from scratch and share them with others; less knowledgeable users can pick existing queries and make minor modifications to suit their needs. Such a system could make use of social dynamics, e.g., rating of queries to rank the most useful ones first, or profiling of user interests to suggest those queries that cater to the needs of specific user groups. The Leipzig query interface also demonstrates the usefulness of the auto-complete feature.

- Semantic Web Pipes / modularized RDF/OWL data flows. Such systems could be inspired by Yahoo Pipes, and could also have a social component. A prototype of such a system is http://pipes.deri.org/

- Interfaces resembling text editors. Such interfaces could enable a much faster way of creating and querying RDF/OWL compared to current ontology editors. Of course, they need to offer the user assistance in the form of auto-completion, type checking, text formatting etc. I made a prototype of such an interface at: http://neuroscientific.net/leeet/

- Semantic Wikis based on RDF/OWL triple stores. The best example at the moment is OntoWiki (http://ontowiki.net/). Such dedicated Wiki systems should be distinguished from systems that merely add a thin layer of RDF on top of a normal, text-based Wiki system (like Semantic MediaWiki). In my opinion, the latter are not suitable for supporting the creation of large, consistent RDF/OWL knowledge bases.

- Spreadsheets. Spreadsheets are very common tools for data entry and organization in science. Enabling an elegant and meaningful mapping between spreadsheets and biomedical domain ontologies would be an important goal. Again, the goal should not be to describe the structure and content of the spreadsheet in RDF, but rather to describe biomedical reality as directly as possible. http://rdf123.umbc.edu/ seems to be an interesting project in this area.

** User interface ideas that turned out to be impractical and should receive less attention **

- Graphs in almost any form and size.
- Emulating the interface of an ontology editor like Protégé inside the web browser.

--------------

ONTOLOGICAL FOUNDATIONS

** Heterogeneity reduction: unrestricted but heterogeneous VS. restricted but homogeneous (using foundational ontology) **

If we look at the RDF/OWL datasets that are currently part of the 'HCLS demo' we can see that their structures are quite heterogeneous. Each data source is structured in its own unique way, so someone writing a query that spans several data sources needs a deep understanding of each of them to make it work. Restricting ourselves to a shared foundational ontology would make the data sources more homogeneous, at the price of constraining how each source can be modelled.

** Granularity dependent VS. granularity independent **

Granularity-dependent ontologies (such as BFO) force us to index each ontology to a certain granularity (like 'atom', 'molecule', 'cell', 'organism'). Things that are classified as an 'object' at one granularity are classified as an 'aggregate' at another granularity, placing them in disjoint class hierarchies and thereby making integration across scales more difficult. Since such integration across scales is probably one of our major targets, we may want to explore the advantages and disadvantages of ontologies that are granularity independent.

** Dealing with time: 3D VS. widespread reification of relations VS. 4D **

The representation of time (or rather, of how relations between entities change over time) has received relatively little attention so far. Many of the ontologies we are currently using -- including those based on BFO -- take the '3D' perspective: physical objects (e.g. proteins, persons) persist in time and do not have temporal parts. This causes problems when we are dealing with change over time, e.g., when we want to make the statement 'Eve - has hair colour - brown' at one point in time and 'Eve - has hair colour - grey' at another time. I can give examples from the HCLS domain if required. The only way to deal with this in many of our current ontologies would be to index each ontology to a certain time: Eve would have brown hair in one ontology and grey hair in another. However, at the moment it is still quite undecided how such indexing would be practically implemented in RDF/OWL, and how many problems it would cause for our goal of easy and widespread information integration. It is possible that our current ontologies lead us down a road where we will encounter a lot of trouble when we finally need to take care of time. Therefore, we should explore how temporal changes can be represented without some obscure mechanism of ontology indexing.

One possibility would be to reify most of the relations between entities and to attach a temporal index to each relation. However, this would add a lot of unnecessary complexity in cases where we actually do not care about temporal aspects. Another possibility (favored by me at the moment) would be to build 4D ontologies where physical objects can have temporal parts. For example, we could say that 'Eve at age 20' and 'Eve at age 60' are two temporal parts of Eve. The great advantage of this approach is that it keeps our ontologies simple when we do not want to care about temporal aspects. For example, we can simply say 'Eve has hair colour brown' now. When 40 years have passed, and we discover that Eve's hair has turned grey, we can refine our description of Eve by saying that the first Eve we described was merely one temporal part of her, and that there is another temporal part of Eve with grey hair.
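A minimal Turtle sketch of the 4D idea, with all class, property and instance names invented for illustration:

    @prefix ex: <http://example.org/> .

    # 3D style: a single statement that holds 'now' and silently becomes
    # wrong once Eve's hair colour changes.
    ex:Eve ex:hasHairColour ex:brown .

    # 4D style: Eve has temporal parts, and each part carries the
    # properties that hold during its time span.
    ex:Eve-at-age-20 ex:temporalPartOf ex:Eve ;
        ex:hasHairColour ex:brown .
    ex:Eve-at-age-60 ex:temporalPartOf ex:Eve ;
        ex:hasHairColour ex:grey .

Note that the first temporal part can be asserted today without saying anything about time; the second part is simply added later, without invalidating the earlier triples.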
--------------

DOMAIN ONTOLOGIES

** The role of human readable text inside datatype properties **

Google demonstrates that querying unstructured documents might not be perfect, but it can often provide a very quick and intuitive mechanism for finding information. I have the feeling that the Semantic Web community is sometimes so focused on providing structured data/metadata that we forget that the unstructured information kept inside datatype properties is a useful target for mining and querying as well. Finding the right balance between explicit information in RDF triples and implicit information inside the values of datatype properties could turn out to be quite important.

** Class vs. instance/individual **

One should be aware that the distinction between class and individual is not an arbitrary syntactic choice. It should also not be confused with the use of 'class' and 'instance' in object-oriented programming, or the distinction between 'schema' and 'data' in database systems. In almost all ontologies, individuals are things that are located at a particular place and time. In most of our projects, we do not want to make statements about a certain serotonin receptor protein we saw swimming in our Petri dish; rather, we want to be able to make general statements about certain classes of serotonin receptor proteins, which can be shared with and further refined by other participants of the HCLS community.

One problem we have encountered with the extensive use of classes in some ontologies of the HCLS demo is that the underlying RDF graphs become very complicated. This is caused by the representation of OWL class property restrictions in RDF. We should explore ways to lessen this problem, e.g., by creating simpler RDF representations for some OWL constructs.
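To illustrate why the graphs become complicated, compare the RDF serialization of a simple class-level statement with the corresponding instance-level triple (the class and property names are invented; only the owl: and rdfs: terms are standard):

    @prefix owl:  <http://www.w3.org/2002/07/owl#> .
    @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
    @prefix ex:   <http://example.org/> .

    # Class level: "every serotonin receptor binds some serotonin" becomes
    # a blank-node owl:Restriction in the RDF graph.
    ex:SerotoninReceptor rdfs:subClassOf [
        a owl:Restriction ;
        owl:onProperty ex:bindsTo ;
        owl:someValuesFrom ex:Serotonin
    ] .

    # Instance level: the same relation is a single, direct triple.
    ex:receptor-1 ex:bindsTo ex:serotonin-molecule-1 .

Every class-level restriction adds several triples and a blank node, which is what makes the graphs of class-heavy ontologies hard to browse and query.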
** Domain ontologies we need in the near future **

- An ontologically consistent, OBO Foundry-compliant ontology for molecular interactions and pathways. BioPAX-OBO is a new development in that area. Personally, I have also made some first developments in this area (e.g., the 'OBO Essentials' ontology).
- An ontologically consistent, OBO Foundry-compliant ontology for microarray experiments.
- An ontology of proteins and protein structures (e.g. http://proteinontology.info/ ?).

--------------

LIFE SCIENCE AND HEALTH CARE SPECIFIC TOPICS

** Focus on description of experimental procedures, interventions and results VS. focus on description of nature **

Some projects in the HCLS community focus on describing the process of scientific investigation, experimental procedures and their results (e.g. OBI, http://obi.sourceforge.net/), while others focus on describing the objects of these investigations directly. To give a concrete example, we can describe protein expression either by describing a microarray assay ("a cell from organism X was extracted, pixels on the microarray corresponding to gene Y had value Z"), or by describing physiology ("organism X has part cell, gene Y mRNA has location cell, gene Y mRNA has concentration Z"). In my opinion, the consistent description of nature should have a higher priority than the description of experimental procedures. After all, our resources to generate structured data are limited, and we should focus our energies on describing our objects of investigation rather than every detail of our work in the lab.
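The two styles of description from the example above could look roughly like this in Turtle (all property and instance names are invented placeholders, not drawn from OBI or any other existing ontology):

    @prefix ex: <http://example.org/> .

    # Assay-centric: describing the experiment and its readout.
    ex:microarray-assay-1 ex:usedSampleFrom ex:organism-X ;
        ex:hasProbeForGene ex:gene-Y ;
        ex:signalIntensity "Z" .

    # Physiology-centric: describing the biology that the assay measured.
    ex:organism-X ex:hasPart ex:cell-1 .
    ex:gene-Y-mRNA ex:hasLocation ex:cell-1 ;
        ex:hasConcentration "Z" .

The physiology-centric triples can be integrated directly with other descriptions of organism X, without anyone having to understand the assay that produced them.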
--------------

MAKING THE SEMANTIC WEB GROW TOGETHER: IDENTIFIERS, FINDING RESOURCES, TRUST

** Trust: coarse, location based VS. fine-grained, statement based **

I think that rather than implementing the complicated trust metrics described in academic publications in recent years (fine-grained networks of trust, based on RDF), we will probably implement much simpler mechanisms to determine whether a piece of RDF/OWL we encounter on the web is trustworthy or not. Just like on the current web, trust will mostly be based on the location of the RDF/OWL resource, i.e. on the server. Some central websites will bundle resources in central indices, and users will choose between those websites and thereby between different 'perspectives' on the resources of the global Semantic Web.

** Identifiers: huge sameAs services VS. strict enforcement of reuse of existing entities **

We are currently steering towards a Semantic Web with a high degree of redundancy in terms of identifiers / URIs. URIs for things that are essentially the same are being generated at a breathtaking pace, and a mapping between these entities is often not technically feasible (who wants to load a mapping file between Uniprot record URIs minted by Science Commons and those minted by Uniprot itself?). This problem has two causes:

- Technically, it is often quite hard to find existing resources. This needs to be addressed by the creation of services that allow for the quick retrieval of existing resources during ontology creation (this is the goal of http://sindice.com or the OKKAM project).

- Socially, many people are very reluctant to re-use entities that have a URI with a foreign namespace. This problem is still underestimated and has already done a lot of damage to the development of the Semantic Web. The use of PURLs eases this problem a bit, as they are perceived as more neutral ground. Personally, I still believe that the use of completely opaque URIs (like 'urn:uuid:c2f41010-65b3-11d1-a29f-00aa00c14882') might be an interesting option, although this would be against the principles of the 'open linked data' initiative.

Many of these and other questions are addressed in Jonathan's texts about URIs (http://sw.neurocommons.org/2007/uri-note/).

--------------

COMMERCIALISATION STRATEGIES

Of course, Semantic Web technologies can be used both on the public internet and on the intranets of organizations (laboratories, the pharmaceutical industry, healthcare providers). However, I am interested in the possibility of making the public, global Semantic Web commercially useful. We should think about scenarios where the value of the Semantic Web for commercial enterprises is not solely based on using the technologies locally, but also on becoming part of the global HCLS Semantic Web community; donating Semantic Web resources where possible and, at the same time, profiting from the donations of others. An open source data and knowledge economy. The role of a Semantic Web company in such a scenario would not only be to tailor software applications to the specific needs of customers (i.e. HCLS institutions), but also to help customers become 'good citizens' of the global Semantic Web - for their own benefit.

** Revenue from advertisements **

Revenue from targeted advertisements on websites finances large portions of the current public web. Non-governmental institutions that plan to offer information resources on the Semantic Web need to be able to get some revenue from placing advertisements. In some cases, e.g., when the information is not offered through an HTML page but through a SPARQL endpoint, it is currently difficult to place targeted advertisements. It is important for the sustained growth of the public Semantic Web to explore strategies for placing advertisements in such scenarios. Because information and context are much more explicit in Semantic Web resources than on normal web pages, the potential for targeted advertising (similar to Google's AdSense) is huge.

------------------------------
------------------------------

Many of the items are presented as choices between A and B, as I have the impression that such bold distinctions encourage feedback. Of course, most of them are not either-or choices but rather continua where the best solution lies somewhere in between (but not necessarily in the middle). If there is interest in some of these topics, please reply so they can be discussed in more detail. If anyone is interested in extending this unorganized note into a publishable review or some W3C document, I would be happy to participate.

Cheers,
Matthias Samwald

---
About me: http://neuroscientific.net/curriculum