Re: Submitted "Emerging best practices for mapping life sciences data to RDF - a case series" from M. Scott Marshall on 2011-06-10 (public-semweb-lifesci@w3.org from June 2011)

From: M. Scott Marshall <mscottmarshall@gmail.com>
Date: Fri, 10 Jun 2011 14:50:48 +0200
To: linkedlifedatapracticesnote@googlegroups.com, Claus Stie Kallesøe <clausstiekallesoe@gmail.com>, Philip.Ashworth@ucb.com, HCLS <public-semweb-lifesci@w3.org>
Message-ID: <BANLkTi=5Loc8TkZkjQWiw2u6hnbbxqzKEg@mail.gmail.com>
[With Claus's cautious permission I'm CC'ing HCLS. I think that these
questions and the answers to follow are generally valuable.]

Hi Claus,

Thanks very much for your feedback. This is exactly the sort of
feedback that will help us to write a truly useful W3C note.

Just to be clear from the start: Your questions touch on
implementation issues that weren't quite yet in scope for the note. I
think that it is worth considering to what extent we include those
topics: user interfaces built on SPARQL and federation of SPARQL
endpoints.

> I am happy to help on the W3C note, but if it's just going to be copy/paste
> from a paper that I didn't write I am not sure I can add so much value?
> But the reason I didn't participate on the paper was that I didn't feel I
> had the knowledge to write a best practices paper on RDF. Since then I have
> done some work, and used the paper, so at least I have some feedback on the
> paper and how it is to use for a beginner in the field of actually doing
> something and the issues I am having:

You're not mentioning the birth of a new child that happened around
that time. :) User requirements from domain experts such as yourself
are a valuable and essential contribution! And besides: from your
questions, I see that you have accrued a significant amount of
experience. Again, thanks for sharing your observations with us.

> I think Figure 1 is good as it gives a good overview of the steps one
> needs to go through.

Yes, the current plan is (still) to make the steps associated with
Figure 1 the core of the W3C note. Would one of the doc editors please
add that material to the Google Doc at the link below?

https://docs.google.com/document/d/1XzdsjCfPylcyOoNtDfAgz15HwRdCD-0e0ixh21_U0y0/edit?hl=en_US

> But I still have some missing links in my understanding
> in how to actually get from relational data (which I have a lot of) to a web
> front end that takes input from a user, converts the input to a sparql
> query, performs the query across multiple datasources and displays the
> results in a nice format for the user.

We haven't included material on user interface building, i.e.
converting input to a SPARQL query and displaying results in a format
that is nice for the user. It is indeed a confounding factor for
users/developers wanting to use SPARQL. Perhaps we could mention a few
possible approaches. Otherwise, we should declare it out of scope, if
it seems too ambitious. In general, we've been trying to deal with the
questions of setting up the SPARQL access to data that would otherwise
require an additional API to integrate into a given application.

Here are a few related ideas:

Input -> SPARQL query

Mapping the string labels used in a GUI to the identifiers used in a
SPARQL query can be a matter of using the rdfs:label values directly
from the RDF. However, converting "input" to a SPARQL query is not
always straightforward, and I expect that it will remain an area of
research for some years. One example is natural language input that
is mapped to the most likely SPARQL query using Bayesian probability.
However, there are situations where there is a more straightforward
mapping to the query, such as in faceted browsing.
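
As a minimal sketch of the faceted case, the Python below turns a set
of facet selections into a SPARQL query by looking up each GUI label
in a label index. Everything here is an assumption for illustration:
the example.org URIs, the LABEL_TO_URI table (which in practice might
be harvested from rdfs:label triples), and the function name.

```python
# Hypothetical label index, as it might be harvested from rdfs:label triples.
LABEL_TO_URI = {
    "disease": "http://example.org/vocab/disease",
    "gene": "http://example.org/vocab/gene",
    "Alzheimer's disease": "http://example.org/id/alzheimers",
    "APOE": "http://example.org/id/apoe",
}


def facets_to_sparql(selections):
    """Build a SELECT query from {facet label: value label} selections."""
    patterns = []
    # Sort so that the generated query text is deterministic.
    for facet_label, value_label in sorted(selections.items()):
        predicate = LABEL_TO_URI[facet_label]
        value = LABEL_TO_URI[value_label]
        patterns.append(f"  ?item <{predicate}> <{value}> .")
    body = "\n".join(patterns)
    return f"SELECT DISTINCT ?item WHERE {{\n{body}\n}}"


query = facets_to_sparql({"disease": "Alzheimer's disease", "gene": "APOE"})
print(query)
```

Each selected facet becomes one triple pattern on the same ?item
variable, which is why this case is so much simpler than free-text
input: the GUI already constrains what the user can ask.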

SPARQL query results -> formatted for end user consumption

There are a few nice approaches to this. I think first of
spatialization, which I recommended for HCLS KB demo query results. In
work with the HCLS KB, Alan Ruttenberg used a Google Maps coordinate
API to map search results to images. That approach was taken into use
wholesale by BIRN. Spatialization works well when you have an
attribute that can be mapped to a coordinate space. See also the
SIMILE demonstrations (2007?) of mapping to zipcodes on geographical
maps.

I've always said that some people in visualization get their
coordinate system for "free". ;) However, when you don't have a
spatialization other than what a PCA will give you, another approach
is to provide a list with facets/attributes of interest, such as
disease, genes, pathways, etc. This was done with museum data in 2007
in a faceted browser called slash facet ("/facet"). I will let
others fill in relevant examples here.
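
A rough sketch of the facet-list idea, assuming the query results have
already been fetched as a list of rows (the RESULTS data and the
facet_counts helper are made up for illustration):

```python
from collections import Counter

# Hypothetical query results, one dict per result row.
RESULTS = [
    {"item": "study1", "disease": "Alzheimer's disease", "gene": "APOE"},
    {"item": "study2", "disease": "Alzheimer's disease", "gene": "APP"},
    {"item": "study3", "disease": "Parkinson's disease", "gene": "SNCA"},
]


def facet_counts(rows, facets):
    """Count how often each value occurs for each facet of interest."""
    return {
        facet: Counter(row[facet] for row in rows if facet in row)
        for facet in facets
    }


counts = facet_counts(RESULTS, ["disease", "gene"])
for facet, values in counts.items():
    print(facet)
    for value, n in values.most_common():
        print(f"  {value} ({n})")
```

The per-value counts are what a faceted browser typically shows next
to each choice, so the user can see how a selection would narrow the
result list.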

> So I have used D2R to map two of our inhouse datasources. Easy to use, gives
> a good start, front end on D2R server on the mapped data gives an idea about
> what it looks like so one can perform some edits manually by hand. Easily
> ends up being a 1:1 mapping table:class and that might not be the right
> thing. Sort of keeps you in the relational world while trying to go
> semantic.

Perhaps somebody can provide some tips to create better mappings from D2R?

> So still need to work on this part to map the right concepts
> But at least I have two sparql end points via D2R on top of two of our
> oracle databases.
> Next step? Well I can write sparql against each one of them - but then I
> might as well just use sql I think?

And you refer to SWObjects as a hacker's tool? ;)

> So I would like to link them so I can
> somehow do a federated query across both sources at the same time. Makes
> sense, right?
> I have used SWObjects to do that and it works. But that to me is more a
> hacker tool. Maybe not the right way to go if one wants a stable, scalable
> solution where we can send all kinds of sparql queries?

I'm curious if you've followed the tutorial here:
http://tinyurl.com/swobjects-swat4ls , well, actually here:
http://www.w3.org/2010/Talks/1208-egp-swobjects/

[Note to Eric - you don't link out to the tutorial from the wiki yet!]
http://sourceforge.net/apps/mediawiki/swobjects/index.php?title=Main_Page

SWObjects doesn't have all the bells and whistles of D2R and thus
requires thorough knowledge of the target queries (in either SPARQL or
SQL) as well as the desired mapping - so you have to decide on the
desired semantics in one go. This makes it much more complicated to
use than D2R (this is probably what you mean by hacker's tool). So,
for a mapping to a relational database, you must know: your desired
target SQL query and how you want it to look in SPARQL in order to
create the SPARQL Construct(s). However, you've already pointed out
the problem with automatic map generation above: you end up with 1:1
mapping table:class, with no semantics, where you've essentially
postponed the problem of the above mapping choice.
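
To make the difference concrete, here is a small Python sketch that
contrasts the two outcomes for a single row. It is not D2R or
SWObjects code; the row, the example.org URIs and the two lookup
tables are all hypothetical, and the "mapped" version simply stands
for a hand-made decision about which shared terms to use.

```python
# A single row as it might come out of a relational table (hypothetical data).
ROW = {"table": "COMPOUND", "id": 42, "MOL_WT": 180.16}

BASE = "http://example.org/db/"  # hypothetical base URI


def naive_triples(row):
    """1:1 mapping: the table becomes the class, columns become predicates."""
    subject = f"{BASE}{row['table']}/{row['id']}"
    triples = [(subject, "rdf:type", f"{BASE}{row['table']}")]
    for column, value in row.items():
        if column not in ("table", "id"):
            triples.append((subject, f"{BASE}{row['table']}#{column}", value))
    return triples


# Hypothetical hand-written choices: which shared terms the data should use.
SHARED_CLASS = {"COMPOUND": "http://example.org/vocab/ChemicalCompound"}
SHARED_PROPERTY = {
    ("COMPOUND", "MOL_WT"): "http://example.org/vocab/molecularWeight",
}


def mapped_triples(row):
    """Same row, expressed with the shared class and property terms."""
    table = row["table"]
    subject = f"{BASE}{table}/{row['id']}"
    triples = [(subject, "rdf:type", SHARED_CLASS[table])]
    for column, value in row.items():
        prop = SHARED_PROPERTY.get((table, column))
        if prop is not None:
            triples.append((subject, prop, value))
    return triples


naive = naive_triples(ROW)
mapped = mapped_triples(ROW)
for triple in naive + mapped:
    print(triple)
```

The naive version needs no decisions but only restates the schema;
the mapped version requires someone to fill in the two lookup tables,
which is exactly the mapping choice that cannot be postponed forever.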

One nice thing about SWObjects is that once you've expressed your
mapping rules as SPARQL Constructs, the query federation is
automatically done for you, with your SPARQL query being decomposed,
mapped and dispatched to the appropriate GRAPH services.
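
The general idea of decomposing a query and dispatching the parts can
be illustrated with a toy Python sketch. This is not how SWObjects is
implemented: the in-memory SERVICES, the ROUTES table and all function
names are assumptions, triple patterns are plain tuples, and each
pattern must have a constant predicate so that it can be routed.

```python
# Two in-memory "endpoints" standing in for GRAPH services (hypothetical data).
SERVICES = {
    "compounds": [("c1", "name", "aspirin"), ("c2", "name", "ibuprofen")],
    "assays": [("c1", "activeAgainst", "COX1"), ("c2", "activeAgainst", "COX2")],
}

# Which service answers which predicate: the knowledge a mapping rule encodes.
ROUTES = {"name": "compounds", "activeAgainst": "assays"}


def is_var(term):
    """Variables are written SPARQL-style, with a leading question mark."""
    return term.startswith("?")


def match(pattern, triple, binding):
    """Extend binding so that pattern matches triple, or return None."""
    result = dict(binding)
    for p, t in zip(pattern, triple):
        if is_var(p):
            # Bind the variable, or check it against its existing binding.
            if result.setdefault(p, t) != t:
                return None
        elif p != t:
            return None
    return result


def federated_query(patterns):
    """Send each pattern to its service and join on shared variables."""
    bindings = [{}]
    for pattern in patterns:
        service = SERVICES[ROUTES[pattern[1]]]
        joined = []
        for binding in bindings:
            for triple in service:
                extended = match(pattern, triple, binding)
                if extended is not None:
                    joined.append(extended)
        bindings = joined
    return bindings


results = federated_query([("?c", "name", "?n"), ("?c", "activeAgainst", "COX2")])
print(results)
```

The caller writes one query over both sources; only the ROUTES table
knows where each part is answered.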

Scalability: I consider scalability to refer to federation, in which
case SWObjects is nicely scalable. The engine is written in C++, so it
should be fast (with hopefully no memory leaks!). Of course, feature
requests should come out of new work with SWObjects. If we refer to
scaling up as the process of setting up a federation across more than
a handful of endpoints, that could be an issue. I would like to see a
DBVisualizer-style interface built that can generate the SPARQL
Constructs more easily than the current approach, which demands that
the SPARQL be written by hand. I think that such an interface would
make SWObjects a lot more useful.

Stability: If it has crashed, would you please issue a bug report to
the SWObjects mailing list? Otherwise, the biggest gap that we are
attempting to deal with at the moment is Oracle drivers. Eric doesn't
have the bandwidth to write the drivers himself and we are still
hoping that Oracle will help us write the drivers in order to make
SWObjects a viable choice for some of their interested clients
(ongoing..). I believe that another point for improvement that has
been noted is better integration with Apache instead of the current
HTTP service, which was thrown together in a few hours. Volunteers?

> My mate Phil from UCB has mapped their internal data sources via D2R mapping
> and then done some integration work by making a dataset ontology via VoID
> (linksets) and a concept ontology via SKOS (narrowmatch between general
> concepts and the classes in the different sources).
> In order to do the same I first need to find an ontology tool, understand
> VoID and SKOS and understand how to use these things correctly together.
> That is not a quick and simple thing - at least for me! A lot of
> questions/unknowns here for beginners.
> But I tried and have (maybe) an ontology that describes some Lundbeck
> concepts and how they are linked to classes in some of our datasources. Then
> what?

I am CC'ing Philip Ashworth so that he can answer you once he's rested
up from SemTech in San Francisco, California, where he presented his
approach to federation.

> Then I need a tool that can use my new "linking ontology" to create a
> common/federated sparql end point so my web app can go there to ask
> questions, right? What tool would that be?
> Think Phil from UCB uses a tool from topbraid that I do not have yet. So
> maybe it's easy and straightforward if I get that?

I understood from Phil Brooks's tutorial at the EBI SemWeb Industry
workshop that D2R is nicely integrated into TopBraid Composer although
I haven't tried it myself. Yes, TopBraid is widely touted and they
have a free version by the way.

> I then discovered Silk and was thinking that maybe that would be able to help
> me link my two data sources? When I read about silk it seems like that's what
> it can do? But never got started until Anja et al launched their workbench.
> So I now have a link description made by/in Silk workbench. Fairly easy and
> straightforward to do. And then what?

Looks like a good question for Anja.

> As I asked yesterday at the call and linked to the above situation: I now
> have a link description that knows about my datasources and how/where they
> link. Now I again need some tool that can use that to display a sparql end
> point for me to point my searches. Or am I completely off here? My thinking
> is that with a nice link description like that "some tool" must be able to
> find data in the right places - if not what is the point of Silk?

I'll leave that for Anja as well.

I should mention though, that if you've created the SPARQL Construct
mappings for SWObjects, you *started* with the knowledge of where your
query would be answered. You actually include named graph references
(GRAPH services) in the SPARQL Construct (again: see tutorial). Once
those mappings are created, you issue your SPARQL query as if all the
data were in one place and using your own terms/URIs. BTW, yet another
approach to automatic resource discovery worth considering (not
covered in the emerging practices paper) is SADI / SHARE.

> So that's where I am now. I am sorry if I have offended anyone on the way -
> that surely isn't my intention. I just wanted to show you, the academic
> experts, what a relational centric pharma informatics person goes through in
> order to get going with the semantic technologies and linked data. And I
> hope it could be of use when we write a W3C note that should help others
> getting started?

Very useful! No offence taken by anyone for honest questions, I'm
sure. I hope that we can help to get you back on track shortly as well
as pave the way for the next person.

Cheers,
Scott

> On 8 June 2011 16:32, M. Scott Marshall <mscottmarshall@gmail.com> wrote:
>>
>> I haven't received an answer yet but the excerpts from the copyright
>> URL below state fairly clearly that we are within our rights to use
>> the same material in the W3C note.
>>
>> -Scott
>>
>> ---------- Forwarded message ----------
>> From: M. Scott Marshall <mscottmarshall@gmail.com>
>> Date: Wed, Jun 8, 2011 at 10:56 AM
>> Subject: Re: Submitted "Emerging best practices for mapping life
>> sciences data to RDF - a case series"
>> To: k.s.schlobach@vu.nl
>>
>>
>> Dear Stefan,
>>
>> As part of the current HCLS charter, we plan to create a W3C note on
>> the same topic as the submitted article in the next few months. It
>> will be a 'derived work', based on overlapping material. I looked at
>> the journal policies about such things and it seems to be allowed.
>>
>>
>> From http://www.elsevier.com/wps/find/authorshome.authors/copyright#rights :
>>
>> * the right to post a pre-print version of the journal article on
>> Internet web sites including electronic pre-print servers, and to
>> retain indefinitely such version on such servers or sites for
>> scholarly purposes* (with some exceptions such as The Lancet and Cell
>> Press. See also our information on electronic preprints for a more
>> detailed discussion on these points)*;
>>
>> * the right to prepare other derivative works, to extend the journal
>> article into book-length form, or to otherwise re-use portions or
>> excerpts in other works, with full acknowledgement of its original
>> publication in the journal.
>>
>> Please let me know if you think this would pose a problem. My
>> expectation is that the W3C note will be accessed by a different
>> audience and, although relatively obscure, could act as an
>> advertisement for the journal article (whose formatting would appeal
>> to more readers) if we refer to the anticipated publication in the W3C
>> note.
>>
>> -Scott
>>
Received on Friday, 10 June 2011 12:51:16 UTC