Re: Submitted "Emerging best practices for mapping life sciences data to RDF - a case series"

Hi Scott,

and thank you for the answers. I see from your answers that some of my
comments/questions (my semantics!) weren't clear, so I will try to
clarify below.

On 10 June 2011 14:50, M. Scott Marshall <mscottmarshall@gmail.com> wrote:

> [With Claus's cautious permission I'm CC'ing HCLS. I think that these
> questions and the answers to follow are generally valuable.]
>

No problem - as I said, someone needs to ask the stupid questions in
public. And that someone is me...

>
> Hi Claus,
>
> Thanks very much for your feedback. This is exactly the sort of
> feedback that will help us to write a truly useful W3C note.
>
> Just to be clear from the start: Your questions touch on
> implementation issues that weren't quite yet in scope for the note. I
> think that it is worth considering to what extent we include those
> topics: user interfaces built on SPARQL and federation of SPARQL
> endpoints.
>

OK, actually the user interface is not so much an issue. I have the
GUI, and I have programmers who today take user input, convert it to
SQL and display the result (we use ExtJS on top of RoR). I am sure they
can convert to SPARQL as well and display what is coming back, either
in RDF or by way of JSON, as we already do today.
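
To make it concrete: for a user asking "show me the assay results for
compound X", the generated query might look something like this (the
ex: vocabulary is invented for illustration; only rdfs:label is a
standard term):

    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    PREFIX ex:   <http://example.org/lundbeck/vocab#>

    SELECT ?compound ?assay ?value
    WHERE {
      ?compound rdfs:label "LU-0001" .    # the label typed by the user
      ?compound ex:testedIn ?assay .
      ?assay    ex:resultValue ?value .
    }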

My concern is more the "web server" (for lack of a better word) that
will host the one single SPARQL endpoint where we will send the
question and that will handle the federation.


>
> > I am happy to help on the W3C note, but if it's just going to be
> > copy/paste from a paper that I didn't write, I am not sure I can add
> > so much value? But the reason I didn't participate on the paper was
> > that I didn't feel I had the knowledge to write a best practices
> > paper on RDF. Since then I have done some work, and used the paper,
> > so at least I have some feedback on the paper and how it is to use
> > for a beginner in the field of actually doing something, and the
> > issues I am having:
>
> You're not mentioning the birth of a new child that happened around
> that time. :)

True, but I think the lack of knowledge was more important here. But
thank you for trying to save me ;-)


> User requirements from domain experts such as yourself
> are a valuable and essential contribution! And besides: from your
> questions, I see that you have accrued a significant amount of
> experience. Again, thanks for sharing your observations with us.
>

Yes, I understand, and that's also why I decided to write it, even
though some of it is just based on my ignorance and lack of time to
read everything out there! But it's likely others are in the same
situation, and the point of the HCLS group is to help others like me
cross the barrier to entry here. And then I expect I'll get a lot of
good and valuable answers.

>
> > I think the Figure 1 is good as it gives a good overview over the
> > steps one needs to go through.
>
> Yes, the current plan is (still) to make the steps associated with
> Figure 1 the core of the W3C note. Would one of the doc editors please
> add that material to the Google Doc at the link below?
>
>
> https://docs.google.com/document/d/1XzdsjCfPylcyOoNtDfAgz15HwRdCD-0e0ixh21_U0y0/edit?hl=en_US
>
> > But I still have some missing links in my understanding of how to
> > actually get from relational data (which I have a lot of) to a web
> > front end that takes input from a user, converts the input to a
> > SPARQL query, performs the query across multiple data sources and
> > displays the results in a nice format for the user.
>
> We haven't included material on user interface building, i.e.
> converting input to a SPARQL query and displaying results in a format
> that is nice for the user.


As said, this is of less importance to me. I actually think we can
deal with that.


> It is indeed a confounding factor for
> users/developers wanting to use SPARQL. Perhaps we could mention a few
> possible approaches. Otherwise, we should declare it out of scope, if
> it seems too ambitious.


I think we can agree it's out of scope.


> In general, we've been trying to deal with the
> questions of setting up the SPARQL access to data that would otherwise
> require an additional API to integrate into a given application.
>
> Here are a few related ideas:
>
> Input -> SPARQL query
>
> Mapping the string labels used in a GUI to the identifiers used in a
> SPARQL query can be a matter of using rdfs:labels directly from the
> RDF. However, converting "input" to a SPARQL query is not always
> straightforward and I expect will remain an area of research for some
> years. Example: Natural language input that is mapped to the best
> SPARQL query using Bayesian probability. However, there are situations
> where there is a more straightforward mapping to the query, such as in
> faceted browsing.
>
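
That rdfs:label route is roughly what I had in mind. Just to check my
understanding, the label-to-identifier lookup would be a query of this
shape (the label value is of course made up)?

    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

    # Find the resource(s) whose label matches what the user typed.
    SELECT ?resource
    WHERE {
      ?resource rdfs:label ?label .
      FILTER regex(str(?label), "^imipramine$", "i")
    }
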
> SPARQL query results -> formatted for end user consumption
>
> There are a few nice approaches to this. I think first of
> spatialization, which I recommended for HCLS KB demo query results. In
> work with the HCLS KB, Alan Ruttenberg used a Google Maps coordinate
> API to map search results to images. That approach was taken into use
> wholesale by BIRN. Spatialization works well when you have an
> attribute that can be mapped to a coordinate space. See also the
> SIMILE demonstrations (2007?) of mapping to zipcodes on geographical
> maps.
>
> I've always said that some people in visualization get their
> coordinate system for "free". ;) However, when you don't have a
> spatialization other than what a PCA will give you, another approach
> is to provide a list with facets/attributes of interest, such as
> disease, genes, pathways, etc., as was done in a faceted browser
> called slash facet ("/facet") in 2007 with museum data. I will let
> others fill in relevant examples here.
>
> > So I have used D2R to map two of our in-house data sources. Easy to
> > use, gives a good start, and the D2R server front end on the mapped
> > data gives an idea about what it looks like, so one can perform
> > some edits manually by hand. It easily ends up being a 1:1
> > table:class mapping, and that might not be the right thing. Sort of
> > keeps you in the relational world while trying to go semantic.
>
> Perhaps somebody can provide some tips to create better mappings from D2R?
>

Again, my mate Phil (who is now on CC) has already helped me here - by
telling me that I would likely end up there with my
relational/medicinal chemistry background!

Editing the mappings is easy. It's more a matter of getting our minds
out of tables and seeing the real "concepts". So I actually think it's
a matter of getting used to this and getting my head around it. But if
someone has some good advice, then I am naturally ready to listen.
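
For what it's worth, the kind of manual edit I mean is replacing the
generated table-class with a real concept, roughly like this in the
D2RQ mapping language (table, column and vocabulary names invented):

    @prefix map:  <#> .
    @prefix d2rq: <http://www.wiwiss.fu-berlin.de/suhl/bizer/D2RQ/0.1#> .
    @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
    @prefix ex:   <http://example.org/lundbeck/vocab#> .

    # Map the COMPOUND table to a domain concept, not a "table class".
    # (map:database is the d2rq:Database defined elsewhere in the file.)
    map:Compound a d2rq:ClassMap ;
        d2rq:dataStorage map:database ;
        d2rq:uriPattern "compound/@@COMPOUND.ID@@" ;
        d2rq:class ex:Compound .

    # Expose the NAME column as a human-readable label.
    map:compoundLabel a d2rq:PropertyBridge ;
        d2rq:belongsToClassMap map:Compound ;
        d2rq:property rdfs:label ;
        d2rq:column "COMPOUND.NAME" .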


>
> > So I still need to work on this part to map the right concepts.
> > But at least I have two SPARQL endpoints via D2R on top of two of
> > our Oracle databases.
> > Next step? Well, I can write SPARQL against each one of them - but
> > then I might as well just use SQL, I think?
>
> And you refer to SWObjects as a hacker's tool? ;)
>

Well, not in the sense that it's hard to use. Maybe more in the sense
that I am not sure it's ready to run in full production with 500 users
running SPARQL queries across multiple data sources? It seems more
like a very nice tool to test/explore datasets.

But I might naturally be wrong here?

>
> > So I would like to link them so I can somehow do a federated query
> > across both sources at the same time. Makes sense, right?
> > I have used SWObjects to do that and it works. But that to me is
> > more a hacker tool. Maybe not the right way to go if one wants a
> > stable, scalable solution where we can send all kinds of SPARQL
> > queries?
>
> I'm curious if you've followed the tutorial here:
> http://tinyurl.com/swobjects-swat4ls , well, actually here:
> http://www.w3.org/2010/Talks/1208-egp-swobjects/
>
>
Yes, a bit at least! I will be honest and say that Helena Deus showed
it to me as we were discussing another matter.



> [Note to Eric - you don't link out to the tutorial from the wiki yet!]
> http://sourceforge.net/apps/mediawiki/swobjects/index.php?title=Main_Page
>
> SWObjects doesn't have all the bells and whistles of D2R and thus
> requires thorough knowledge of the target queries (in either SPARQL or
> SQL) as well as the desired mapping - so you have to decide on the
> desired semantics in one go. This makes it much more complicated to
> use than D2R (this is probably what you mean by hacker's tool). So,
> for a mapping to a relational database, you must know: your desired
> target SQL query and how you want it to look in SPARQL in order to
> create the SPARQL Construct(s).
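
Just so I am sure I understand: such a mapping rule is a plain SPARQL
CONSTRUCT of roughly this shape, with the GRAPH naming the endpoint
that should answer it (all URIs here are invented)?

    PREFIX ex:  <http://example.org/lundbeck/vocab#>
    PREFIX src: <http://example.org/d2r/compounds#>

    # Rewrite source (D2R-shaped) triples into our target vocabulary.
    CONSTRUCT { ?c ex:measuredActivity ?act }
    WHERE {
      GRAPH <http://localhost:2020/sparql> {   # the D2R endpoint
        ?c src:compound_activity ?act .
      }
    }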


And I aim for a GUI where scientists can ask all sorts of questions
about our data (and experience shows that they do, if possible!), and
we then need to turn those into SPARQL and execute them. So I don't
know the query and the needed construct before the user enters their
question. Then I will build the SPARQL automatically.

So I would think that one needs to build a new federated graph based
on the underlying data and hold that somewhere, somehow.



> However, you've already pointed out
> the problem with automatic map generation above: you end up with 1:1
> mapping table:class, with no semantics, where you've essentially
> postponed the problem of the above mapping choice.
>
> One nice thing about SWObjects is that once you've expressed your
> mapping rules as SPARQL Constructs, the query federation is
> automatically done for you, with your SPARQL query being decomposed,
> mapped and dispatched to the appropriate GRAPH services.
>
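
That part sounds attractive: once the Constructs are in place, the
user-facing query can stay in our own terms, with no endpoint URLs in
sight, something like (terms invented as above):

    PREFIX ex: <http://example.org/lundbeck/vocab#>

    # One query over our own vocabulary; the mapping rules decide
    # which endpoint answers which triple pattern.
    SELECT ?compound ?target ?activity
    WHERE {
      ?compound ex:measuredActivity ?activity ;
                ex:hitsTarget ?target .
    }
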
> Scalability: I consider scalability to refer to federation, in which
> case SWObjects is nicely scalable. The engine is written in C++, so it
> should be fast (with hopefully no memory leaks!). Of course, feature
> requests should come out of new work with SWObjects. If we refer to
> scaling up as the process of setting up a federation across more than
> a handful of endpoints, that could be an issue. I would like to see a
> DBVisualizer-style interface built that can generate the SPARQL
> Constructs more easily than the current approach demanding manual
> SPARQL writing. I think that such an interface would make SWObjects a
> lot more useful.
>

Yes, sounds like it.

>
> Stability: If it has crashed, would you please issue a bug report to
> the SWObjects mailing list? Otherwise, the biggest gap that we are
> attempting to deal with at the moment is Oracle drivers. Eric doesn't
> have the bandwidth to write the drivers himself and we are still
> hoping that Oracle will help us write the drivers in order to make
> SWObjects a viable choice for some of their interested clients
> (ongoing..). I believe that another point for improvement that has
> been noted is better integration with Apache instead of the current
> http service, thrown together in a few hours. Volunteers?
>
> > My mate Phil from UCB has mapped their internal data sources via
> > D2R mapping and then done some integration work by making a dataset
> > ontology via VoID (linksets) and a concept ontology via SKOS
> > (narrowMatch between general concepts and the classes in the
> > different sources).
> > In order to do the same I first need to find an ontology tool,
> > understand VoID and SKOS and understand how to use these things
> > correctly together. That is not a quick and simple thing - at least
> > for me! A lot of questions/unknowns here for beginners.
> > But I tried and have (maybe) an ontology that describes some
> > Lundbeck concepts and how they are linked to classes in some of our
> > data sources. Then what?
>
> I am CC'ing Philip Ashworth so that he can answer you once he's rested
> up from SemTech in San Francisco, California, where he presented his
> approach to federation.
>

Well, as said, I have talked to Phil a lot (and have likely used all
my credits here ;-)), so I sort of know what he does.

But of more general interest here, I guess, is the learning curve to
enter "ontology development". What is SKOS? What is VoID? Do I need
both? Is owl:sameAs better than skos:narrowMatch, or is it the other
way around, or is it for something completely different?

Naturally this is just a matter of reading! But there is a lot to
read to get started, and for a beginner it's not 100% clear what to
use when - also after the reading.
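
To make my confusion concrete: as I currently understand it, the two
live at different levels - VoID describes the datasets and the
linksets between them, while a SKOS match or owl:sameAs is the
predicate used inside the links. Roughly like this in Turtle (dataset
names invented)?

    @prefix void: <http://rdfs.org/ns/void#> .
    @prefix skos: <http://www.w3.org/2004/02/skos/core#> .
    @prefix :     <http://example.org/lundbeck/datasets#> .

    :compounds a void:Dataset .
    :assays    a void:Dataset .

    # A VoID linkset: which predicate connects which two datasets.
    :compounds2assays a void:Linkset ;
        void:subjectsTarget :compounds ;
        void:objectsTarget  :assays ;
        void:linkPredicate  skos:narrowMatch .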


>
> > Then I need a tool that can use my new "linking ontology" to
> > create a common/federated SPARQL endpoint so my web app can go
> > there to ask questions, right? What tool would that be?
> > Think Phil from UCB uses a tool from TopBraid that I do not have
> > yet. So maybe it's easy and straightforward if I get that?
>
> I understood from Phil Brooks's tutorial at the EBI SemWeb Industry
> workshop that D2R is nicely integrated into TopBraid Composer although
> I haven't tried it myself. Yes, TopBraid is widely touted and they
> have a free version by the way.
>

Yes, but as far as I know only the top bells-and-whistles version of
TopBraid has the SPARQLMotion etc. stuff, as well as the publishing
tool, which might be what I am missing. So, as mentioned in the first
mail, it might be fairly straightforward if I get the top version of
TopBraid in-house?

But does that mean that there is nothing else out there yet that can
provide the "integration layer" from D2R-mapped SPARQL endpoints to a
federated query service? I would just be surprised, given all the
other nice tools available.



>
> > I then discovered Silk and was thinking that maybe that would be
> > able to help me link my two data sources? When I read about Silk
> > it seems like that's what it can do? But I never got started until
> > Anja et al. launched their workbench. So I now have a link
> > description made by/in the Silk workbench. Fairly easy and
> > straightforward to do. And then what?
>
> Looks like a good question for Anja.
>
> > As I asked yesterday at the call, and linked to the above
> > situation: I now have a link description that knows about my data
> > sources and how/where they link. Now I again need some tool that
> > can use that to expose a SPARQL endpoint for me to point my
> > searches at. Or am I completely off here? My thinking is that with
> > a nice link description like that, "some tool" must be able to
> > find data in the right places - if not, what is the point of Silk?
>

Here I should not have written "what is the point of Silk". More: what
is the point of the output - the link description? It's cool that I
can use Silk to run over my sources, find the real, actual links, and
even keep them updated (via a Silk server, if I understood Anja
correctly), but who/what uses the link description? Isn't it a
perfect start of a new federated graph? Anja?
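
For example, if the Silk output is a set of owl:sameAs (or
skos:narrowMatch) links, I would imagine loading it as its own named
graph and letting queries hop across, something like this (graph
names invented):

    PREFIX owl: <http://www.w3.org/2002/07/owl#>
    PREFIX ex:  <http://example.org/lundbeck/vocab#>

    SELECT ?localCompound ?externalActivity
    WHERE {
      GRAPH <http://example.org/graphs/silk-links> {
        ?localCompound owl:sameAs ?externalCompound .   # Silk output
      }
      GRAPH <http://example.org/graphs/external-source> {
        ?externalCompound ex:measuredActivity ?externalActivity .
      }
    }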

>
> I'll leave that for Anja as well.
>
> I should mention though, that if you've created the SPARQL Construct
> mappings for SWObjects, you *started* with the knowledge of where your
> query would be answered. You actually include named graph references
> (GRAPH services) in the SPARQL Construct (again: see tutorial). Once
> those mappings are created, you issue your SPARQL query as if all the
> data were in one place and using your own terms/URIs. BTW, yet another
> approach to automatic resource discovery worth considering (not
> covered in the emerging practices paper) is SADI / SHARE.
>

Thanks, I didn't know about them. More reading ;-)


>
> > So that's where I am now. I am sorry if I have offended anyone on
> > the way - that surely isn't my intention. I just wanted to show
> > you, the academic experts, what a relational-centric pharma
> > informatics person goes through in order to get going with the
> > semantic technologies and linked data. And I hope it could be of
> > use when we write a W3C note that should help others getting
> > started?
>
> Very useful! No offence taken by anyone for honest questions, I'm
> sure. I hope that we can help to get you back on track shortly as well
> as pave the way for the next person.
>

I hope it can help the next guy. I should maybe just hire a TopBraid
consultant!

claus




>
> Cheers,
> Scott
>
> > On 8 June 2011 16:32, M. Scott Marshall <mscottmarshall@gmail.com> wrote:
> >>
> >> I haven't received an answer yet but the excerpts from the copyright
> >> URL below state fairly clearly that we are within our rights to use
> >> the same material in the W3C note.
> >>
> >> -Scott
> >>
> >> ---------- Forwarded message ----------
> >> From: M. Scott Marshall <mscottmarshall@gmail.com>
> >> Date: Wed, Jun 8, 2011 at 10:56 AM
> >> Subject: Re: Submitted "Emerging best practices for mapping life
> >> sciences data to RDF - a case series"
> >> To: k.s.schlobach@vu.nl
> >>
> >>
> >> Dear Stefan,
> >>
> >> As part of the current HCLS charter, we plan to create a W3C note on
> >> the same topic as the submitted article in the next few months. It
> >> will be a 'derived work', based on overlapping material. I looked at
> >> the journal policies about such things and it seems to be allowed.
> >>
> >>
> >> From http://www.elsevier.com/wps/find/authorshome.authors/copyright#rights :
> >>
> >> * the right to post a pre-print version of the journal article on
> >> Internet web sites including electronic pre-print servers, and to
> >> retain indefinitely such version on such servers or sites for
> >> scholarly purposes* (with some exceptions such as The Lancet and Cell
> >> Press. See also our information on electronic preprints for a more
> >> detailed discussion on these points)*;
> >>
> >> * the right to prepare other derivative works, to extend the journal
> >> article into book-length form, or to otherwise re-use portions or
> >> excerpts in other works, with full acknowledgement of its original
> >> publication in the journal.
> >>
> >> Please let me know if you think this would pose a problem. My
> >> expectation is that the W3C note will be accessed by a different
> >> audience and, although relatively obscure, could act as an
> >> advertisement for the journal article (whose formatting would appeal
> >> to more readers) if we refer to the anticipated publication in the W3C
> >> note.
> >>
> >> -Scott
> >>
>

Received on Tuesday, 14 June 2011 00:44:15 UTC