Re: making statements on the semantic web from Marc-Alexandre Nolin on 2007-08-23 (public-semweb-lifesci@w3.org from August 2007)

From: Marc-Alexandre Nolin <lotus@ieee.org>
Date: Thu, 23 Aug 2007 13:47:45 -0400
To: Michel_Dumontier <Michel_Dumontier@carleton.ca>
Cc: "public-semweb-lifesci hcls" <public-semweb-lifesci@w3.org>
Message-ID: <d6a9bb0d0708231047o2a05c992k89b11f58dd1e5036@mail.gmail.com>
Hi,

<For a bit about how to install, see the end of this message, after
the dashed line.>

About the identifier: the essential and shortest way to identify the
entity you propose is uniprot:p26838, but unlike
http://bio2rdf.org/uniprot:P26838,
http://beta.uniprot.org/uniprot/P26838.rdf and
urn:lsid:uniprot.org:uniprot:P26838 (if you have an LSID resolver),
it does not resolve anywhere. This is why it is not listed with
owl:sameAs but as an identifier with dc:identifier. We used
dc:identifier because it fits the general definition and we didn't
want to create another predicate when one already exists in a
recognized ontology. But this is not set in stone, and I'm open to
discussion.
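
As a minimal sketch of that distinction (Python and the N-Triples
output are only my choice for illustration, and the describe() helper
is hypothetical, not part of Bio2RDF): the resolvable URIs are linked
with owl:sameAs, while the short identifier is attached as a plain
literal with dc:identifier.

```python
# Illustration only: describe() is a hypothetical helper, not Bio2RDF code.
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"
DC_IDENTIFIER = "http://purl.org/dc/elements/1.1/identifier"


def describe(subject, same_as, identifier):
    """Return N-Triples lines for one entity.

    Resolvable URIs become owl:sameAs objects; the short identifier,
    which resolves nowhere, becomes a dc:identifier literal.
    """
    lines = [f"<{subject}> <{OWL_SAME_AS}> <{uri}> ." for uri in same_as]
    lines.append(f'<{subject}> <{DC_IDENTIFIER}> "{identifier}" .')
    return lines


triples = describe(
    "http://bio2rdf.org/uniprot:P26838",
    [
        "http://beta.uniprot.org/uniprot/P26838.rdf",
        "urn:lsid:uniprot.org:uniprot:P26838",
    ],
    "uniprot:p26838",
)
print("\n".join(triples))
```

The identifier ends up as a quoted string rather than as a URI in
angle brackets, which is exactly why it is not a candidate for
owl:sameAs.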

The identifier is really the only part of a URI that matters. Suppose
that tomorrow, for whatever reason, Uniprot closes their website, but
server X still has a copy of the Uniprot database and provides it on
the net. If you still want information about the identifier
uniprot:p26838, just ask http://X.com/uniprot:p26838. Granted, the URI
won't be http://beta.uniprot.org/uniprot/P26838.rdf anymore, but it
will be X's view of the identifier uniprot:p26838. Likewise, when we
produce the RDF at http://bio2rdf.org/uniprot:P26838, it isn't the
document at http://beta.uniprot.org/uniprot/P26838.rdf anymore. It is
Bio2rdf.org's view of uniprot:p26838. The documents differ in two
ways: the URIs are transformed to be compatible with Bio2RDF, and
references to the original documents are added, which, of course,
aren't in the Uniprot document.

You often talk about wanting to make a statement about a resource.
Nothing forbids you from doing it right now. You have a domain name:
make a URI with it, using the dc:identifier as part of it
(http://dumontierlab.com/uniprot:p26838), add the statements you want
to make, and link to the original document.
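
A rough sketch of that idea, assuming the identifier is simply the
last path segment of the URI (mint_uri() and identifier_of() are
hypothetical helper names, and the case-insensitive comparison is my
assumption, since both p26838 and P26838 appear in this thread):

```python
def mint_uri(base, identifier):
    """Build a URI in your own namespace from a namespace:id identifier."""
    return base.rstrip("/") + "/" + identifier


def identifier_of(uri):
    """Recover the namespace:id part, taken as the last path segment."""
    return uri.rsplit("/", 1)[-1]


mine = mint_uri("http://dumontierlab.com", "uniprot:p26838")
theirs = "http://bio2rdf.org/uniprot:P26838"

# Different hosts, same identifier: both URIs describe the same entity,
# each from the point of view of whoever serves it.
same_entity = identifier_of(mine).lower() == identifier_of(theirs).lower()
print(mine, same_entity)
```

Only the host changes from one publisher to the next; the identifier
part is what lets the two descriptions be matched up afterwards.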

The present list of non-web resources that we maintain on Bio2RDF is
done manually; we are two researchers here and we add them one at a
time.

There is a service that provides links and reverse links to entities
on the semantic web. It's named Sindice (http://sindice.com). However,
I don't know whether they have the capacity to index the amount of
data in health care and life sciences.

-------------------------------------------------

There have been modifications on the Bio2RDF server that are not yet
available in the SourceForge package. As soon as I have some time to
myself, those will be made available too.

Bio2RDF is essentially a servlet, so you need to have a working Tomcat
server available. When the servlet is installed, it will be at
http://<your server>:<your port, if not 80>/bio2rdf. To have an exact
replica of the main Bio2RDF server, you can place the content of the
WAR file in webapps/ROOT. The WAR file contains a modified Sesame
1.26, Elmo, a directory of rdfizers and a urlRewrite filter. Put your
computer on a non-routable address and you can name it bio2rdf.org
internally.

The stable URI interface comes from the urlRewrite filter, whatever
the place or the application that provides me the RDF file (be it an
LSID resolver, a JSP program, a Perl program, etc.).

As an example, here is the resolution rule for Uniprot. Both queries
can be made to the server, but only the first one gives you a stable
URI.

<rule>
  <from>^/uniprot/(.*)</from>
  <to>/uniprot-uri2rdf.jsp?ns=uniprot&amp;id=$1</to>
</rule>

This is transparent and can be changed to whatever I want. When a data
provider follows the same rules as you and you no longer need to adapt
its data, the urlrewrite rule can be changed to

<rule>
  <from>^/uniprot/(.*)</from>
  <to>http://purl.uniprot.org/uniprot:$1</to>
</rule>
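
To illustrate the effect of the two rules above, here is a small
Python imitation using the re module. It is only a sketch of the
pattern substitution, not the urlRewrite filter itself (which runs
inside Tomcat): the rewrite() helper and the rule constants are
hypothetical names, $1 becomes \1 in Python syntax, and the XML
entity &amp; is written as a plain &.

```python
import re

# (pattern, replacement) pairs mirroring the two <rule> elements above.
LOCAL_RULE = (r"^/uniprot/(.*)", r"/uniprot-uri2rdf.jsp?ns=uniprot&id=\1")
REMOTE_RULE = (r"^/uniprot/(.*)", r"http://purl.uniprot.org/uniprot:\1")


def rewrite(path, rule):
    """Apply one from/to rule to a request path."""
    pattern, target = rule
    return re.sub(pattern, target, path)


# The stable path stays the same; only the place that answers changes.
local = rewrite("/uniprot/P26838", LOCAL_RULE)
remote = rewrite("/uniprot/P26838", REMOTE_RULE)
print(local)
print(remote)
```

The client keeps asking for the same stable path in both cases, which
is the point of putting the rule in front of the data source.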

So, you can adapt the urlrewrite rules to take your data from wherever
you want. If you want me to explain the installation of Bio2RDF
further, I can do so off list, since it is not really related to the
URI discussion.

Bye !!

Marc-Alexandre

2007/8/20, Michel_Dumontier <Michel_Dumontier@carleton.ca>:
> Thanks Marc-Alexandre,
>
> Bio2RDF is certainly making a valuable contribution by RDFizing
> documents and making them available via a publicly accessible server.
> How do we set it up locally - I couldn't figure it out from navigating
> the website?
>
> Bio2RDF assigns a URI in its own namespace for self-resolution of
> imported data, and it adds "owl:sameAs" predicates to link to other
> URIs, such as those of RDF data providers like Uniprot (although it
> does point to the beta version of the site http://beta.uniprot.org/),
> thus supporting the advocated "follow your nose", while accounting for
> redundancy.
> However, it links the putative global unique identifier as defined in
> the Banff Manifesto using the predicate
> "http://purl.org/dc/elements/1.1/identifier", rather than asserting it
> as a resource in its own right and made equivalent with owl:sameAs. I
> don't understand this distinction. Is "uniprot:p26838" not equivalent to
> " http://bio2rdf.org/uniprot:P26838" or "
> http://beta.uniprot.org/uniprot/P26838.rdf" or "
> urn:lsid:uniprot.org:uniprot:P26838"???
>
> As more and more people create their own URIs (and they apparently
> will), the lack of an authoritative global identifier will challenge
> data integration efforts. The use of (a global identifier with) sameAs
> predicates facilitates data integration and supports multiple resolution
> mechanisms, whether HTTP URL or LSID or other, such that agents can be
> programmed to discover something about these, if there is something to
> discover. Triple store resolution of owl:sameAs will also be necessary
> (if not already supported). This, at least to me, is a solution that
> doesn't require zealous belief in one method or another, but allows all
> of them to coexist, peacefully, and according to need.
>
> The bigger problem is: how do we discover all the places that are making
> statements about these non-web resources?  While Bio2RDF lists a few
> equivalent resources, will it maintain this list manually? Perhaps more
> valuable is whether we entice Google to index our public triple stores,
> telling us where triples containing "uniprot:p26838" exist, thereby
> enabling distributed queries. I've brought this up a few times now, and
> I'd very much like to hear what people think...
>
> -=Michel=-
>
>
> > -----Original Message-----
> > From: manolin@gmail.com [mailto:manolin@gmail.com] On Behalf Of Marc-
> > Alexandre Nolin
> > Sent: Monday, August 20, 2007 3:11 PM
> > To: Michel_Dumontier
> > Cc: gregtyrelle@phalanxbiotech.com; public-semweb-lifesci hcls
> > Subject: Re: making statements on the semantic web
> >
> > Hi,
> >
> > Sorry, I missed a conversation involving Bio2RDF on the mailing list,
> > but I was on vacation for the last 2 weeks.
> >
> > You can see Bio2RDF as a proxy for "LINKED" RDF documents. That is,
> > some data providers, like Uniprot, give access to an RDF
> > representation of their data, and we are very thankful to them; this
> > makes our job much easier. Other data providers, maybe because of
> > their scale (small lab with few informatics resources) or lack of
> > interest, don't provide an RDF version of their data.
> >
> > One problem comes when an RDF document links out to an identifier of
> > another data provider which doesn't have anything that answers to this
> > URI. Sometimes the linked-out data provider does have a URI, but it is
> > not well written (case error, HTTP URL error, no LSID resolver to
> > resolve it, etc.). Again, in this case, the URI won't connect and the
> > documents won't be linked together.
> >
> > When we integrate a new data source into Bio2RDF, we make sure that
> > any URI that we create follows a simple set of rules that we put in
> > place. Then, automagically, they all connect.
> >
> > There are two ways we transform data: live from the sources or in
> > batch. In either case, we keep the references to the original
> > unmodified documents, because we do modify them for the data to link
> > together. The reasons we may do the RDFization in batch instead of
> > live from the sources are the scale of the original database, limits
> > on the number of requests per minute (NCBI) or the formats available
> > from the sources.
> >
> > Just to add to a comment of Matthias: it's quite certain that
> > Bio2RDF.org doesn't have the reputation of PURL.org. We also don't
> > provide the same services that PURL provides, but we are open to
> > discussion. If someone wants to add to the RDFizers we already have,
> > we welcome them. All of our RDFizers are available as open source.
> > You can install a Bio2RDF.org system in your local private
> > infrastructure, name the computer Bio2RDF.org locally, and it will
> > also work if you fear that we may go down. Many Bio2RDF.org instances
> > may exist and be chained together by the urlRewrite filter if the
> > local one can't answer the query. As for going down, if people start
> > to use us enough, it will be a happy problem; money won't be that
> > much of a problem in this case. :)
> >
> > Also, we currently use our own set of rules for the definition of
> > our URIs, since none exists currently (see http://bio2rdf.org/wiki
> > for the rules). Note that the set of rules is in a wiki; if someone
> > wants to add comments or request a change of representation, it can
> > be added there. We even provide an LSID resolver option, because the
> > discussion on the subject is not closed. And when a URI
> > recommendation comes from Jonathan Rees, we will implement it as
> > well. I won't lie to anybody: I would prefer to go with HTTP URIs
> > instead of LSIDs, because I would prefer to work without the extra
> > resolver layer above the DNS, but if the community really pushes for
> > it, I will go for it. I assure you, we try to be as neutral as we can
> > be :)
> >
> > In the end, what is the probability that Purl.org or Bio2RDF.org
> > will be there in 5 years, 10 years, 20 years from now? Granted, one
> > has a greater probability of still being there, but the probability
> > of not being there is also not zero.
> >
> > If all data providers already provided RDF documents with URIs all
> > written with the same set of rules so that they connect, Bio2RDF
> > would not need to exist, nor HCLS for that matter. We provide a
> > service for something that is not currently there: linked data in
> > bioinformatics. We are like a proof of concept to say to the world
> > "Hey, if you just write your URIs with a simple set of rules, it
> > works, it will all connect together; the HCLS demo also shows that".
> > But we are more than a proof of concept: we are already in use in our
> > research center. We create custom knowledge bases of linked data for
> > research groups here.
> >
> > Bye !!
> >
> > Marc-Alexandre
> >
> > 2007/8/9, Michel_Dumontier <Michel_Dumontier@carleton.ca>:
> > >
> > > Hi Greg,
> > >
> > > Yes, we want to make statements about genes/proteins, not html pages.
> > > For instance, the genomic feature identified by S000001855, or the
> > > protein identified by YHR023W. What are the URIs of these?
> > >
> > > > A centralized registry, PURL schemes etc. have been suggested, and
> > > > they will *potentially* solve this problem,
> > >
> > > Ooooh - do tell!
> > >
> > > > The zen moment is, you are an authority, just not the authority. In
> > > > which case it doesn't matter. Create URIs in your own namespace for
> > > > whatever non-information resources you want, proteins, genes etc. and
> > > > worry about the data integration problem after the fact. After all RDF
> > > > itself does not do data integration, it just facilitates data
> > > > integration. If your URI identifiers contain SGD gene names or other
> > > > database identifiers, then direct identifier mapping should be
> > > > feasible. If not various smushing [1] techniques could be employed.
> > > >
> > >
> > > Yes - this is the crux of the problem. Data integration has
> > > traditionally been done by "mapping" one identifier to another. I've
> > > been doing that for 7 years now, and it's getting harder and harder with
> > > the increase in the type and quantity of data, as well as the increase
> > > in third party annotation on existing resources. I've got tables with
> > > hundreds of millions of rows to map identifiers for identical resources.
> > > I believe this to be 1) unproductive, 2) costly and 3) unnecessary.
> > >
> > > The semantic web framework provides the capability to trivially
> > > integrate data with named resources.  That is a major benefit of the
> > > technology, which we would be aptly sidestepping by each of us minting
> > > our own identifiers and then doing the n^2/n mappings to finally
> > > integrate all the data. It is significantly more feasible and cost
> > > effective to set up a global registry and have it enforced via journals
> > > (as has been successfully done with sequence data). One registry, one
> > > global identifier... it's really simple, authoritative, and makes
> > > everybody's life about a hundred million times easier.
> > >
> > > -=Michel=-
> > >
> > > Michel Dumontier
> > > Assistant Professor of Bioinformatics
> > >
> > > Department of Biology, School of Computer Science, Institute of
> > > Biochemistry
> > > Carleton University
> > >
> > > Member of the Ottawa Institute of Systems Biology
> > > Member of the Ottawa-Carleton Institute for Biomedical Engineering
> > >
> > > Office: 4610 Carleton Technology and Training Center
> > > Mailing: 209 Nesbitt, 1125 Colonel By Drive, Ottawa, ON K1S5B6
> > > Tel:  +1 (613) 520-2600 x4194
> > > Fax:  +1 (613) 520-3539
> > > Web:  http://dumontierlab.com
> > > Skype: micheldumontier
> > >
> > > > -----Original Message-----
> > > > From: greg.tyrelle@gmail.com [mailto:greg.tyrelle@gmail.com] On Behalf
> > > > Of Greg Tyrelle
> > > > Sent: Thursday, August 09, 2007 6:42 AM
> > > > To: Michel_Dumontier
> > > > Cc: public-semweb-lifesci hcls
> > > > Subject: Re: making statements on the semantic web
> > > >
> > > > On 8/7/07, Michel_Dumontier <Michel_Dumontier@carleton.ca> wrote:
> > > > >   So a key concern for me is how I, as a user of public resources,
> > > > > should make statements about them on the semantic web. While certain
> > > > > data providers might already be providing RDF/OWL data with some URI,
> > > > > what about those that have yet to do this? How should I reference a
> > > > > public resource provided by the SGD [1] or candidadb [2]? Moreover,
> > > > > what about the ~1000 databases [3] with valuable content, much of it
> > > > > locked away in relational databases or flat files? How do I make
> > > > > statements about these resources, without taking the responsibility
> > > > > of serving it up in my own namespace [4], which might ultimately not
> > > > > integrate with content from another 3rd party content provider?
> > > >
> > > > Do you want to make statements about the HTML representation of the
> > > > database records in SGD ? I will assume this is not the case as these
> > > > records already have URL identifiers. Or do you want to make
> > > > statements about yeast proteins/genes, where SGD is likely to be the
> > > > authority for providing stable identifiers for said proteins/genes ?
> > > >
> > > > If it is the second case, and if I understand you correctly, then your
> > > > problem is that currently SGD does not provide stable URIs for yeast
> > > > genes (non-information resources, not database records), but
> > > > nonetheless you want to make statements about these non-information
> > > > resources now, without creating further data integration hassles by
> > > > minting your own identifiers for these non-information resources which
> > > > will ultimately be equivalent to the identifiers provided by SGD, if
> > > > and when they do start providing these stable identifiers ?
> > > >
> > > > >   In line with my previous comments about the value of the semantic
> > > > > web for data integration, it would be of great value to have data
> > > > > providers _register_ the namespace of their resources. In fact,
> > > > > coupling the NAR database issue with base URI registration would open
> > > > > up entirely new worlds for data integration. Do you think this is
> > > > > worthwhile or feasible? What other approaches might be considered to
> > > > > alleviate this problem?
> > > >
> > > > A centralized registry, PURL schemes etc. have been suggested, and
> > > > they will *potentially* solve this problem, but they don't help a
> > > > yeast biologist make statements about the yeast protein GCN4,
> > > > right now. Which stable URI should you use for that protein if one
> > > > doesn't already exist and you're not the authority ? You don't want to
> > > > wait for one to be made available...
> > > >
> > > > The zen moment is, you are an authority, just not the authority. In
> > > > which case it doesn't matter. Create URIs in your own namespace for
> > > > whatever non-information resources you want, proteins, genes etc. and
> > > > worry about the data integration problem after the fact. After all RDF
> > > > itself does not do data integration, it just facilitates data
> > > > integration. If your URI identifiers contain SGD gene names or other
> > > > database identifiers, then direct identifier mapping should be
> > > > feasible. If not various smushing [1] techniques could be employed.
> > > >
> > > > _greg
> > > >
> > > > [1] http://esw.w3.org/topic/RdfSmushing
> > > >
> > > > --
> > > > Greg Tyrelle, Ph.D.
> > > > Bioinformatics Department
> > > > Phalanx Biotech Group, Inc.
> > > > Hsinchu, Taiwan
> > > > Tel: 886-3-5781168 Ext.504
> > >
> > >
>
Received on Thursday, 23 August 2007 17:47:52 UTC