Re: making statements on the semantic web from Marc-Alexandre Nolin on 2007-08-20 (public-semweb-lifesci@w3.org from August 2007)

From: Marc-Alexandre Nolin <lotus@ieee.org>
Date: Mon, 20 Aug 2007 15:11:27 -0400
To: Michel_Dumontier <Michel_Dumontier@carleton.ca>
Cc: gregtyrelle@phalanxbiotech.com, "public-semweb-lifesci hcls" <public-semweb-lifesci@w3.org>
Message-ID: <d6a9bb0d0708201211m77c94bd6tbc7ea07f523b1a66@mail.gmail.com>
Hi,

Sorry, I missed a conversation involving Bio2RDF on the mailing list;
I was on vacation for the last 2 weeks.

You can see Bio2RDF as a proxy for "LINKED" RDF documents. That is,
some data providers give access to an RDF representation of their
data, like Uniprot, and we are very thankful to them; this makes our
job much easier. Other data providers, maybe because of their scale
(a small lab with few informatics resources) or lack of interest,
don't provide an RDF version of their data.

One problem comes when an RDF document links out to an identifier of
another data provider that doesn't have anything that answers at this
URI. Sometimes the linked-out data provider does have a URI, but it is
not well written (case error, HTTP URL error, no LSID resolver to
resolve it, etc.). Again, in this case, the URI won't connect and the
documents won't be linked together.

When we integrate a new data source into Bio2RDF, we make sure that
any URI that we create follows a simple set of rules that we put in
place. Then, automagically, they connect, all of them.
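
To illustrate why consistently built URIs connect, here is a minimal
sketch in Python. It is not the actual Bio2RDF rule set (that is on the
wiki); the "namespace:identifier" pattern, the lowercasing rule, and
the function name make_uri are assumptions made for this example only.

```python
# Hypothetical normalization rule, for illustration only; the actual
# Bio2RDF rules are documented at http://bio2rdf.org/wiki.
BASE = "http://bio2rdf.org/"  # assumed base for minted URIs


def make_uri(namespace: str, identifier: str, base: str = BASE) -> str:
    """Build one canonical URI from a (namespace, identifier) pair."""
    ns = namespace.strip().lower()  # assumed rule: namespace is case-insensitive
    ident = identifier.strip()      # identifier case is kept as the provider wrote it
    if not ns or not ident:
        raise ValueError("namespace and identifier must both be non-empty")
    return f"{base}{ns}:{ident}"


# Two documents that spell the same reference differently end up
# pointing at the same URI, so they link together.
a = make_uri("SGD", "S000001855")
b = make_uri(" sgd ", "S000001855 ")
print(a == b)
```

The point is only that every RDFizer goes through one function, so a
case error or stray whitespace in a source cannot produce a second,
disconnected URI for the same record.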

There are two ways we transform data: live from the sources or in
batch. In either case, we keep the references to the original
unmodified documents, because we do modify them for the data to link
together. The reason we may do the RDFization in batch instead of live
from the sources is the scale of the original database, a limit on the
number of requests per minute (NCBI), or the formats available from
the sources.

Just to add to a comment of Matthias: it's quite certain that
Bio2RDF.org doesn't have the reputation of PURL.org. We also don't
provide the same services that PURL provides, but we are open to
discussion. If someone wants to add to the RDFizers we already have,
we welcome them. All of our RDFizers are available as open source. You
can install a Bio2RDF.org system locally in your own private
infrastructure, name the computer Bio2RDF.org locally, and it will
also work if you fear that we may go down. Many Bio2RDF.org instances
may exist and be chained together by the urlRewrite filter if the
local one can't answer the query. As for going down, if people start
to use us enough, it will be a happy problem; money won't be that much
of a problem in that case. :)
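
The chaining idea can be sketched as a simple fallback loop. This is
not the urlRewrite filter configuration itself, only an illustration
of the behaviour in Python; the resolver functions, the dictionaries
standing in for a local and a remote instance, and the example URI and
triple are all hypothetical.

```python
from typing import Callable, Dict, Optional, Sequence

# A resolver takes a URI and returns an RDF document as text,
# or None when that instance cannot answer.
Resolver = Callable[[str], Optional[str]]


def dict_resolver(store: Dict[str, str]) -> Resolver:
    """Wrap an in-memory store as a resolver (stand-in for one instance)."""
    def resolve(uri: str) -> Optional[str]:
        return store.get(uri)
    return resolve


def resolve_chain(uri: str, resolvers: Sequence[Resolver]) -> Optional[str]:
    """Ask each instance in order; return the first answer found."""
    for resolve in resolvers:
        doc = resolve(uri)
        if doc is not None:
            return doc
    return None  # no instance in the chain could answer


# Hypothetical data: the local instance is empty, the next one has the record.
local = dict_resolver({})
upstream = dict_resolver({"http://bio2rdf.org/sgd:S000001855": "<rdf/>"})

print(resolve_chain("http://bio2rdf.org/sgd:S000001855", [local, upstream]))
```

Because every instance answers under the same URIs, a client does not
need to know which one in the chain finally served the document.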

Also, we currently use our own set of rules for the definition of our
URIs, since no common scheme exists yet (see http://bio2rdf.org/wiki
for the rules). Note that the set of rules is in a wiki; if someone
wants to add comments or request a change of representation, it can be
added there. We even provide an LSID resolver option, because this is
not a closed discussion. And when a URI recommendation comes from
Jonathan Rees, we will implement it as well. I won't lie to anybody: I
would prefer to go with HTTP URIs instead of LSIDs, because I would
prefer to work without the extra resolver layer above the DNS, but if
the community really pushes for it, I will go for it. I assure you, we
try to be as neutral as we can be :)

In the end, what are the probabilities that Purl.org or Bio2RDF.org
will be there in 5 years, 10 years, 20 years from now? Granted, one
has a greater probability of still being there, but its probability of
not being there is also not zero.

If all data providers already provided RDF documents with URIs all
written with the same set of rules for them to connect, Bio2RDF would
not need to exist, nor HCLS for that matter. We provide a service for
something that is not currently there: linked data in bioinformatics.
We are like a proof of concept to say to the world "Hey, if you just
write your URIs with a simple set of rules, it works, it will all
connect together; the HCLS demo also shows that". But we are more than
a proof of concept: we are already in use in our research center. We
create custom knowledge bases of linked data for research groups here.

Bye !!

Marc-Alexandre

2007/8/9, Michel_Dumontier <Michel_Dumontier@carleton.ca>:
>
> Hi Greg,
>
>  Yes, we want to make statements about genes/proteins, not html pages.
> For instance, the genomic feature identified by S000001855, or the
> protein identified by YHR023W. What are the URI of these?
>
> > A centralized registry, PURL schemes etc. have been suggested, and
> > they will *potentially* solve this problem,
>
> Ooooh - do tell!
>
> > The zen moment is, you are an authority, just not the authority. In
> > which case it doesn't matter. Create URIs in your own namespace for
> > whatever non-information resources you want, proteins, genes etc. and
> > worry about the data integration problem after the fact. After all RDF
> > itself does not do data integration, it just facilitates data
> > integration. If your URI identifiers contain SGD gene names or other
> > database identifiers, then direct identifier mapping should be
> > feasible. If not various smushing [1] techniques could be employed.
> >
>
> Yes - this is the crux of the problem. Data integration has
> traditionally been done by "mapping" one identifier to another. I've
> been doing that for 7 years now, and it's getting harder and harder with
> the increase in the type and quantity of data, as well as the increase
> in third party annotation on existing resources. I've got tables with
> hundreds of millions of rows to map identifiers for identical resources.
> I believe this to be 1) unproductive, 2) costly and 3) unnecessary.
>
> The semantic web framework provides the capability to trivially
> integrate data with named resources.  That is a major benefit of the
> technology, which we would be aptly sidestepping by each of us minting
> our own identifiers and then doing the n^2/n mappings to finally
> integrate all the data. It is significantly more feasible and cost
> effective to setup a global registry and have it enforced via journals
> (as has been successfully done with sequence data). One registry, one
> global identifier... it's really simple, authoritative, and makes
> everybody's life about a hundred million times easier.
>
> -=Michel=-
>
> Michel Dumontier
> Assistant Professor of Bioinformatics
>
> Department of Biology, School of Computer Science, Institute of
> Biochemistry
> Carleton University
>
> Member of the Ottawa Institute of Systems Biology
> Member of the Ottawa-Carleton Institute for Biomedical Engineering
>
> Office: 4610 Carleton Technology and Training Center
> Mailing: 209 Nesbitt, 1125 Colonel By Drive, Ottawa, ON K1S5B6
> Tel:  +1 (613) 520-2600 x4194
> Fax:  +1 (613) 520-3539
> Web:  http://dumontierlab.com
> Skype: micheldumontier
>
> > -----Original Message-----
> > From: greg.tyrelle@gmail.com [mailto:greg.tyrelle@gmail.com] On Behalf Of
> > Greg Tyrelle
> > Sent: Thursday, August 09, 2007 6:42 AM
> > To: Michel_Dumontier
> > Cc: public-semweb-lifesci hcls
> > Subject: Re: making statements on the semantic web
> >
> > On 8/7/07, Michel_Dumontier <Michel_Dumontier@carleton.ca> wrote:
> > >   So a key concern for me is how I, as a user of public resources,
> > > should make statements about them on the semantic web. While certain
> > > data providers might already be providing RDF/OWL data with some URI,
> > > what about those that have yet to do this? How should I reference a
> > > public resource provided by the SGD [1] or candidadb [2]? Moreover,
> > > what about the ~1000 databases [3] with valuable content, much of it
> > > locked away in relational databases or flat files? How do I make
> > > statements about these resources, without taking the responsibility
> > > of serving it up in my own namespace [4], which might ultimately not
> > > integrate with content from another 3rd party content provider?
> >
> > Do you want to make statements about the HTML representation of the
> > database records in SGD ? I will assume this is not the case as these
> > records already have URL identifiers. Or do you want to make
> > statements about yeast proteins/genes, where SGD is likely to be the
> > authority for providing stable identifiers for said proteins/genes ?
> >
> > If it is the second case, and if I understand you correctly, then your
> > problem is that currently SGD does not provide stable URIs for yeast
> > genes (non-information resources, not database records), but
> > nonetheless you want to make statements about these non-information
> > resources now, without creating further data integration hassles by
> > minting your own identifiers for these non-information resources which
> > will ultimately be equivalent to the identifiers provided by SGD, if
> > and when they do start providing these stable identifiers ?
> >
> > >   In line with my previous comments about the value of the semantic
> > > web for data integration, it would be of great value to have data
> > > providers _register_ the namespace of their resources. In fact,
> > > coupling the NAR database issue with base URI registration would open
> > > up entirely new worlds for data integration. Do you think this is
> > > worthwhile or feasible? What other approaches might be considered to
> > > alleviate this problem?
> >
> > A centralized registry, PURL schemes etc. have been suggested, and
> > they will *potentially* solve this problem, but they don't help a
> > yeast biologist from making statements about the yeast protein GCN4,
> > right now. Which stable URI should you use for that protein if one
> > doesn't already exist and you're not the authority ? You don't want to
> > wait for one to be made available...
> >
> > The zen moment is, you are an authority, just not the authority. In
> > which case it doesn't matter. Create URIs in your own namespace for
> > whatever non-information resources you want, proteins, genes etc. and
> > worry about the data integration problem after the fact. After all RDF
> > itself does not do data integration, it just facilitates data
> > integration. If your URI identifiers contain SGD gene names or other
> > database identifiers, then direct identifier mapping should be
> > feasible. If not various smushing [1] techniques could be employed.
> >
> > _greg
> >
> > [1] http://esw.w3.org/topic/RdfSmushing
> >
> > --
> > Greg Tyrelle, Ph.D.
> > Bioinformatics Department
> > Phalanx Biotech Group, Inc.
> > Hsinchu, Taiwan
> > Tel: 886-3-5781168 Ext.504
>
>
Received on Monday, 20 August 2007 19:11:33 UTC