RE: making statements on the semantic web from Michel_Dumontier on 2007-08-09 (public-semweb-lifesci@w3.org from August 2007)

From: Michel_Dumontier <Michel_Dumontier@carleton.ca>
Date: Thu, 09 Aug 2007 09:54:20 -0400
To: gregtyrelle@phalanxbiotech.com
Cc: public-semweb-lifesci hcls <public-semweb-lifesci@w3.org>
Message-id: <AB349814F1ECB143A5D4CD29C7A645690192DCBA@CCSEXB10.CUNET.CARLETON.CA>
Hi Greg,

Yes, we want to make statements about genes/proteins, not html pages.
For instance, the genomic feature identified by S000001855, or the
protein identified by YHR023W. What are the URIs of these?

> A centralized registry, PURL schemes etc. have been suggested, and
> they will *potentially* solve this problem,

Ooooh - do tell!

> The zen moment is, you are an authority, just not the authority. In
> which case it doesn't matter. Create URIs in your own namespace for
> whatever non-information resources you want, proteins, genes etc. and
> worry about the data integration problem after the fact. After all RDF
> itself does not do data integration, it just facilitates data
> integration. If your URI identifiers contain SGD gene names or other
> database identifiers, then direct identifier mapping should be
> feasible. If not, various smushing [1] techniques could be employed.
>

Yes - this is the crux of the problem. Data integration has
traditionally been done by "mapping" one identifier to another. I've
been doing that for 7 years now, and it's getting harder and harder with
the increase in the type and quantity of data, as well as the increase
in third party annotation on existing resources. I've got tables with
hundreds of millions of rows to map identifiers for identical resources.
I believe this to be 1) unproductive, 2) costly and 3) unnecessary. 
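
To make the "mapping" step concrete, here is a minimal sketch of what
direct identifier mapping between two self-minted namespaces might look
like. Both namespaces (lab-a.example.org, lab-b.example.org) and the
helper names are hypothetical, and the sketch assumes each URI ends in
the source database identifier; it is an illustration, not how any
existing resource publishes its data.

```python
# Hypothetical namespaces: neither is a real, published URI scheme.
LAB_A = "http://lab-a.example.org/gene/"
LAB_B = "http://lab-b.example.org/feature/"


def local_id(uri):
    """Return the trailing path segment, assumed to be the database identifier."""
    return uri.rstrip("/").rsplit("/", 1)[-1]


def build_mapping(uris_a, uris_b):
    """Pair URIs from two namespaces that embed the same local identifier."""
    by_id = {local_id(u): u for u in uris_b}
    return {u: by_id[local_id(u)] for u in uris_a if local_id(u) in by_id}


# Identifiers taken from the examples earlier in this message.
uris_a = [LAB_A + "S000001855", LAB_A + "YHR023W"]
uris_b = [LAB_B + "S000001855"]

# Only the identifier present in both namespaces can be mapped.
mapping = build_mapping(uris_a, uris_b)
```

Every additional namespace needs its own table of this kind, and anything
whose URI does not embed a shared identifier falls through unmapped.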

The semantic web framework provides the capability to trivially
integrate data with named resources.  That is a major benefit of the
technology, which we would be aptly sidestepping by each of us minting
our own identifiers and then doing on the order of n^2 pairwise mappings
to finally integrate all the data. It is significantly more feasible and
cost effective to set up a global registry and have it enforced via
journals (as has been successfully done with sequence data). One
registry, one global identifier... it's really simple, authoritative,
and makes everybody's life about a hundred million times easier.
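
As a rough illustration of the argument, the sketch below compares the
number of mapping tables needed with and without a registry, and shows
that statements sharing one subject URI merge by plain set union. The
registry URI, the "ex:" predicates and the function names are all
hypothetical; triples are modelled as simple Python tuples rather than
with any RDF library.

```python
def pairwise_mappings(n):
    """Mapping tables needed if each of n providers mints its own identifiers."""
    return n * (n - 1) // 2


def registry_mappings(n):
    """Mapping tables needed if each of n providers maps once to a shared registry."""
    return n


# Hypothetical registry namespace; no such registry is implied to exist.
REGISTRY = "http://registry.example.org/"
gene = REGISTRY + "S000001855"

# Two providers make statements about the same registered URI.
source_1 = {(gene, "ex:hasAnnotation", "annotation from source 1")}
source_2 = {(gene, "ex:hasAnnotation", "annotation from source 2")}

# With a shared identifier, integration is just the union of the statements.
merged = source_1 | source_2
subjects = {s for s, _, _ in merged}
```

With ten providers, for example, pairwise_mappings(10) gives 45 tables
against registry_mappings(10) giving 10, and the gap widens as n grows.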

-=Michel=-
 
Michel Dumontier
Assistant Professor of Bioinformatics
 
Department of Biology, School of Computer Science, Institute of
Biochemistry 
Carleton University 

Member of the Ottawa Institute of Systems Biology 
Member of the Ottawa-Carleton Institute for Biomedical Engineering
 
Office: 4610 Carleton Technology and Training Center
Mailing: 209 Nesbitt, 1125 Colonel By Drive, Ottawa, ON K1S5B6
Tel:  +1 (613) 520-2600 x4194
Fax:  +1 (613) 520-3539
Web:  http://dumontierlab.com
Skype: micheldumontier

> -----Original Message-----
> From: greg.tyrelle@gmail.com [mailto:greg.tyrelle@gmail.com] On Behalf Of
> Greg Tyrelle
> Sent: Thursday, August 09, 2007 6:42 AM
> To: Michel_Dumontier
> Cc: public-semweb-lifesci hcls
> Subject: Re: making statements on the semantic web
> 
> On 8/7/07, Michel_Dumontier <Michel_Dumontier@carleton.ca> wrote:
> >   So a key concern for me is how I, as a user of public resources,
> > should make statements about them on the semantic web. While certain
> > data providers might already be providing RDF/OWL data with some URI,
> > what about those that have yet to do this? How should I reference a
> > public resource provided by the SGD [1] or candidadb [2]? Moreover,
> > what about the ~1000 databases [3] with valuable content, much of it
> > locked away in relational databases or flat files? How do I make
> > statements about these resources, without taking the responsibility
> > of serving it up in my own namespace [4], which might ultimately not
> > integrate with content from another 3rd party content provider?
> 
> Do you want to make statements about the HTML representation of the
> database records in SGD ? I will assume this is not the case as these
> records already have URL identifiers. Or do you want to make
> statements about yeast proteins/genes, where SGD is likely to be the
> authority for providing stable identifiers for said proteins/genes ?
> 
> If it is the second case, and if I understand you correctly, then your
> problem is that currently SGD does not provide stable URIs for yeast
> genes (non-information resources, not database records), but
> nonetheless you want to make statements about these non-information
> resources now, without creating further data integration hassles by
> minting your own identifiers for these non-information resources which
> will ultimately be equivalent to the identifiers provided by SGD, if
> and when they do start providing these stable identifiers ?
> 
> >   In line with my previous comments about the value of the semantic
> > web for data integration, it would be of great value to have data
> > providers _register_ the namespace of their resources. In fact,
> > coupling the NAR database issue with base URI registration would open
> > up entirely new worlds for data integration. Do you think this is
> > worthwhile or feasible? What other approaches might be considered to
> > alleviate this problem?
> 
> A centralized registry, PURL schemes etc. have been suggested, and
> they will *potentially* solve this problem, but they don't help a
> yeast biologist make statements about the yeast protein GCN4
> right now. Which stable URI should you use for that protein if one
> doesn't already exist and you're not the authority ? You don't want to
> wait for one to be made available...
> 
> The zen moment is, you are an authority, just not the authority. In
> which case it doesn't matter. Create URIs in your own namespace for
> whatever non-information resources you want, proteins, genes etc. and
> worry about the data integration problem after the fact. After all RDF
> itself does not do data integration, it just facilitates data
> integration. If your URI identifiers contain SGD gene names or other
> database identifiers, then direct identifier mapping should be
> feasible. If not, various smushing [1] techniques could be employed.
> 
> _greg
> 
> [1] http://esw.w3.org/topic/RdfSmushing
> 
> --
> Greg Tyrelle, Ph.D.
> Bioinformatics Department
> Phalanx Biotech Group, Inc.
> Hsinchu, Taiwan
> Tel: 886-3-5781168 Ext.504
Received on Thursday, 9 August 2007 13:54:42 UTC