Re: what would change for me? from Marc-Alexandre Nolin on 2007-11-01 (public-semweb-lifesci@w3.org from November 2007)

From: Marc-Alexandre Nolin <lotus@ieee.org>
Date: Thu, 1 Nov 2007 02:31:58 -0400
To: public-semweb-lifesci@w3.org
Cc: "Jonathan Rees" <jar@creativecommons.org>
Message-ID: <d6a9bb0d0710312331k686f2991pfd369df828c8c4cc@mail.gmail.com>
Hi,

The following are my comments on the TNS draft at
http://sw.neurocommons.org/2007/uri-note/ and on the "Major remaining
trouble spots" from http://esw.w3.org/topic/HCLSIG_BioRDF_Subgroup/Tasks/URI_Best_Practices/Recommendations

To begin with, on the question about "Attitude Toward Nonlocators" in
the major remaining trouble spots, my answer is that HTTP is OK. I use
HTTP identifiers in Bio2RDF.org the same way Purl.org does, with a
REST-like interface (http://purl.org/commons/xml/pmid/PM15548600 or
http://bio2rdf.org/xml/pubmed:15548600). Also, many public ontologies
like RDF and OWL are HTTP-based and we can already handle them. If we
are to choose a string of characters to be a URI that identifies an
item in the life sciences, I just find it logical to get the method of
retrieval at the same time as I get the identifier.

Another major point is about Racine Sharing with the #. I strongly
discourage this practice for big knowledge bases. It is only usable
with a small number of instances. Take PubChem for example: if we use
Racine Sharing, a URI would look like
http://view.ncbi.nlm.nih.gov/pccompound#id. The problem is that there
are 17 million ids that take about 32 Gb of gzipped XML. Since every
id shares the same document before the #, the retrieval would be
awfully long.
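
To make the retrieval problem concrete, here is a small Python sketch
(not part of the draft, only an illustration). It relies on the fact
that an HTTP client drops everything after the # before sending the
request. The two ids used (2244 and 5793) are arbitrary placeholders,
and request_target is a helper name I made up for the example.

```python
from urllib.parse import urldefrag


def request_target(uri: str) -> str:
    """Return the part of a URI that an HTTP client sends to the server.

    The fragment (everything after the #) stays on the client side.
    """
    racine, _fragment = urldefrag(uri)
    return racine


# Two different items named with the # style (placeholder ids).
hash_uris = [
    "http://view.ncbi.nlm.nih.gov/pccompound#2244",
    "http://view.ncbi.nlm.nih.gov/pccompound#5793",
]

# Both collapse to one and the same document to download.
hash_targets = {request_target(u) for u in hash_uris}
print(hash_targets)

# With the id in the path, the request target stays specific to the item.
slash_target = request_target("http://bio2rdf.org/xml/pubmed:15548600")
print(slash_target)
```

With the # style, the server only ever sees the shared part of the
URI, so it cannot answer with just the one record that was asked for.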

Since Jonathan's specific question is about what to put between the
// and the first /, I would say that Purl.org is the best compromise
because it has the infrastructure already in place, is open, and
offers more neutral ground than other proxies like Bio2RDF.org because
it is Science Commons. Big data providers (Uniprot, NCBI, EBI, Kegg,
etc.) can probably do without it because they have the capability to
handle the data themselves (like Uniprot:
http://purl.uniprot.org/uniprot/P19367.rdf . Purl is in the URI, but
as a sub-domain of uniprot and not purl.org itself), but small
providers might find the purl.org solution a convenient way to create
and manage URIs. Purl.org (or Bio2RDF.org for some data providers) is
also a good way to retrieve RDF for providers that don't produce RDF
themselves yet: maybe someone elsewhere does, and we can redirect to
it while waiting for the official source to do it.

But what is between the // and the first / isn't that important in the
end. There will be many domains that provide RDF, be it as a proxy
that gives RDF from a non-RDF source or as an LSID resolver like
http://lsid.biopathways.org/resolver/. It is what comes after the
first / that is the problem. What I would really like to see is simply
a web page on a data provider's web site explaining how people should
refer to its content with URIs. The data provider would need to make
some kind of commitment to keeping these URIs as stable as possible.

A page like this on Uniprot would look like this:

To refer to a Uniprot item, write it this way:
http://purl.uniprot.org/<database>/<id><.service>
where database is one of {uniprot | citations | etc },
id is the identifier of the item, and .service is what we want to
receive for this id {xml | text | rdf | fasta | etc }. The whole
string must be in lowercase.
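
As a rough sketch of how a client could follow such a page, the
pattern above can be turned into a few lines of Python. The function
name uniprot_uri and the two sets of allowed values are assumptions
made for the illustration (the lists above end with "etc", so they are
incomplete). The sketch lowercases only the database and service
parts, because the example URI used elsewhere in this message keeps
the identifier P19367 in uppercase.

```python
# Values listed in the pattern above; the real lists would be longer.
KNOWN_DATABASES = {"uniprot", "citations"}
KNOWN_SERVICES = {"xml", "text", "rdf", "fasta"}


def uniprot_uri(database, item_id, service=None):
    """Build http://purl.uniprot.org/<database>/<id><.service>."""
    database = database.lower()
    if database not in KNOWN_DATABASES:
        raise ValueError(f"unknown database: {database}")

    uri = f"http://purl.uniprot.org/{database}/{item_id}"

    # The .service suffix is optional; without it we name the item itself.
    if service is not None:
        service = service.lower()
        if service not in KNOWN_SERVICES:
            raise ValueError(f"unknown service: {service}")
        uri += f".{service}"
    return uri


print(uniprot_uri("uniprot", "P19367", "rdf"))
# prints http://purl.uniprot.org/uniprot/P19367.rdf
```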

The same page from NCBI could look exactly like
http://view.ncbi.nlm.nih.gov/ but in the verb slot, we would add
different retrieval formats like rdf, xml, asn.1, etc.

If another data provider publishes a similar page and uses the
purl.org scheme instead of its own domain, so be it, as long as it is
documented in detail.

Now everyone who follows the rules about how to refer to an item
from a specific data provider with a URI will connect together
easily. This would render Bio2RDF mostly obsolete, because one of the
added values that Bio2RDF gives is the rewriting of URIs into its own
namespace so that they are consistent from one document to another,
creating a web of linked data where there was none.

For example, take this RDF document from Uniprot,
http://purl.uniprot.org/uniprot/P19367.rdf, and look at the entry
http://purl.uniprot.org/geneid/3098. If NCBI had published RDF
URIs for their data, the URI here might be
http://view.ncbi.nlm.nih.gov/gene/rdf/3098. This, without anything
added in between like an LSID resolver, a 303 redirect or a #, would
create linked data.

That being said, I know that NCBI doesn't provide an RDF version of
their data yet and that what I just wrote does not actually work, but
if I put this in the context of the draft, which is a recommendation
about best practices for minting URIs, it makes sense.
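
To illustrate the kind of rewriting involved, here is a minimal Python
sketch that maps the Uniprot-minted gene URI from the example to the
hypothetical NCBI form. As said above, the NCBI target URI does not
actually resolve; the function name to_ncbi_gene_uri and the regular
expression are assumptions for the illustration only.

```python
import re

# Matches the Uniprot-minted cross-reference URIs for NCBI gene ids.
GENEID_PATTERN = re.compile(r"^http://purl\.uniprot\.org/geneid/(\d+)$")


def to_ncbi_gene_uri(uri: str) -> str:
    """Rewrite a purl.uniprot.org gene id URI to the hypothetical NCBI one.

    URIs that do not match the pattern are returned unchanged.
    """
    match = GENEID_PATTERN.match(uri)
    if match is None:
        return uri
    return f"http://view.ncbi.nlm.nih.gov/gene/rdf/{match.group(1)}"


print(to_ncbi_gene_uri("http://purl.uniprot.org/geneid/3098"))
# prints http://view.ncbi.nlm.nih.gov/gene/rdf/3098
```

If each provider published its own URI rules, nobody would need to
maintain this kind of mapping in the first place.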

In conclusion, I support HTTP URIs. I strongly discourage Racine
Sharing. We can't control what will be between the // and the first /,
but as a recommendation for research centers without big IT budgets
that want to create new URIs as soon as possible, I would recommend
Purl.org. I'm for simple rules on a per-data-provider basis, available
on their web sites (these rules could also be written in RDF, I don't
see any problem with that). When a data manager has to create a
triplestore and knows he will write about PubMed papers and Uniprot
proteins, he goes to these sites and sees how to refer to these
entities with URIs. Now his triplestore is already usable as linked
data.

thanks,

Marc-Alexandre Nolin

P.S.: I apologize for my bad English. I hope my thinking isn't blurred
by it. If clarification is needed, just ask me.

2007/10/29, Jonathan Rees <jar@creativecommons.org>:
> On Oct 23, 2007, at 9:58 AM, Marc-Alexandre Nolin wrote:
>
> > Currently, I'm waiting for the publication of Jonathan URI
> > recommendation to add it to the Bio2RDF system. Adding the support to
> > the standardization effort doesn't mean to throw away the previous
> > working system :)
> >
> > Marc-Alexandre
>
> I appreciate your confidence!  I am hoping to release a draft of the
> URI note to HCLS at the end of this week. It would be extremely
> helpful to me if you would give your advice on common names for
> public database records.  I think you have seen the science commons
> proposal, and your comments on that would be interesting. I have a
> "major issue" page on this topic:
>    http://esw.w3.org/topic/HCLSIG_BioRDF_Subgroup/Tasks/
> URI_Best_Practices/Recommendations/PublicResources
>
> Since yours is the only other careful effort I know of along these
> lines, I'd be interested to know whether you recommend what you have
> for HCLS purposes, and what would be required to reconcile bio2rdf
> with purl.org/commons (besides finishing the implementation of the
> latter by making it yield RDF). I'm particularly interested in
> opinions on what goes between the // and the first /.
>
> Jonathan
>
>
Received on Thursday, 1 November 2007 06:32:09 UTC