Re: URI aliases and owl:sameAs was: Terminology Question concerning Web Architecture and Linked Data from Alan Ruttenberg on 2007-07-28 (semantic-web@w3.org from July 2007)

From: Alan Ruttenberg <alanruttenberg@gmail.com>
Date: Sat, 28 Jul 2007 00:27:59 -0400
To: Tim Berners-Lee <timbl@w3.org>
Cc: Chris Bizer <chris@bizer.de>, "SW-forum Web" <semantic-web@w3.org>, "Linking Open Data" <linking-open-data@simile.mit.edu>, "Jonathan A Rees" <jar@mumble.net>
Message-Id: <9E4454AC-1685-4E01-98E6-C765DCA829E8@gmail.com>
On Jul 27, 2007, at 2:06 PM, Tim Berners-Lee wrote:
> On 2007-07-27, at 01:18, Alan Ruttenberg wrote:
>> While it is true that <http://dbpedia.org/resource/Tim_Berners-Lee>
>> is also an identifier for me, there are good social reasons
>> why someone might want to use one or the other.  I might want to  
>> introduce myself to someone (or log in at a system) making sure  
>> that the other party will have access to certain information.

I'm not following this example. If you want to introduce yourself in  
RDF and control what the other party knows, create an information  
resource under your own domain saying what you want them to know and  
log in with that. However, I don't see why that information resource  
would need a different name for you to accomplish what it needs to do.

> I might make a link from my data to one or other on the basis that  
> I am more convinced that one or other will be well-maintained.

One or the other what? Data? If they are providing data then that's  
what they should say their URIs denote. I see no reason to get this  
mixed up with the business of creating a new proper name for something.

> I'm not going to persuade people to use one or the other.  I do  
> link the one I control to the dbpedia one with owl:sameAs.

I've noticed. My preference would be to get together with the people  
that I work with and agree on the name we will use to talk about  
things we need to talk about. AWWW 2.3.1 pretty much summarizes why
that's a good idea.

> You say then,
>
>> I guess I am arguing that it is always a bad idea to mint your own  
>> URI if you believe that some other URI names exactly the thing  
>> that you are about to name with yours. So if there is a URI that  
>> you are sure identifies a specific person, then use that instead  
>> of inventing a new one. On the other hand, if you want to mint a  
>> URI that is a resource *about* that person, according to you, then  
>> it's fine to mint one for that - no one else can claim to have  
>> exactly the same resource about that person.
>
> I disagree.   I think that in general, there should be a small  
> number of URIs. In general, yes, it is good to use a
> well-recognized one.  But there are cases when it makes sense to make an
> identifier.
>
> I gave a talk a while ago at crossref.org, which maintains the doi:  
> set of Digital Object Identifiers for books.  ("TIP: You can turn a
> DOI string into a URL by appending the DOI string to
> http://dx.doi.org/")   They said they had a big problem:  their databases
> contain information connecting books and authors. They use dois for  
> the books, but what can they use for people?  There is no central  
> registry for people. They have no right to invent identifiers for  
> people.  They had run into this problem because they were
> thinking centralized, not weblike.    They had the model that there
> should be one central name for a book (and they should run it).
> This breaks because other people have their IS for everything too  
> -- no one can practically socially be the one central truth, and  
> that would be a fragile system (socially and technically)  if they  
> were.  But the good news is that they still provide a very valuable
> function. They provide a source of stable URIs for books (alas no  
> RDF).
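
As an aside on the quoted TIP, the DOI-to-URL step can be sketched in a few lines of Python. This is only an illustration: the helper name `doi_to_url` and the sample DOI string are assumptions made for the example, and the only thing taken from the quoted text is the http://dx.doi.org/ prefix.

```python
from urllib.parse import quote

DOI_RESOLVER = "http://dx.doi.org/"  # prefix given in the quoted TIP


def doi_to_url(doi: str) -> str:
    """Append a DOI string to the resolver prefix (hypothetical helper)."""
    doi = doi.strip()
    # Accept an optional "doi:" scheme prefix, as in "doi:10.1000/182".
    if doi.lower().startswith("doi:"):
        doi = doi[4:]
    # Percent-encode characters that are unsafe in a URL path, keeping "/".
    return DOI_RESOLVER + quote(doi, safe="/")


# Sample DOI string used purely for illustration.
print(doi_to_url("doi:10.1000/182"))
```

Note that this only builds a name for the book; it says nothing about what to use for the people, which is the point under discussion.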

It seems to me that there is a whole bunch of ideas mixed together  
here. I'm not sure I've got it all, but I'll respond to what I can  
figure out.

1) I don't see how your story relates to your argument. You are  
arguing with my statement that aliases shouldn't be created, but
your illustration is of a case where there aren't names to alias to  
in the first place.

2) What does the name have to do with truth? A statement can be true,  
but a name either identifies something or it doesn't. It seems to me  
that if you believe that the name identifies the thing enough to say  
sameAs, then that pretty much settles it.  Now in the business I am  
in this tends not to be the case - it is more often the case that  
someone thinks they have identified something and named it but when I  
probe for a simple question: "What is it that you have identified?"  
they can't give me a straight answer. So in that case I don't use  
sameAs, since I tend to be pretty clear about what I want to say.

3) "This breaks because other people have their IS for everything  
too": I don't know what "IS" stands for here.

4) The very valuable function - providing a name that we all use to  
identify something, and being clear enough when they created those  
names that many people agree that this is a good name - is exactly  
what I want to encourage. Later on, after many people use the name,  
we have the desired robustness. People who depend on using that name  
will be sure to keep a definition around. People wondering what the  
name is will be able to look at usage and figure out what it's about.  
It doesn't really matter whether or not crossref serves any  
information about the name. In fact if they changed management and  
started publishing spam and bogus information at those URIs I expect  
the community would simply ignore them thereafter.
I fail to see any robustness in the system that will arise by any  
other means than this. Moreover I *can* see the damage that will  
arise if people willy nilly keep creating new URIs to denote the same  
thing. I thought that's what motivated 2.3.1

5) As far as the URIs for people go, if there were reasonable ones
already present then crossref would probably have used them. I expect
that while a small percentage of authors who are SW enthusiasts do have
such URIs, the cost of obtaining them was considered prohibitive. In  
these circumstances we could reasonably encourage crossref to create  
URIs for people so that they could have good coverage. If they did  
that job well enough, then I'd encourage other people to use those  
names, not invent new ones.

> The other good news is that crossref CAN make URIs for people.   
> They can perform the incredibly valuable function of  
> disambiguating, within that community, the various people with  
> similar names. They can make RDF IDs for them.  If they are very on
> the ball, they will even allow authors to store another RDF ID, like
> their FOAF ID, in the crossref database, just like allowing an
> author to link to their own homepage.

I think it is appropriate to use sameAs in such situations. And I've  
already said that I agree with the necessity of minting URIs if there  
aren't already good ones, or in this case, where the cost/benefit  
doesn't warrant them collecting the very small and not so easy to  
find set of already existing URIs.

> So, how does this relate to the Science commons?  I think the life  
> sciences folks should not hold their breath until there is a unique  
> identifier for each protein, and a unique concept for what a
> "protein" is exactly.

First, they aren't holding their breath (e.g.
http://purl.uniprot.org/uniprot/P42858). Second, the issue isn't what
the concept of protein is, but rather good engineering practice in
representing the contents of their databases in such a way as to
maximize their sensibility to a stupid machine. It's not that hard -
their curators have done all the heavy lifting to gather statements
about the variety of different sets of proteins (splice variants,
mutation variants) that each current record talks about. Now all that
is needed is to encode that information using the best current
standard SW language for the job: OWL.

> They should serve up the actual records about these things as  
> documents, with known provenance and features and failings.

No one has argued that they shouldn't. What I have argued is that  
they try to convince people that they have provided an adequate name
for a clearly delineated thing or set of things in the world. There are
people who want to do that - let them name the things, and Uniprot  
name the records.

Also note that Science Commons is trying to support science. I've  
spent the last bunch of years serving information to scientists who  
need better than what is currently being provided. Last week I spent
a couple of days at a workshop where a group of biologists,
experimentalists, and informaticians strategized how to organize and  
use information to help find drugs to treat and cure Huntington's  
disease. One of the cries: "Stop only recording that you used   
'Huntingtin' in your experiment. Tell me whether it was exon 1 of  
Huntingtin, or what fragment, or if it was wild type huntingtin, or
which mutated version..." I need to have names for these things in  
order to support this effort.

> So a protein may get Ids in  uniprot and in the Gene Ontology,  
> where the mapping isn't 100% crystal clear.

Look, that may be fine for you, and fine in some cases, but it isn't
fine in many cases. These decisions have consequences. Diamond
sellers won't tolerate fuzziness or looseness in the quality
definitions of their diamonds if that's what determines the price.  
It's not OK for them to just advertise "Diamond: good quality". They  
want to say I have a 3.2 carat pear cut IF clarity H color diamond,  
and if that is misinterpreted by a SW machine agent  to be a VS1  
clarity P color diamond in some transaction they would lose money.

> And then  mapping files can be provided where the mappings exist.  
> This allows each data source to change if necessary, as new   
> understandings arise.

What will the nature of these mappings be? It isn't a simple matter  
of saying this URI maps to that. There's a lot of structure - If one  
database defines 'Huntingtin' protein as any protein that is  
generated by rna that is transcribed from a genetic locus, then  
that's one set of things. If another defines it as a set of proteins  
that has a certain primary structure (i.e. specific sequence of amino  
acids), then that's another set of things. If another talks about
certain things that happen to the "mature" form of some protein  
(often a fragment or otherwise modified from the original) this is  
about a different set of things. The sets (classes) relate to each  
other in that some are subsets of others and some don't share any  
members.
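
To make the point about structure concrete, here is a minimal sketch in Python that treats each database's definition as a plain set and reports how two such sets relate. Everything in it is an assumption for illustration: the member labels are invented, the set names are hypothetical, and the memberships are not biological claims.

```python
# Invented members; each set stands for what one database might mean
# by 'Huntingtin'. The labels and memberships are illustrative only.
by_locus = {"form_a", "form_b", "form_c"}      # anything from the locus
by_sequence = {"form_a"}                        # one primary structure
mature_form = {"form_d"}                        # a modified fragment


def relate(a: set, b: set) -> str:
    """Describe how set a relates to set b."""
    if a == b:
        return "same set"
    if a < b:
        return "proper subset"
    if a > b:
        return "proper superset"
    if a.isdisjoint(b):
        return "disjoint"
    return "overlapping"


print(relate(by_sequence, by_locus))
print(relate(by_sequence, mature_form))
```

A mapping file that only says "this URI maps to that one" has no way to carry the difference between these outcomes, which is why the sets need to be defined before they are named.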

Databases like Uniprot and Entrez gene have enough information in  
them to define those sets in the first place. That's what I'm  
encouraging them to do. If they do that well enough I'd recommend  
that others that need to talk about these sets of things use the  
names that are minted in that process. That's just a matter of good  
engineering, the sort that is encouraged in 2.3.1

> The system must not be so rigidly connected that nothing can grow.

Where in anything I have said is there any indication that I am  
recommending something like that? I think that the strategy we're  
pursuing is causing pretty nice growth - DERI recently put out a map  
that showed some of the common classes and relations on the semantic  
web, and there were a surprising number of them that were generated  
in the process of doing the W3C HCLS demo. And we've just started.

> The service of the data should be maintained by the organization  
> which maintains the data, after an initial period when people  
> externally show them how it is done. (like biordf and bio2rdf).

There are several parts of this. What the names are, how one uses  
them to get information, and what that information is. I think it is  
very helpful, at least at the outset, if the name can be used in a  
simple way to guide one to the information about the name.

While I would like the information about the name to be served by an  
organization that collects and maintains such knowledge, my current  
thinking is that the names that science uses are important enough  
that they shouldn't be under the control of any specific provider. I  
think it ought to be more like collaborative ontology building, and  
the resultant names  be in the control of the community that depends  
on them. In the past we have suffered too many cases of 404 that we  
aren't in a position to repair. That's why I (in my role as a  
technical person at Science Commons) have chosen to use purls for  
names and means of locating information, and to suggest that the  
administration of these purls be by a community of users. Should some  
database of important entities disappear from the web, we at least  
then have a chance to resurrect a copy and redirect the purls to that  
copy without incurring the cost of minting a new set of names (well  
outlined in 2.3.1)
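
The repair step can be sketched as a toy in-memory model in Python. This is not how any real purl server is implemented; the class `PurlRegistry`, its method names, and the example.org URLs are all hypothetical, chosen only to show that the name stays fixed while its target changes.

```python
class PurlRegistry:
    """Toy model of persistent names that redirect to current locations."""

    def __init__(self):
        self._targets = {}

    def register(self, name, target):
        """Mint a name once; refuse to silently overwrite an existing one."""
        if name in self._targets:
            raise ValueError(f"{name} is already registered")
        self._targets[name] = target

    def retarget(self, name, new_target):
        """Point an existing name at a new location, e.g. a resurrected copy."""
        if name not in self._targets:
            raise KeyError(name)
        self._targets[name] = new_target

    def resolve(self, name):
        """Return the current location for a name (KeyError if unknown)."""
        return self._targets[name]


registry = PurlRegistry()
name = "http://purl.example.org/protein/P00001"  # hypothetical name
registry.register(name, "http://db.example.org/records/P00001")
# The original database disappears; the community redirects to a copy.
registry.retarget(name, "http://mirror.example.net/records/P00001")
print(registry.resolve(name))
```

Everyone who used the name keeps using it unchanged; only the administrators of the registry have to act.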

YMMV.

> Once these identifiers cease to be the one and only central ID,  
> then they can be minted pragmatically.

I don't understand this. Would it be possible to explain what you mean?

> I'm not going to jump up and down about the # vs /  but I think  
> when the data is presented as a set of records which I seem to  
> remember it is, the # approach might seem less weird to you.

As you know,  I am jumping up and down a bit about hashes being a bad  
idea. I won't repeat myself here - I think I've laid out why in  
previous email, and we've discussed this in person recently.

I'm not understanding what you are getting at about the records and  
connection to hash versus slash though.

Your thoughts on my response would, of course, be very much  
appreciated - thanks for your comments!

Best,
Alan
Received on Saturday, 28 July 2007 04:28:22 UTC