- From: Alan Ruttenberg <alanruttenberg@gmail.com>
- Date: Sat, 28 Jul 2007 00:27:59 -0400
- To: Tim Berners-Lee <timbl@w3.org>
- Cc: Chris Bizer <chris@bizer.de>, "SW-forum Web" <semantic-web@w3.org>, "Linking Open Data" <linking-open-data@simile.mit.edu>, "Jonathan A Rees" <jar@mumble.net>
On Jul 27, 2007, at 2:06 PM, Tim Berners-Lee wrote:
> On 2007-07-27, at 01:18, Alan Ruttenberg wrote:
>> While it is true that <http://dbpedia.org/resource/Tim_Berners-Lee>
>> is also an identifier for me, there are good social reasons
>> why someone might want to use one or the other. I might want to
>> introduce myself to someone (or log in at a system) making sure
>> that the other party will have access to certain information.
I'm not following this example. If you want to introduce yourself in
RDF and control what the other party knows, create an information
resource under your own domain saying what you want them to know and
log in with that. However, I don't see why that information resource
would need a different name for you to accomplish what it needs to do.
> I might make a link from my data to one or other on the basis that
> I am more convinced that one or other will be well-maintained.
One or the other what? Data? If they are providing data then that's
what they should say their URIs denote. I see no reason to get this
mixed up with the business of creating a new proper name for something.
> I'm not going to persuade people to use one or the other. I do
> link the one I control to the dbpedia one with owl:sameAs.
I've noticed. My preference would be to get together with the people
that I work with and agree on the name we will use to talk about
things we need to talk about. AWWW 2.3.1 pretty much summarizes why
that's a good idea.
> You say then,
>
>> I guess I am arguing that it is always a bad idea to mint your own
>> URI if you believe that some other URI names exactly the thing
>> that you are about to name with yours. So if there is a URI that
>> you are sure identifies a specific person, then use that instead
>> of inventing a new one. On the other hand, if you want to mint a
>> URI that is a resource *about* that person, according to you, then
>> it's fine to mint one for that - no one else can claim to have
>> exactly the same resource about that person.
>
> I disagree. I think that in general, there should be a small
> number of URIs. In general, yes, it is good to use a well-
> recognized one. But there are cases when it makes sense to make an
> identifier.
>
> I gave a talk a while ago at crossref.org, which maintains the doi:
> set of Digital Object Identifiers for books. ("TIP: You can turn a
> DOI string into a URL by appending the DOI string to
> http://dx.doi.org/") They said they had a big problem: their databases
> contain information connecting books and authors. They use dois for
> the books, but what can they use for people? There is no central
> registry for people. They have no right to invent identifiers for
> people. They had run into this problem because they were
> thinking centralized, not weblike. They had the model that there
> should be one central name for a book (and they should run it).
> This breaks because other people have their IS for everything too
> -- no one can practically socially be the one central truth, and
> that would be a fragile system (socially and technically) if they
> were. But the good news is that they still provide a very valuable
> function. They provide a source of stable URIs for books (alas no
> RDF).
It seems to me that there is a whole bunch of ideas mixed together
here. I'm not sure I've got it all, but I'll respond to what I can
figure out.
1) I don't see how your story relates to your argument. You are
arguing with my statement that aliases shouldn't be created, but
your illustration is of a case where there aren't names to alias to
in the first place.
2) What does the name have to do with truth? A statement can be true,
but a name either identifies something or it doesn't. It seems to me
that if you believe that the name identifies the thing enough to say
sameAs, then that pretty much settles it. Now in the business I am
in this tends not to be the case - it is more often the case that
someone thinks they have identified something and named it but when I
probe for a simple question: "What is it that you have identified?"
they can't give me a straight answer. So in that case I don't use
sameAs, since I tend to be pretty clear about what I want to say.
3) "This breaks because other people have their IS for everything
too": I don't know what "IS" stands for here.
4) The very valuable function - providing a name that we all use to
identify something, and being clear enough when they created those
names that many people agree that this is a good name - is exactly
what I want to encourage. Later on, after many people use the name,
we have the desired robustness. People who depend on using that name
will be sure to keep a definition around. People wondering what the
name is will be able to look at usage and figure out what it's about.
It doesn't really matter whether or not crossref serves any
information about the name. In fact if they changed management and
started publishing spam and bogus information at those URIs I expect
the community would simply ignore them thereafter.
I fail to see any robustness in the system that will arise by any
other means than this. Moreover I *can* see the damage that will
arise if people willy nilly keep creating new URIs to denote the same
thing. I thought that's what motivated 2.3.1.
5) As far as the URIs for people go, if there were reasonable ones
already present then crossref would probably have used them. I expect
that while a small percentage of authors who are SW enthusiasts do have
such URIs, the cost of obtaining them was considered prohibitive. In
these circumstances we could reasonably encourage crossref to create
URIs for people so that they could have good coverage. If they did
that job well enough, then I'd encourage other people to use those
names, not invent new ones.
> The other good news is that crossref CAN make URIs for people.
> They can perform the incredibly valuable function of
> disambiguating, within that community, the various people with
> similar names. They can make RDF IDs for them. If they are very on
> the ball, they will even allow authors to store another RDF ID, like
> their FOAF ID, in the crossref database, just like allowing an
> author to link to their own homepage.
I think it is appropriate to use sameAs in such situations. And I've
already said that I agree with the necessity of minting URIs if there
aren't already good ones, or in this case, where the cost/benefit
doesn't warrant them collecting the very small and not so easy to
find set of already existing URIs.
> So, how does this relate to the Science commons? I think the life
> sciences folks should not hold their breath until there is a unique
> identifier for each protein, and a unique concept for what a
> "protein" is exactly.
First, they aren't holding their breath (e.g.
http://purl.uniprot.org/uniprot/P42858). Second, the issue isn't what
the concept of protein is, but rather good engineering practice in
representing the contents of their databases in such a way as to
maximize their sensibility to a stupid machine. It's not that hard -
their curators have done all the heavy lifting to gather statements
about the variety of different sets of proteins (splice variants,
mutation variants) that each current record talks about. Now all that
is needed is to encode that information using the best current
standard SW language for the job: OWL.
> They should serve up the actual records about these things as
> documents, with known provenance and features and failings.
No one has argued that they shouldn't. What I have argued is that
they try to convince people that they have provided an adequate name
for a clearly delineated thing or set of things in the world. There are
people who want to do that - let them name the things, and Uniprot
name the records.
Also note that Science Commons is trying to support science. I've
spent the last bunch of years serving information to scientists who
need better than what is currently being provided. Last week I spent
a couple of days at a workshop where a group of biologists,
experimentalists, and informaticians strategized how to organize and
use information to help find drugs to treat and cure Huntington's
disease. One of the cries: "Stop only recording that you used
'Huntingtin' in your experiment. Tell me whether it was exon 1 of
Huntingtin, or what fragment, or if it was wild type huntingtin, or
which mutated version..." I need to have names for these things in
order to support this effort.
> So a protein may get Ids in uniprot and in the Gene Ontology,
> where the mapping isn't 100% crystal clear.
Look, that may be fine for you, and fine in some cases, but it isn't
fine in many cases. These decisions have consequences. Diamond
sellers won't tolerate fuzziness or looseness in the quality
definitions of their diamonds if that's what determines the price.
It's not OK for them to just advertise "Diamond: good quality". They
want to say I have a 3.2 carat pear cut IF clarity H color diamond,
and if that is misinterpreted by a SW machine agent to be a VS1
clarity P color diamond in some transaction they would lose money.
> And then mapping files can be provided where the mappings exist.
> This allows each data source to change if necessary, as new
> understandings arise.
What will the nature of these mappings be? It isn't a simple matter
of saying this URI maps to that. There's a lot of structure - If one
database defines 'Huntingtin' protein as any protein that is
generated by rna that is transcribed from a genetic locus, then
that's one set of things. If another defines it as a set of proteins
that has a certain primary structure (i.e. specific sequence of amino
acids), then that's another set of things. If another talks about
certain things that happen to the "mature" form of some protein
(often a fragment or otherwise modified from the original) this is
about a different set of things. The sets (classes) relate to each
other in that some are subsets of others and some don't share any
members.
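To make the point about structure concrete, here is a minimal Python
sketch. The member labels and the three definitions are hypothetical,
invented only for illustration and not taken from any database. It
shows that the relation between two differently defined sets can be
subset or disjoint rather than identity, which is why a flat
"this URI maps to that" file loses information.

```python
# Hypothetical protein instances; the labels are illustrative only.
from_locus = {"full_length_wt", "full_length_mutant", "splice_variant_a"}
by_sequence = {"full_length_wt"}       # one specific primary structure
mature_form = {"cleaved_fragment"}     # a modified form of the original

def relate(a, b):
    """Name the set-theoretic relation of a to b."""
    if a == b:
        return "same set"
    if a < b:
        return "subset"
    if a > b:
        return "superset"
    if a.isdisjoint(b):
        return "disjoint"
    return "overlapping"

def same_as_justified(a, b):
    """Only identical sets would justify a sameAs-style mapping."""
    return a == b

print(relate(by_sequence, from_locus))   # subset
print(relate(by_sequence, mature_form))  # disjoint
```

In this toy case none of the three pairs would justify sameAs, even
though all three might carry the label 'Huntingtin' in their source.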
Databases like Uniprot and Entrez gene have enough information in
them to define those sets in the first place. That's what I'm
encouraging them to do. If they do that well enough I'd recommend
that others that need to talk about these sets of things use the
names that are minted in that process. That's just a matter of good
engineering, the sort that is encouraged in 2.3.1.
> The system must not be so rigidly connected that nothing can grow.
Where in anything I have said is there any indication that I am
recommending something like that? I think that the strategy we're
pursuing is causing pretty nice growth - DERI recently put out a map
that showed some of the common classes and relations on the semantic
web, and there were a surprising number of them that were generated
in the process of doing the W3C HCLS demo. And we've just started.
> The service of the data should be maintained by the organization
> which maintains the data, after an initial period when people
> externally show them how it is done. (like biordf and bio2rdf).
There are several parts of this. What the names are, how one uses
them to get information, and what that information is. I think it is
very helpful, at least at the outset, if the name can be used in a
simple way to guide one to the information about the name.
While I would like the information about the name to be served by an
organization that collects and maintains such knowledge, my current
thinking is that the names that science uses are important enough
that they shouldn't be under the control of any specific provider. I
think it ought to be more like collaborative ontology building, and
the resultant names be in the control of the community that depends
on them. In the past we have suffered too many cases of 404 that we
aren't in a position to repair. That's why I (in my role as a
technical person at Science Commons) have chosen to use purls for
names and means of locating information, and to suggest that the
administration of these purls be by a community of users. Should some
database of important entities disappear from the web, we at least
then have a chance to resurrect a copy and redirect the purls to that
copy without incurring the cost of minting a new set of names (well
outlined in 2.3.1).
YMMV.
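A rough Python sketch of the purl idea, under stated assumptions: the
table, the example.org URLs, and the function names are all
hypothetical and do not describe how any real purl server is
implemented. The point it illustrates is that the name stays fixed
while the community that administers the table repoints it at a
resurrected copy.

```python
# Hypothetical purl table: persistent name -> current location of the data.
purl_table = {
    "http://purl.example.org/protein/P42858":
        "http://db.example.org/records/P42858",
}

def resolve(purl):
    """Return the location a purl currently redirects to, or None if unknown."""
    return purl_table.get(purl)

def retarget(old_prefix, new_prefix):
    """Repoint every purl whose target starts with old_prefix to new_prefix.

    Returns the number of purls changed. The names themselves never change.
    """
    changed = 0
    for name, target in purl_table.items():
        if target.startswith(old_prefix):
            purl_table[name] = new_prefix + target[len(old_prefix):]
            changed += 1
    return changed

# The original database disappears; the community redirects to a mirror.
retarget("http://db.example.org/", "http://mirror.example.org/")
print(resolve("http://purl.example.org/protein/P42858"))
```

Anyone who linked to the purl keeps working links after the retarget;
no new set of names has to be minted.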
> Once these identifiers cease to be the one and only central ID,
> then they can be minted pragmatically.
I don't understand this. Would it be possible to explain what you mean?
> I'm not going to jump up and down about the # vs / but I think
> when the data is presented as a set of records which I seem to
> remember it is, the # approach might seem less weird to you.
As you know, I am jumping up and down a bit about hashes being a bad
idea. I won't repeat myself here - I think I've laid out why in
previous email, and we've discussed this in person recently.
I'm not understanding what you are getting at about the records and
connection to hash versus slash though.
Your thoughts on my response would, of course, be very much
appreciated - thanks for your comments!
Best,
Alan
Received on Saturday, 28 July 2007 04:28:22 UTC