- From: Alan Ruttenberg <alanruttenberg@gmail.com>
- Date: Sat, 28 Jul 2007 00:27:59 -0400
- To: Tim Berners-Lee <timbl@w3.org>
- Cc: Chris Bizer <chris@bizer.de>, "SW-forum Web" <semantic-web@w3.org>, "Linking Open Data" <linking-open-data@simile.mit.edu>, "Jonathan A Rees" <jar@mumble.net>
On Jul 27, 2007, at 2:06 PM, Tim Berners-Lee wrote:

> On 2007-07-27, at 01:18, Alan Ruttenberg wrote:
>> While it is true that <http://dbpedia.org/resource/Tim_Berners-Lee>
>> is also an identifier for me, there are good social reasons why
>> someone might want to use one or the other. I might want to
>> introduce myself to someone (or log in at a system) making sure
>> that the other party will have access to certain information.

I'm not following this example. If you want to introduce yourself in RDF and control what the other party knows, create an information resource under your own domain saying what you want them to know and log in with that. However, I don't see why that information resource would need a different name for you to accomplish what it needs to do.

> I might make a link from my data to one or other on the basis that
> I am more convinced that one or other will be well-maintained.

One or the other what? Data? If they are providing data then that's what they should say their URIs denote. I see no reason to get this mixed up with the business of creating a new proper name for something.

> I'm not going to persuade people to use one or the other. I do
> link the one I control to the dbpedia one with owl:sameAs.

I've noticed. My preference would be to get together with the people that I work with and agree on the name we will use to talk about things we need to talk about. AWWW 2.3.1 pretty much summarizes why that's a good idea.

> You say then,
>
>> I guess I am arguing that it is always a bad idea to mint your own
>> URI if you believe that some other URI names exactly the thing
>> that you are about to name with yours. So if there is a URI that
>> you are sure identifies a specific person, then use that instead
>> of inventing a new one.
>> On the other hand, if you want to mint a URI that is a resource
>> *about* that person, according to you, then it's fine to mint one
>> for that - no one else can claim to have exactly the same resource
>> about that person.
>
> I disagree. I think that in general, there should be a small
> number of URIs. In general, yes, it is good to use a
> well-recognized one. But there are cases when it makes sense to
> make an identifier.
>
> I gave a talk a while ago at crossref.org, which maintains the doi:
> set of Digital Object Identifiers for books. ("TIP: You can turn a
> DOI string into a URL by appending the DOI string to
> http://dx.doi.org/") They said they had a big problem: their
> databases contain information connecting books and authors. They
> use dois for the books, but what can they use for people? There is
> no central registry for people. They have no right to invent
> identifiers for people. They had run into this problem because they
> were thinking centralized, not weblike. They had the model that
> there should be one central name for a book (and they should run
> it). This breaks because other people have their IS for everything
> too -- no one can practically socially be the one central truth,
> and that would be a fragile system (socially and technically) if
> they were. But the good news is that they still provide a very
> valuable function. They provide a source of stable URIs for books
> (alas no RDF).

It seems to me that there is a whole bunch of ideas mixed together here. I'm not sure I've got it all, but I'll respond to what I can figure out.

1) I don't see how your story relates to your argument. You are arguing with my statement that aliases shouldn't be created, but your illustration is of a case where there aren't names to alias to in the first place.

2) What does the name have to do with truth? A statement can be true, but a name either identifies something or it doesn't.
It seems to me that if you believe that the name identifies the thing enough to say sameAs, then that pretty much settles it. Now in the business I am in this tends not to be the case - it is more often the case that someone thinks they have identified something and named it, but when I probe with a simple question: "What is it that you have identified?" they can't give me a straight answer. So in that case I don't use sameAs, since I tend to be pretty clear about what I want to say.

3) "This breaks because other people have their IS for everything too": I don't know what "IS" stands for here.

4) The very valuable function - providing a name that we all use to identify something, and being clear enough when they created those names that many people agree that this is a good name - is exactly what I want to encourage. Later on, after many people use the name, we have the desired robustness. People who depend on using that name will be sure to keep a definition around. People wondering what the name is will be able to look at usage and figure out what it's about. It doesn't really matter whether or not crossref serves any information about the name. In fact, if they changed management and started publishing spam and bogus information at those URIs, I expect the community would simply ignore them thereafter. I fail to see any robustness in the system that will arise by any other means than this. Moreover, I *can* see the damage that will arise if people willy-nilly keep creating new URIs to denote the same thing. I thought that's what motivated 2.3.1.

5) As far as the URIs for people go, if there were reasonable ones already present then crossref would probably have used them. I expect that while a small percentage of authors who are SW enthusiasts do have such URIs, the cost of obtaining them was considered prohibitive. In these circumstances we could reasonably encourage crossref to create URIs for people so that they could have good coverage.
If they did that job well enough, then I'd encourage other people to use those names, not invent new ones.

> The other good news is that crossref CAN make URIs for people.
> They can perform the incredibly valuable function of
> disambiguating, within that community, the various people with
> similar names. They can make RDF IDs for them. If they are very on
> the ball, they will even allow an author to store another RDF ID,
> like their FOAF ID, in the crossref database, just like allowing an
> author to link to their own homepage.

I think it is appropriate to use sameAs in such situations. And I've already said that I agree with the necessity of minting URIs if there aren't already good ones, or in this case, where the cost/benefit doesn't warrant them collecting the very small and not so easy to find set of already existing URIs.

> So, how does this relate to the Science commons? I think the life
> sciences folks should not hold their breath until there is a unique
> identifier for each protein, and a unique concept for what a
> "protein" is exactly.

First, they aren't holding their breath (e.g. http://purl.uniprot.org/uniprot/P42858). Second, the issue isn't what the concept of protein is, but rather good engineering practice in representing the contents of their databases in such a way as to maximize their sensibility to a stupid machine. It's not that hard - their curators have done all the heavy lifting to gather statements about the variety of different sets of proteins (splice variants, mutation variants) that each current record talks about. Now all that is needed is to encode that information using the best current standard SW language for the job: OWL.

> They should serve up the actual records about these things as
> documents, with known provenance and features and failings.

No one has argued that they shouldn't.
What I have argued is that they try to convince people that they have provided an adequate name for a clearly delineated thing or set of things in the world. There are people who want to do that - let them name the things, and Uniprot name the records.

Also note that Science Commons is trying to support science. I've spent the last bunch of years serving information to scientists who need better than what is currently being provided. Last week I spent a couple of days at a workshop where a group of biologists, experimentalists, and informaticians strategized how to organize and use information to help find drugs to treat and cure Huntington's disease. One of the cries: "Stop only recording that you used 'Huntingtin' in your experiment. Tell me whether it was exon 1 of Huntingtin, or what fragment, or if it was wild type huntingtin, or which mutated version..." I need to have names for these things in order to support this effort.

> So a protein may get Ids in uniprot and in the Gene Ontology,
> where the mapping isn't 100% crystal clear.

Look, that may be fine for you, and fine in some cases, but it isn't fine in many cases. These decisions have consequences. Diamond sellers won't tolerate fuzziness or looseness in the quality definitions of their diamonds if that's what determines the price. It's not OK for them to just advertise "Diamond: good quality". They want to say "I have a 3.2 carat, pear cut, IF clarity, H color diamond", and if that is misinterpreted by a SW machine agent to be a VS1 clarity, P color diamond in some transaction, they would lose money.

> And then mapping files can be provided where the mappings exist.
> This allows each data source to change if necessary, as new
> understandings arise.

What will the nature of these mappings be? It isn't a simple matter of saying this URI maps to that.
There's a lot of structure. If one database defines 'Huntingtin' protein as any protein that is generated by RNA that is transcribed from a genetic locus, then that's one set of things. If another defines it as a set of proteins that has a certain primary structure (i.e. a specific sequence of amino acids), then that's another set of things. If another talks about certain things that happen to the "mature" form of some protein (often a fragment or otherwise modified from the original), this is about a different set of things. The sets (classes) relate to each other in that some are subsets of others and some don't share any members. Databases like Uniprot and Entrez Gene have enough information in them to define those sets in the first place. That's what I'm encouraging them to do. If they do that well enough, I'd recommend that others who need to talk about these sets of things use the names that are minted in that process. That's just a matter of good engineering, the sort that is encouraged in 2.3.1.

> The system must not be so rigidly connected that nothing can grow.

Where in anything I have said is there any indication that I am recommending something like that? I think that the strategy we're pursuing is causing pretty nice growth - DERI recently put out a map that showed some of the common classes and relations on the semantic web, and there were a surprising number of them that were generated in the process of doing the W3C HCLS demo. And we've just started.

> The service of the data should be maintained by the organization
> which maintains the data, after an initial period when people
> externally show them how it is done. (like biordf and bio2rdf).

There are several parts to this: what the names are, how one uses them to get information, and what that information is. I think it is very helpful, at least at the outset, if the name can be used in a simple way to guide one to the information about the name.
While I would like the information about the name to be served by an organization that collects and maintains such knowledge, my current thinking is that the names that science uses are important enough that they shouldn't be under the control of any specific provider. I think it ought to be more like collaborative ontology building, with the resultant names in the control of the community that depends on them. In the past we have suffered too many cases of 404 that we aren't in a position to repair. That's why I (in my role as a technical person at Science Commons) have chosen to use purls for names and as the means of locating information, and to suggest that the administration of these purls be by a community of users. Should some database of important entities disappear from the web, we at least then have a chance to resurrect a copy and redirect the purls to that copy without incurring the cost of minting a new set of names (well outlined in 2.3.1). YMMV.

> Once these identifiers cease to be the one and only central ID,
> then they can be minted pragmatically.

I don't understand this. Would it be possible to explain what you mean?

> I'm not going to jump up and down about the # vs / but I think
> when the data is presented as a set of records which I seem to
> remember it is, the # approach might seem less weird to you.

As you know, I am jumping up and down a bit about hashes being a bad idea. I won't repeat myself here - I think I've laid out why in previous email, and we've discussed this in person recently. I'm not understanding what you are getting at about the records and the connection to hash versus slash, though.

Your thoughts on my response would, of course, be very much appreciated - thanks for your comments!

Best,
Alan
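
To illustrate the purl point made above, here is a minimal sketch of the indirection involved: names stay fixed while a community-run table decides where each one currently resolves. Everything in it is an assumption made for illustration, including the `PurlRegistry` class, its method names, and the example.org, example.net URLs; a real PURL server answers with HTTP redirects rather than returning strings.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class PurlRegistry:
    """Hypothetical community-run table: stable name -> current location."""

    targets: dict[str, str] = field(default_factory=dict)

    def register(self, purl: str, target: str) -> None:
        # Mint a name once and record where it currently resolves.
        self.targets[purl] = target

    def resolve(self, purl: str) -> str:
        # Stand-in for the HTTP redirect a real PURL server would issue.
        return self.targets[purl]

    def repoint_prefix(self, old_prefix: str, new_prefix: str) -> int:
        """Redirect every name served from old_prefix to a resurrected copy.

        The names themselves are untouched; only their targets change.
        Returns the number of names that were repointed.
        """
        moved = 0
        for purl, target in self.targets.items():
            if target.startswith(old_prefix):
                self.targets[purl] = new_prefix + target[len(old_prefix):]
                moved += 1
        return moved


# Illustrative usage with made-up URLs.
registry = PurlRegistry()
name = "http://purl.example.org/protein/P42858"
registry.register(name, "http://db.example.org/records/P42858")

# The original database disappears; the community points the names at a mirror.
registry.repoint_prefix("http://db.example.org/", "http://mirror.example.net/")
print(registry.resolve(name))  # http://mirror.example.net/records/P42858
```

Anyone who used the name in their own data needs no changes after the repointing, which is the cost that minting a fresh set of names would otherwise impose.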
Received on Saturday, 28 July 2007 04:28:22 UTC