Re: URI aliases and owl:sameAs was: Terminology Question concerning Web Architecture and Linked Data from Alan Ruttenberg on 2007-07-27 (semantic-web@w3.org from July 2007)

From: Alan Ruttenberg <alanruttenberg@gmail.com>
Date: Fri, 27 Jul 2007 01:18:41 -0400
To: Chris Bizer <chris@bizer.de>
Cc: "SW-forum Web" <semantic-web@w3.org>, "Linking Open Data" <linking-open-data@simile.mit.edu>, "Jonathan A Rees" <jar@mumble.net>
Message-Id: <E453D7D9-E2D3-428A-90D3-E7723B19DCA7@gmail.com>
Conversation continued. Edited out some stuff [...] to prevent  
blowup, but I don't know how to do this well enough, so this is a  
long message with new comments interspersed in the conversation.

On Jul 26, 2007, at 6:46 AM, Chris Bizer wrote:
>> On Jul 23, 2007, at 3:23 AM, Chris Bizer wrote:
>>> Some time Alan Ruttenberg wrote:
>>>> So you have two novel claims:
>>>> 1) It is better to mint your own URI than to use one that you  
>>>> know to identify the same resource.
>>>> 2) It is better to attach "different views and opinions" about a  
>>>> known resource to a newly minted URI that you state is  
>>>> owl:sameAs some other rather than using an alternative mechanism  
>>>> for doing so,  one of which might be the one I suggested.
>>>
>>> I basically see four arguments in favour of my point:
>>> 1. Practicability: There is no commonly accepted infrastructure  
>>> in place that allows applications to find out the single URI that  
>>> should be used by everybody to identify a resource. [...]
>>
>> This is the necessary evil argument. ... But that doesn't mean  
>> that one shouldn't put some serious thought about the practical  
>> difficulties of dealing with a world where there are multiple  
>> names for things, and therefore use the capability sparingly.
>
> Agreed. I think our different view might result from the different  
> domains we are working in.
> You are working on biology and life science and as I understood  
> there are some "kind of" commonly accepted naming schemata for this  
> domain. I say "kind of" because I was the "URL +1, LSID -1" thread  
> with 148 replies on your mailing list :-)

heh. They're a passionate crowd, not unlike you LODites :)

> As I'm working on Dbpedia and the LOD project and we are touching  
> lots of domains without accepted naming schemata there.

That's not too different. But we're working hard in the bio world to  
come to some sort of agreement about how to name things we all want  
to talk about. I'm not so sure that's not a good idea for you folks  
to consider as well. Wikipedia is a big and potentially pretty good  
source of names to use and reuse, for starters.

> Also note our argument in the tutorial, that if you have a commonly  
> accepted schema like ISBN or ISIN numbers for your domain, put these
> numbers into your URIs to ease automated link generation, but still
> use your own URIs so that people can dereference them to YOUR
> description of the book or stock. (see
> http://sites.wiwiss.fu-berlin.de/suhl/bizer/pub/LinkedDataTutorial/#autogenerateLinks)

Right. I say truth in advertising. If your URI represents a  
"description of" then say that that's what the URI denotes, rather  
than saying it denotes the thing itself.

> It would be interesting to discuss in which cases this is a good  
> practice and in which you should use URIs that have been minted by  
> somebody else.

I guess I am arguing that it is always a bad idea to mint your own  
URI if you believe that some other URI names exactly the thing that  
you are about to name with yours. So if there is a URI that you are  
sure identifies a specific person, then use that instead of inventing  
a new one. On the other hand, if you want to mint a URI that is a  
resource *about* that person, according to you, then it's fine to  
mint one for that - no one else can claim to have exactly the same  
resource about that person.

But the name for a description of a thing is not the name of the  
thing. This:

http://www.amazon.com/How-Cook-Everything-Mark-Bittman/dp/076456756X/ref=pd_bbs_sr_1/103-5470818-1655801?ie=UTF8&s=books&qid=1185507929&sr=8-1

is the "name" of a resource about a (set of) books about cooking.

This: urn:076456756X is the "name" of a set of books about cooking.

This: http://reliant.teknowledge.com/DAML/SUMO.owl#Cooking names (or
at least intends to name) a set of processes in which cooking happens.

Do we agree? My complaint is that it is not a good idea to either a)
make another name for
http://reliant.teknowledge.com/DAML/SUMO.owl#Cooking or b) confuse
things by wanting to say that a resource about cooking is the same as
cooking.

>>> 2. Provenance Tracking: If you mint your own URIs you can back  
>>> them up with RDF descriptions, which makes it easy to track who  
>>> said what on the Semantic Web, as there is only one authoritative  
>>> information provider for each URI.
>>
>> I would prefer an explicit mechanism for tracking provenance, such  
>> as  a vocabulary and protocol for doing so,
>
> Jeremy Carroll, Pat Hayes, Patrick Stickler and I developed such a
> vocabulary a while ago.
> See "Semantic Web Publishing Vocabulary" in the Named Graphs paper  
> [...]
> Requiring everybody to use such a vocabulary would be nice, but  
> does not scale to the complete Web in my opinion.

First question: Why? Second question: Why require everyone do  
something? How about just the LODites? One step at a time.

>> rather than one which  conflicts with another element of your usage.
>
> Where do you see conflicts?

Punning the use of a URI to name a thing, and also to name the  
opinion of an information provider about the thing.

> I think this approach works nicely together with using the Named  
> Graphs data model on the client to represent information that has  
> been retrieved from the Web by dereferencing URIs. See "Semantic
> Web Client Library" and the DISCO browser for applications that do  
> exactly this.

Sorry my stack has overflowed - I can't follow how to connect all the  
pieces to understand how one thing relates to the other. Let's talk  
about this when we meet in person perhaps. Or maybe lay out the  
connection in small steps for me if you are in the mood (but in a new  
thread :)

>>> 3. Discovery: When you know that two URIs refer to the same non-  
>>> information resource, it is extremely easy and does not require  
>>> any new technical infrastructure to retrieve information about  
>>> this resource from the Web: Just dereference both URIs.
>>
>> When you know.
>
> Sure, when you know. That's why we recommend in our tutorial to
> properly interlink your data with other data.
> If enough people do this, then you know.
> [....]

We are going in circles. You don't have to figure it out after the  
fact if you use the same name in the first place. I'm simply arguing  
for not suggesting otherwise. If you can't, you can't. If you can,  
don't  teach people that it is a reasonable choice not to.

>>> 4. Information Quality: Information providers will not set  
>>> owl:sameAs links to minor quality information provided by  
>>> somebody else about the same non-information resource. Therefore  
>>> setting an owl:sameAs link implies a quality judgement and a
>>> client can use these judgements to assess information quality  
>>> using an algorithm like PageRank.
>>
>> There is no basis for this assertion. sameAs isn't a statement  
>> about information quality. It is a statement about identity.
>
> No, an owl:sameAs statement connecting information from two
> different data sources, for instance
> http://dbpedia.org/resource/Harry_Potter_and_the_Half-Blood_Prince  
> owl:sameAs
> http://www4.wiwiss.fu-berlin.de/bookmashup/books/0747581088, is an  
> external RDF link (in the same way that all other statements that  
> connect different data sources are RDF links).

Then there is no reason to use owl:sameAs. As any rdf link will do  
for this purpose, you might as well use one you popularize, provide a  
definition that works, and then use that. Call it lod:sameAs, say
exactly what is intended by its meaning and then use it in that way.
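
The difference between an identity claim and a plain link can be made
concrete with a minimal Python sketch. It is an illustration only, not
the behaviour of any particular reasoner: triples are plain tuples,
ex:rating is a made-up property, and lod:sameAs is assumed
(hypothetically) to be defined as a mere pointer to related data with
no identity semantics.

```python
OWL_SAME_AS = "owl:sameAs"
LOD_LINK = "lod:sameAs"  # hypothetical property; assumed to carry no identity semantics

DBPEDIA_BOOK = "http://dbpedia.org/resource/Harry_Potter_and_the_Half-Blood_Prince"
BOOKMASHUP_BOOK = "http://www4.wiwiss.fu-berlin.de/bookmashup/books/0747581088"


def entail(triples):
    """Return the triples plus everything that follows from owl:sameAs substitution."""
    triples = set(triples)
    changed = True
    while changed:
        changed = False
        pairs = {(s, o) for s, p, o in triples if p == OWL_SAME_AS}
        pairs |= {(o, s) for s, o in pairs}  # owl:sameAs is symmetric
        new = set()
        for a, b in pairs:
            for s, p, o in triples:
                if s == a:
                    new.add((b, p, o))  # substitute the alias in subject position
                if o == a:
                    new.add((s, p, b))  # substitute the alias in object position
        if not new <= triples:
            triples |= new
            changed = True
    return triples


# One provider's opinion, attached to its own URI (ex:rating is invented for this example).
opinion = {(DBPEDIA_BOOK, "ex:rating", "5 stars")}

with_identity = entail(opinion | {(DBPEDIA_BOOK, OWL_SAME_AS, BOOKMASHUP_BOOK)})
with_link = entail(opinion | {(DBPEDIA_BOOK, LOD_LINK, BOOKMASHUP_BOOK)})

leaked = (BOOKMASHUP_BOOK, "ex:rating", "5 stars")
print(leaked in with_identity)  # True
print(leaked in with_link)      # False
```

Under owl:sameAs the opinion attached to one URI becomes a statement
about the other URI as well, so "different views" cannot be kept apart
once identity is asserted. A link property with weaker, explicitly
stated semantics does not have this effect.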

> As information providers usually do not set RDF links to  
> information that sucks, setting an RDF Links implies some quality  
> judgement.

You obviously haven't been looking at the same kind of data I've been  
looking at for the last 10 years. Information providers routinely  
provide sucky information. It ends up sucking for a bunch of reasons:
- integration of multiple sources based on faulty assumptions, which
  removes provenance tracking information
- misunderstanding of the domain
- programming errors in systems which have no sanity checks or means
  of validation
- buggy format converters (like Excel)
- spelling and other silly errors.
....

Even people who care considerably about their work routinely
accumulate errors. Have a look at the errors uncovered with some
pretty simple OWLing we did here:
http://bio.freelogy.org/wiki/Debugging_the_bug#OWL_files.2C_patches.2C_and_matches

> This is exactly the same as with hypertext links on the classic  
> document web. You can use these links as quality indicator. This is  
> how Google made its first billion.

Yes, but I wouldn't want my machine to make any important decisions
based on a Google search. Perhaps here is where our approaches differ.
I'm not interested in browsing the semantic web. I want my machine to
do work using the semantic web. Browsing for humans can be improved,
but I don't think all the effort of the SW is worth the improvement
we will get. The improvement will be in enabling our machines to do
new things because we've structured the information in a way that
lets us write reliable programs using it.

>> You could  argue, within a community, that it serve this purpose,  
>> and then, with  adequate advertising agree among you to do things  
>> this way, but I  think that asserting that this is the case in a  
>> tutorial for naive  users is somewhat misleading. Certainly this  
>> will not be understood  by the SW community at large.
>
> I don't think so, the Semantic Web community knows a lot about WEB
> technology and they will understand.
> One just has to look at examples from the classic document Web:
> Google and alike.

I think we were talking past each other on this. You seem to be  
talking about links in general, of which owl:sameAs is just some  
link. I'm talking about owl:sameAs as a very specific type of link with
an established meaning. I assumed that you added, to that specific  
meaning, something about how you should infer something about the  
quality of some information.

>>> I also do not say that you should always mint your own URIs. Note  
>>> that we also have an example where somebody reuses an existing  
>>> URI and provides non-authoritative information about a resource  
>>> within our Linked Data tutorial
>>> (http://sites.wiwiss.fu-berlin.de/suhl/bizer/pub/LinkedDataTutorial/#deref).
>>
>> Just so you know where I stand, I find this whole "authority"  
>> business confusing and ripe for misunderstanding, AWWW or not. It  
>> is clear to me that authority doesn't define meaning or correctness.
>
> The Web is by definition a distributed open system and that's why you
> will not be able to avoid the "whole authority business" ;-)

Let's not talk in generalities. I meant specifically that the  
accounting of authority according to the AWWW doesn't make sense to  
me - that the properties ascribed to the authorities don't seem like  
things that the authorities have the authority or the means to enforce.

>> Absent that, we need to be careful about saying exactly what this  
>> authority confers. In your tutorial section 5 I honestly don't  
>> understand what differs about the status of the statements in the  
>> authoritative versus non-authoritative versions.
>
> Just translate authority with URI owner. In AWWW, the URI owner has  
> a special position as he is the only entity that can officially  
> define what a URI refers to. Therefore, clients can treat  
> information from an URI authority as more trustworthy if they like.

They can do anything that they like. But I looked at the sort of  
statements that were in your example and they both seemed to carry  
exactly the same weight to me. The appeal to authority in the one  
example had exactly no effect on how much I trusted the data, how  
much I judged it to be authoritative. I claim it will not and should  
not make a difference for anyone else.

Trustworthiness can't be asserted, it must be earned.

> The interesting thing about Web architecture is that they can do
> this, but that they can also do other things, for instance prefer
> information from a third, in their view highly trustworthy,
> information provider over the information from the URI authority.
> This point is often forgotten in our discussions, but a fact of life.

Ah, here we are in more agreement. My claim is that the magnitude of  
this effect is so great as to make the trustworthiness delta  
conferred by the AWWW claim of authority be lost in the noise.

>>> As we should aim at deploying the Semantic Web/Web of Data now, I  
>>> also think that we should not wait for future name discovery  
>>> infrastructures, community agreement about naming schemata or the  
>>> like, but use an approach that works now.
>>
>> Yes, but also one that works in the future. Having hidden (as in  
>> not rdf) information about nonstandard interpretations of  
>> vocabulary isn't a robust strategy.
>
> I don't see any nonstandard interpretation of any piece of  
> vocabulary here.

My comment was a reflection of me thinking that you thought there was  
something special about the owl:sameAs link as opposed to some other  
link. See above.

> Maybe you can argue that the TAG work on this is in draft status  
> right now, but I'm pretty sure that it will turn into official  
> findings in the near future without major modifications. These guys  
> have been thinking about this for years now and the dimensions  
> information resource vs. non-information resource and generic
> resource vs specific resource are their solution. Which I don't  
> want to question because I don't want to repeat all these discussions.

The web is young. Some problems will take a long time to figure out.  
I wouldn't be too content with the simple fact of something being a  
TAG finding unless it happened that you also deeply believed it.  
There are a variety of them that I don't believe, which is why I  
continue thinking about the issue.

[...]

>> In your example 1 I wonder whether the author knew of
>> http://zitgist.com/music/artist/d71ba53b-23b0-4870-a429-cce6f345763b
>> before minting http://dbpedia.org/resource/Alec_Empire. If that
>> were the case, I would say it would have been better to have not
>> minted http://dbpedia.org/resource/Alec_Empire, instead reusing
>> the identifier *the author itself* believed to be the proper name
>> of the resource
>
> I think Alec Empire does not know that the Semantic Web exists and  
> therefore also did not mint a URI to himself.
>
> This is kind of a problem in our context.

Wrong author (he's a musician, anyways :) Let me try again to say  
what the situation is that I think is the unnecessary evil.

Sandra initiates a project to write some RDF about Alec Empire.
As part of her research, she notes that
http://zitgist.com/music/artist/d71ba53b-23b0-4870-a429-cce6f345763b
is a proper URI to identify that person.
She creates another name to identify Alec Empire,
http://dbpedia.org/resource/Alec_Empire
She starts making statements about Alec Empire, using
http://dbpedia.org/resource/Alec_Empire as the subject.
One of those statements is
http://dbpedia.org/resource/Alec_Empire owl:sameAs
http://zitgist.com/music/artist/d71ba53b-23b0-4870-a429-cce6f345763b

What was Sandra's mistake? She had a perfectly good identifier for  
Alec Empire but she created an alias anyways.

Author = Sandra.
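
The cost of the alias can be sketched in Python: before a consumer can
collect what is said about Alec Empire, it has to compute the sets of
URIs connected by owl:sameAs and rewrite every statement to one
representative. This is a minimal sketch under stated assumptions: the
properties ex:occupation and ex:name are invented for illustration, and
picking the lexicographically smallest URI as the representative is an
arbitrary convention, not a recommendation.

```python
OWL_SAME_AS = "owl:sameAs"

ZITGIST = "http://zitgist.com/music/artist/d71ba53b-23b0-4870-a429-cce6f345763b"
DBPEDIA = "http://dbpedia.org/resource/Alec_Empire"


def canonical_names(same_as_pairs):
    """Map every URI in the sameAs pairs to one representative of its alias set."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps lookups short
            x = parent[x]
        return x

    for a, b in same_as_pairs:
        ra, rb = find(a), find(b)
        if ra != rb:
            keep, drop = sorted((ra, rb))  # arbitrary choice of representative
            parent[drop] = keep
    return {uri: find(uri) for uri in list(parent)}


def merge(triples):
    """Rewrite all statements onto one name per alias set, dropping the sameAs links."""
    pairs = [(s, o) for s, p, o in triples if p == OWL_SAME_AS]
    canon = canonical_names(pairs)
    return {
        (canon.get(s, s), p, canon.get(o, o))
        for s, p, o in triples
        if p != OWL_SAME_AS
    }


# Statements spread over two names for the same person (properties are hypothetical).
statements = {
    (DBPEDIA, "ex:occupation", "musician"),
    (ZITGIST, "ex:name", "Alec Empire"),
    (DBPEDIA, OWL_SAME_AS, ZITGIST),
}

merged = merge(statements)
subjects = {s for s, _, _ in merged}
print(len(subjects))  # 1
```

Had the existing identifier been reused from the start, the statements
would already share a subject and this step would not be needed.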

>> I'll note another issue: The Creative Commons licenses are about  
>> copyright, which protects expression, not facts. To the extent  
>> that the rdf is data/statement of fact, it is not clear that these  
>> licenses are relevant, and might even be harmful.
>
> Sorry, RDF triples on the Web are clearly not facts but claims by  
> their publisher in the same way that any other Web content also has  
> to be considered as a claim.
>
> If this would not be the case, all trust problems would be solved  
> and we could delete the trust layer from the Semantic Web  
> architecture stack.

I suppose this interpretation is my fault in that I used the term  
fact. Let me try again. There aren't a whole lot of ways of saying  
(in RDF) that the element gold has an atomic weight of 196.96655
(approximately) or that the computer I am writing this on has the
serial number XB0021VJHM1. It is a separate issue whether or not
these statements should be believed. This is different from
recounting the story, in English, of how I came to earn the money
that let me buy this computer, which can be told in many ways. What I
meant was that copyright protects a specific way of telling that story
(the "expression"). It doesn't protect the information that I bought
the computer or that the serial number is XB0021VJHM1 (at least in
the U.S.).

Note however, that IANAL.

>>  I suggest you  consult a lawyer about this usage - or perhaps  
>> discuss it with the  Science Commons counsel - Thinh Nguyen  
>> (thinh@creativecommons.org)
>
> This could be interesting. I did not follow the discussions whether  
> data should be considered an expression and is therefore protected  
> by copyright. But I heard with one ear that there was a ruling in  
> the EC that data is considered an expression in most cases.

Yes, you are right about this, which complicates things. I didn't  
realize that when I wrote my note. That's not the case in the U.S. My  
understanding is that the CC licenses for Europe are being adjusted  
to account for that (they don't currently, if I understand correctly).

> I thought you guys at Creative Commons would take care of following  
> these discussions and provide the community with the right licenses  
> so that we don't have to consult a lawyer for a lot of money.

I *did* include the name of our counsel :-)

> Isn't it by any chance in the scope of the Science Commons project  
> to define the license terms that we would need for our Linking Open  
> Data project?

Actually, I agree with you and have been lobbying for exactly that.  
All I can offer at the moment is
http://sciencecommons.org/resources/faq/databases/
I believe that there will be progress on that in the future but can't
promise anything. I just wanted to alert you to the fact that there
might be an issue.

As usual, thanks for the interesting conversation!

- Alan
Received on Friday, 27 July 2007 05:18:34 UTC