Re: Terminology Question concerning Web Architecture and Linked Data from Alan Ruttenberg on 2007-07-26 (semantic-web@w3.org from July 2007)

From: Alan Ruttenberg <alanruttenberg@gmail.com>
Date: Thu, 26 Jul 2007 01:00:46 -0400
To: Chris Bizer <chris@bizer.de>
Cc: "SW-forum Web" <semantic-web@w3.org>, "Linking Open Data" <linking-open-data@simile.mit.edu>, "Jonathan A Rees" <jar@mumble.net>, <www-tag@w3.org>
Message-Id: <26D5DD9F-D68F-45FA-9588-C3016B6A2A26@gmail.com>
On Jul 23, 2007, at 3:23 AM, Chris Bizer wrote:

> Hi Alan,
>
> very fruitful discussion. Thanks for challenging me on this point :-)
>
>> So you have two novel claims:
>>
>> 1) It is better to mint your own URI than to use one that you know  
>> to identify the same resource.
>> 2) It is better to attach "different views and opinions" about a   
>> known resource to a newly minted URI that you state is owl:sameAs   
>> some other rather than using an alternative mechanism for doing  
>> so,  one of which might be the one I suggested.
>
> I basically see four arguments in favour of my point:
>
> 1. Practicability: There is no commonly accepted infrastructure in  
> place that allows applications to find out the single URI that  
> should be used by everybody to identify a resource. There are lots  
> of real-world object and abstract concepts that do not have URIs  
> yet, so you have to mint URIs for them yourself anyway. Also as  
> Christopher Brewster pointed out yesterday, all approaches that  
> assumed using single identifiers have failed throughout history so  
> far.

This is the necessary evil argument. I accept that this practice is
necessary in some cases. It was a smart architectural decision not to have
the unique name assumption for the Semantic Web, and for OWL to
provide the ability to take advantage of that state of affairs. But
that doesn't mean that one shouldn't put some serious thought into
the practical difficulties of dealing with a world where there are
multiple names for things, and therefore use the capability sparingly.

> 2. Provenance Tracking: If you mint your own URIs you can back them  
> up with RDF descriptions, which makes it easy to track who said  
> what on the Semantic Web, as there is only one authoritative  
> information provider for each URI.

I would prefer an explicit mechanism for tracking provenance, such as  
a vocabulary and protocol for doing so, rather than one which  
conflicts with another element of your usage. owl:sameAs means  
indistinguishable - the exact same thing. Better to have objects
which are explicitly "commentaries about the thing" (rather than
aliases of the thing), stated by someone at some time by some
authority etc, than to overload the naming mechanism, IMO.

> 3. Discovery: When you know that two URIs refer to the same non- 
> information resource, it is extremely easy and does not require any  
> new technical infrastructure to retrieve information about this  
> resource from the Web: Just dereference both URIs.

When you know. When you don't, you miss an opportunity to "join",  
i.e. gather all the information about an entity you might be  
interested in.
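
To make the missed "join" concrete, here is a minimal Python sketch (not
anything from the tutorial). It groups statements by subject after merging
whichever owl:sameAs pairs the agent happens to know about. The two URIs are
the ones from example 1; the property names and values are hypothetical
placeholders, and the sketch only looks at the subject position of a triple.

```python
from collections import defaultdict

DBPEDIA = "http://dbpedia.org/resource/Alec_Empire"
ZITGIST = "http://zitgist.com/music/artist/d71ba53b-23b0-4870-a429-cce6f345763b"

# Hypothetical statements, one published under each URI.
TRIPLES = [
    (DBPEDIA, "ex:p1", "value published by dbpedia"),
    (ZITGIST, "ex:p2", "value published by zitgist"),
]


def find(parent, uri):
    """Return the representative URI of the alias class containing uri."""
    parent.setdefault(uri, uri)
    while parent[uri] != uri:
        parent[uri] = parent[parent[uri]]  # path halving
        uri = parent[uri]
    return uri


def join_descriptions(triples, known_same_as):
    """Group (property, value) pairs by subject, merging known aliases.

    known_same_as holds only the owl:sameAs pairs this agent has seen;
    any alias it has not seen stays a separate subject.
    """
    parent = {}
    for a, b in known_same_as:
        root_a, root_b = find(parent, a), find(parent, b)
        if root_a != root_b:
            # The lexically smaller URI becomes the representative.
            parent[max(root_a, root_b)] = min(root_a, root_b)
    merged = defaultdict(set)
    for subject, prop, value in triples:
        merged[find(parent, subject)].add((prop, value))
    return dict(merged)


with_link = join_descriptions(TRIPLES, [(DBPEDIA, ZITGIST)])
without_link = join_descriptions(TRIPLES, [])
print(len(with_link), len(without_link))
```

With the sameAs pair available, both statements end up under one subject;
without it, the agent sees two unrelated subjects and the join is lost.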

> 4. Information Quality: Information providers will not set  
> owl:sameAs links to minor quality information provided by somebody  
> else about the same non-information resource. Therefore setting a  
> owl:sameAs link implies a quality judgement and a client can use  
> these judgements to assess information quality using an algorithm  
> like PageRank.

There is no basis for this assertion. sameAs isn't a statement about  
information quality. It is a statement about identity. You could
argue, within a community, that it should serve this purpose, and then,
with adequate advertising, agree among yourselves to do things this way,
but I think that asserting that this is the case in a tutorial for naive
users is somewhat misleading. Certainly this will not be understood
by the SW community at large.

> I also do not say that you should always mint your own URIs. Note
> that we also have an example where somebody reuses an existing URI
> and provides non-authoritative information about a resource within
> our Linked Data tutorial
> (http://sites.wiwiss.fu-berlin.de/suhl/bizer/pub/LinkedDataTutorial/#deref).

Just so you know where I stand, I find this whole "authority"  
business confusing and ripe for misunderstanding, AWWW or not. It is  
clear to me that authority doesn't define meaning or correctness.  
Absent that, we need to be careful about saying exactly what this  
authority confers. In your tutorial section 5 I honestly don't  
understand what differs about the status of the statements in the  
authoritative versus non-authoritative versions.

> I'm also not completely clear about which approach is better in
> which situations. This would be something very interesting to  
> discuss here on the list.
>
> I just say that there are situations where minting your own URIs and
> interlinking them later with automated algorithms is more  
> practical. At least it is more practical in the situation we are  
> facing in the Linking Open Data project.

Again, this is the necessary evil stance, which I think is warranted,  
as you do.

> As we should aim at deploying the Semantic Web/Web of Data now, I  
> also think that we should not wait for future name discovery  
> infrastructures, community agreement about naming schemata or the  
> like, but use an approach that works now.

Yes, but also one that works in the future. Having hidden (as in not  
rdf) information about nonstandard interpretations of vocabulary  
isn't a robust strategy.

>> rather than using an alternative mechanism for doing so,  one of  
>> which might be the one I suggested.
>
> Alan sorry, which mechanism did you suggest?

I suggested, for instance, having dbpedia only make
statements associated with *information resources* which are
understood as "community statements about" some subject. It is like
your non-authoritative description, augmenting the metadata with a  
more explicit description of the intention, provenance, quality,  
mechanism by which the statements were gathered and other information  
that an agent could use to decide how to interpret the statements  
therein. If you were being extra careful you would have another  
resource for the data, and point to it so that one could clearly  
choose, based on the metadata, whether or not to dereference or use  
the "description".
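
One way to picture this, as a rough Python sketch under assumptions of my
own: each "description" carries explicit metadata and points at a separate
resource holding the statements, so an agent can decide from the metadata
alone whether to dereference it. The field names, the example.org URLs and
the acceptance policy are all hypothetical; no existing vocabulary is implied.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Description:
    about: str        # URI of the subject the statements are about
    data_url: str     # separate resource that holds the statements
    stated_by: str    # who is making the statements
    gathered_by: str  # mechanism by which the statements were gathered


def select_data_urls(descriptions, subject, accept):
    """Return data URLs for subject whose metadata passes the agent's policy."""
    return [d.data_url for d in descriptions
            if d.about == subject and accept(d)]


TBL = "http://www.w3.org/People/Berners-Lee/card#i"

# Hypothetical descriptions; the example.org URLs are placeholders.
DESCRIPTIONS = [
    Description(TBL, "http://example.org/community/tbl-data",
                stated_by="community", gathered_by="automated extraction"),
    Description(TBL, "http://example.org/owner/tbl-data",
                stated_by="name owner", gathered_by="hand authored"),
]

# Example policy: this agent skips automatically extracted statements.
chosen = select_data_urls(
    DESCRIPTIONS, TBL, lambda d: d.gathered_by != "automated extraction")
print(chosen)
```

The subject keeps a single name throughout; what varies is the description,
and the reasons for trusting or ignoring it are stated in data rather than
implied by which alias was used.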

In your example 1 I wonder whether the author knew of
http://zitgist.com/music/artist/d71ba53b-23b0-4870-a429-cce6f345763b before
minting http://dbpedia.org/resource/Alec_Empire. If that were the
case, I would say it would have been better to have not minted
http://dbpedia.org/resource/Alec_Empire, instead reusing the
identifier *the author itself* believed to be the proper name of the
resource (that's what you are asserting with the owl:sameAs). To
put it more harshly, doing otherwise would be an unnecessary evil.

--

I'll note another issue: The Creative Commons licenses are about  
copyright, which protects expression, not facts. To the extent that  
the rdf is data/statement of fact, it is not clear that these  
licenses are relevant, and might even be harmful. I suggest you  
consult a lawyer about this usage - or perhaps discuss it with the  
Science Commons counsel - Thinh Nguyen (thinh@creativecommons.org)

I'm hoping to have some time to take a closer look at the whole  
tutorial and get back to you with some more comments. (and don't get  
me wrong: I'm a big fan of doing this project  - but I want the  
effort to be the start of a truly scalable enterprise)

-Alan


>
> Cheers
>
> Chris
>
>
> --
> Chris Bizer
> Freie Universität Berlin
> +49 30 838 54057
> chris@bizer.de
> www.bizer.de
> ----- Original Message ----- From: "Alan Ruttenberg"  
> <alanruttenberg@gmail.com>
> To: "Chris Bizer" <chris@bizer.de>
> Cc: "SW-forum Web" <semantic-web@w3.org>; "Linking Open Data"  
> <linking-open-data@simile.mit.edu>; "Jonathan A Rees"  
> <jar@mumble.net>; <www-tag@w3.org>
> Sent: Monday, July 23, 2007 3:16 AM
> Subject: Re: Terminology Question concerning Web Architecture and  
> Linked Data
>
>
> Hi Chris,
>
> While you outline an interesting problem, it doesn't address the
> question I asked. Specifically, you said:
>
> On Jul 20, 2007, at 9:02 AM, Chris Bizer wrote:
>> we argue in section 1.1 of our Linked Data tutorial
>> (http://sites.wiwiss.fu-berlin.de/suhl/bizer/pub/LinkedDataTutorial/#aliases)
>> that URI aliases provide an important social function
>> to  the Web as they are dereferenced to different descriptions of  
>> the  same non-information resource and thus allow different views  
>> and  opinions to be expressed.
>>
>> Which is an interesting conclusion as it conflicts with the AWWW   
>> view that URI aliases are harmful.
>> See http://www.w3.org/TR/webarch/#uri-aliases
>
> In other words, these aliases are not simply a necessary evil, but a
> positive good. This was the claim I was (and am still) testing.
>
> In particular, your advice is that when providers know of the
> existence of alternate URIs, they note this with owl:sameAs,
> implicitly recommending this mechanism rather than the alternative of
> simply using an already minted URI that denotes the same thing.
>
> To my mind it might make more sense to do the latter, and it is to
> this that the webarch reference you note speaks.
>
> So you have two novel claims:
>
> 1) It is better to mint your own URI than to use one that you know to
> identify the same resource.
> 2) It is better to attach "different views and opinions" about a
> known resource to a newly minted URI that you state is owl:sameAs
> some other rather than using an alternative mechanism for doing so,
> one of which might be the one I suggested.
>
> Do I read you wrong?
>
> -Alan
>
>
> On Jul 22, 2007, at 4:29 PM, Chris Bizer wrote:
>
>> Hi Alan,
>>
>>> Thanks for the more detailed information. While I agree with the   
>>> need to be able to have a mechanism for making statements about   
>>> URIs that one doesn't mint, such as
>>> http://www.w3.org/People/Berners-Lee/card#i, what I don't follow in your discussion is
>>> why such additional statements need to be attached to an alias   
>>> (in the sameAs sense) of  the original URI. It would seem worth   
>>> justifying this in the light of  the associated costs of such  
>>> aliases
>>>
>>> - The lower likelihood of successful "joins" in queries if a)  
>>> Not  all "sameAs"s are available to an agent or b) The agent's  
>>> reasoner  isn't capable of correctly handling sameAs
>>> - The uncertain semantics of sameAs when taken out of the  
>>> context  of the OWL specification.
>>>
>>> For instance, why not have e.g. dbpedia only name *resources*   
>>> which  are understood as "community statements about" some   
>>> subject, in which statements about tbl would use his designated   
>>> name for himself?
>>>
>>
>> Yes, in a perfect world you are right, but unfortunately, we are   
>> not living in a perfect world.
>>
>> DBpedia is a good example for this. We are assigning URIs to   
>> 1,600,000 resources and we don't have a clue which URIs we assign   
>> to some town, molecules, flowers or planets. We even don't know  
>> if  we assign URIs to flowers at all, before we search within our   
>> dataset for flowers.
>>
>> We do this because we want to create a useful open dataset in the   
>> short term. If we would wait until there is community agreement  
>> in  each domain that DBpedia covers about a naming schema or wait  
>> until  each of the described resources has assigned a URI to  
>> itself, we  won't get anywhere. If there would be community  
>> agreement about  naming schemata (which there is not and I also do  
>> not expect such  agreement to evolve in the mid-future), the next  
>> problem would be  to bring some complicated infrastructure into  
>> place that allows  applications like DBpedia to find out that  
>> http://www.w3.org/People/Berners-Lee/card#i is the only URI that
>> should be used to refer to Tim  (think about stuff like URI SPAM  
>> and all the trust mechanism such  an infrastructure would need).
>>
>> So, I think that the approach of assuming that single URIs for  
>> identifying real-world resources will evolve does not scale for   
>> practical reasons.
>>
>> Evidence for this opinion can be found in the Linking Open Data
>> project
>> (http://esw.w3.org/topic/SweoIG/TaskForces/CommunityProjects/LinkingOpenData)
>> where most datasources are
>> backed by large legacy databases and it is unrealistic to require   
>> publishers to find out the only acceptable URI for each of their   
>> 100 000 data items.
>>
>> The project is aiming at having hundreds of billions of RDF triples
>> online in the mid-term. Think of data sources like Freebase
>> (http://www.freebase.com/), the Open Library
>> (http://demo.openlibrary.org/) or all public US government data
>> (http://www.cs.umd.edu/class/spring2006/cmsc838s/data_repositories/repository_us.html).
>>
>> In such situations, I think it is more realistic from the  
>> practical  point of view to use a two step process:
>>
>> 1. Allow each data provider to assign his own URIs to resources   
>> (not much effort for him, just dump his database as Linked Data).
>> 2. Use some equivalence mining algorithms afterwards to find out   
>> which URIs talk about the same things.
>>
>> We do a lot of such equivalence mining in the Linking Open Data   
>> project and it works fine (good enough).
>> See: http://sites.wiwiss.fu-berlin.de/suhl/bizer/pub/LinkedDataTutorial/#autogenerateLinks and
>> http://esw.w3.org/topic/TaskForces/CommunityProjects/LinkingOpenData/EquivalenceMining
>>
>> I agree with you that this approach has a "lower likelihood of   
>> successful "joins"", but I rather prefer to data mine useful   
>> information out of a pile of junk than to wait until there is   
>> community agreement about ontologies and naming schemata.
>>
>> Note, that this approach is also taken by Google Base and these   
>> guys are already rather successful with it.
>>
>> I'm also not too concerned about the "agent's reasoner not being
>> capable of correctly handling sameAs". I expect that agents and
>> search engines will implement reasoners for specific sets of
>> predicates (and owl:sameAs is very likely to be in this set). I'm
>> sceptical about general RDF-S/OWL reasoners, because it will take
>> a while until they are capable of handling hundreds of billions of
>> triples and this is the amount of data that we need in order to
>> be relevant in the light of Web 2.0.
>>
>> Cheers
>>
>> Chris
>>
>>
>>> -Alan
>>>
>>
>> On Jul 20, 2007, at 9:02 AM, Chris Bizer wrote:
>>
>>> Hi Alan,
>>>
>>>> However, I am curious to know what you were asking, so if you   
>>>> do,  I will be appreciative.
>>>
>>> My question was aiming more into the direction of how AWWW and  
>>> OWL terminology plays together.
>>>
>>> owl:sameAs is defined as "The built-in OWL property owl:sameAs
>>> links an individual to an individual. Such an owl:sameAs
>>> statement indicates that two URI references actually refer to
>>> the same thing: the individuals have the same
>>> "identity"" (http://www.w3.org/TR/owl-ref/#sameAs-def)
>>>
>>> There was a long discussion and a lot of confusion on the  
>>> SemWeb   list about two weeks ago whether owl:sameAs is the  
>>> right  predicate  that should be used to indicate that two URIs  
>>> refer to  the same  "thing". With "thing" being a OWL term that  
>>> does not  exist in AWWW  terminology.
>>>
>>> So, if the answer to my first question would have been that the
>>> different URIs for Tim refer to different resources, there
>>> would have been a problem with "referring to the same thing".
>>> But as  Dan's  answer to my first question indicated that the  
>>> different  URIs refer  to the same non-information resource,  
>>> meaning that  they are URI  aliases, there is no problem and I  
>>> see the issue as  being closed.
>>>
>>> Building on this, we argue in section 1.1 of our Linked Data
>>> tutorial
>>> (http://sites.wiwiss.fu-berlin.de/suhl/bizer/pub/LinkedDataTutorial/#aliases)
>>> that URI aliases provide an
>>> important social function to the Web as they are dereferenced to   
>>> different descriptions of the same non-information resource and   
>>> thus allow different views and opinions to be expressed.
>>>
>>> Which is an interesting conclusion as it conflicts with the  
>>> AWWW   view that URI aliases are harmful.
>>> See http://www.w3.org/TR/webarch/#uri-aliases
>>>
>>> Cheers
>>>
>>> Chris
>>>
>>>
>>> --
>>> Chris Bizer
>>> Freie Universität Berlin
>>> Phone: +49 30 838 54057
>>> Mail: chris@bizer.de
>>> Web: www.bizer.de
>>>
>>> ----- Original Message ----- From: "Alan Ruttenberg"  
>>> <alanruttenberg@gmail.com>
>>> To: "Chris Bizer" <chris@bizer.de>
>>> Cc: "Dan Connolly" <connolly@w3.org>; <www-tag@w3.org>;
>>> "SW-forum Web" <semantic-web@w3.org>; "Linking Open Data"
>>> <linking-open-data@simile.mit.edu>; "Jonathan A Rees" <jar@mumble.net>
>>> Sent: Friday, July 20, 2007 2:28 PM
>>> Subject: Re: Terminology Question concerning Web Architecture and  
>>> Linked Data
>>>
>>>
>>> Hi Chris,
>>>
>>> Your assessment is perfectly reasonable. I was thrown off by the
>>> question you initially asked:
>>>
>>>> Question 3: Depending on the answer to question 1, is it
>>>> correct to use owl:sameAs [6] to state that
>>>> http://www.w3.org/People/Berners-Lee/card#i and
>>>> http://dbpedia.org/resource/Tim_Berners-Lee refer to the same
>>>> thing as it is done in Tim's profile.
>>>
>>> Given that you didn't intend the sense of "correct" that I thought
>>> (recall that I was guessing, from context, which sense of correct  
>>> you
>>> meant in your question), which sense of "correct" did you mean?  
>>> Or to
>>> phrase it another way, if one were to answer the question "no", what
>>> sort of evidence would you accept to support that answer.
>>>
>>> This isn't a matter of philosophy, it's a matter of communication. I
>>> really don't know what you are asking. Another way to accomplish the
>>> communication would be to rephrase the question without using the
>>> word "correct".
>>>
>>> I don't mean to suggest you are obligated to clarify this for me.
>>> However, I am curious to know what you were asking, so if you do, I
>>> will be appreciative.
>>>
>>> -Alan
>>>
>>>
>>>
>>> On Jul 20, 2007, at 3:55 AM, Chris Bizer wrote:
>>>
>>>> Hi Alan,
>>>>
>>>> I'm not a philosopher, but I have the feeling that the concept
>>>> "correct" in a sense of matching reality does not really apply
>>>> to the Semantic Web setting.
>>>>
>>>> We are talking about machines that are supposed to process  
>>>> data   from different sources. There is no such thing as  
>>>> "reality" for  a  machine. For the machine there is only data!  
>>>> (or knowledge if  you  prefer this term)
>>>>
>>>> Therefore the question for the machine is: Should it trust a    
>>>> specific piece of data or not? Or more precisely how can it   
>>>> assess  the quality of the data to a point where it matches the   
>>>> quality  requirements of the user (human).
>>>>
>>>> There are lots of different heuristics that a machine can apply   
>>>> to assess information quality, including content-based,  
>>>> context-  based, rating-based heuristics.
>>>>
>>>> For more details than you ever wanted to hear, please refer to
>>>> my PhD thesis titled "Quality-driven Information Filtering in
>>>> the Context of Web-based Information System"
>>>> http://sites.wiwiss.fu-berlin.de/suhl/bizer/pub/DisertationChrisBizer.pdf
>>>>
>>>> Cheers
>>>>
>>>> Chris
>>>>
>>>> --
>>>> Chris Bizer
>>>> Freie Universität Berlin
>>>> +49 30 838 54057
>>>> chris@bizer.de
>>>> www.bizer.de
>>>> ----- Original Message ----- From: "Alan Ruttenberg"  
>>>> <alanruttenberg@gmail.com>
>>>> To: "Dan Connolly" <connolly@w3.org>
>>>> Cc: "Chris Bizer" <chris@bizer.de>; <www-tag@w3.org>;
>>>> "SW-forum Web" <semantic-web@w3.org>; "Linking Open Data"
>>>> <linking-open-data@simile.mit.edu>; "Jonathan A Rees" <jar@mumble.net>
>>>> Sent: Friday, July 20, 2007 4:52 AM
>>>> Subject: Re: Terminology Question concerning Web Architecture   
>>>> and Linked Data
>>>>
>>>>
>>>>> On Jul 10, 2007, at 1:08 PM, Dan Connolly wrote:
>>>>>> On Sat, 2007-07-07 at 14:43 +0200, Chris Bizer wrote:
>>>>>>
>>>>>>> Question 3: Depending on the answer to question 1, is it
>>>>>>> correct to use owl:sameAs [6] to state that
>>>>>>> http://www.w3.org/People/Berners-Lee/card#i and
>>>>>>> http://dbpedia.org/resource/Tim_Berners-Lee refer to the
>>>>>>> same thing as it is done in Tim's profile.
>>>>>>
>>>>>> Yes...
>>>>>>
>>>>>> That's sort of a circular question. It's correct because  
>>>>>> Tim    says it's correct, and he owns that name.
>>>>>
>>>>> That's not the usual sense of "correct". In this context, I    
>>>>> believe that the wordnet sense of "correct" that is intended is
>>>>> "free from error; especially conforming to fact or truth"
>>>>>
>>>>> Or Wikipedia: "In everyday use, the correctness of a statement   
>>>>> is determined by whether or not it matches reality. People can   
>>>>> think  a statement is correct and be wrong."
>>>>>
>>>>> If I had a profile that said, in effect, that I was president   
>>>>> of  the United States, then that would be incorrect regardless   
>>>>> of whether I owned the name (I am taking the "owned name"
>>>>> that you are referring to to be
>>>>> http://www.w3.org/People/Berners-Lee/card#i since that's the
>>>>> only name in the vicinity that Tim could correctly claim to be
>>>>> owned by him).
>>>>>
>>>>> If I'm using the wrong sense of "correct", perhaps you could    
>>>>> provide me a definition of "correct" by which I could   
>>>>> understand  your claim.
>>>>>
>>>>> Regards,
>>>>>
>>>>> -Alan
>>>>
>>>
>>
>>
>
Received on Thursday, 26 July 2007 05:00:43 UTC