Re: Terminology Question concerning Web Architecture and Linked Data from Chris Bizer on 2007-07-22 (semantic-web@w3.org from July 2007)

From: Chris Bizer <chris@bizer.de>
Date: Sun, 22 Jul 2007 22:29:28 +0200
To: "Alan Ruttenberg" <alanruttenberg@gmail.com>
Cc: "SW-forum Web" <semantic-web@w3.org>, "Linking Open Data" <linking-open-data@simile.mit.edu>, "Jonathan A Rees" <jar@mumble.net>, <www-tag@w3.org>
Message-ID: <002d01c7cc9f$016e7130$c4e84d57@named4gc1asnuj>
Hi Alan,

> Thanks for the more detailed information. While I agree with the need to
> be able to have a mechanism for making statements about URIs that one
> doesn't mint, such as http://www.w3.org/People/Berners-Lee/card#i, what
> I don't follow in your discussion is why such additional statements need
> to be attached to an alias (in the sameAs sense) of the original URI. It
> would seem worth justifying this in the light of the associated costs of
> such aliases
>
> - The lower likelihood of successful "joins" in queries if a) Not all 
> "sameAs"s are available to an agent or b) The agent's reasoner isn't 
> capable of correctly handling sameAs
> - The uncertain semantics of sameAs when taken out of the context of the
> OWL specification.
>
> For instance, why not have e.g. dbpedia only name *resources* which are
> understood as "community statements about" some subject, in which
> statements about tbl would use his designated name for himself?
>

Yes, in a perfect world you are right, but unfortunately, we are not living 
in a perfect world.

DBpedia is a good example of this. We are assigning URIs to 1,600,000 
resources and we don't have a clue which URIs we assign to particular towns, 
molecules, flowers or planets. We don't even know whether we assign URIs to 
flowers at all until we search our dataset for flowers.

We do this because we want to create a useful open dataset in the short 
term. If we waited until there is community agreement in each domain that 
DBpedia covers about a naming schema, or until each of the described 
resources has assigned a URI to itself, we would not get anywhere. Even if 
there were community agreement about naming schemata (which there is not, 
and I do not expect such agreement to evolve in the mid-term future), the 
next problem would be to put some complicated infrastructure in place 
that allows applications like DBpedia to find out that 
http://www.w3.org/People/Berners-Lee/card#i is the only URI that should be 
used to refer to Tim (think about things like URI spam and all the trust 
mechanisms such an infrastructure would need).

So, I think that the approach of assuming that single URIs for identifying 
real-world resources will evolve does not scale for practical reasons.

Evidence for this opinion can be found in the Linking Open Data project 
(http://esw.w3.org/topic/SweoIG/TaskForces/CommunityProjects/LinkingOpenData) 
where most data sources are backed by large legacy databases and it is 
unrealistic to require publishers to find out the only acceptable URI for 
each of their 100 000 data items.

The project is aiming at having hundreds of billions of RDF triples online 
in the mid-term. Think of data sources like Freebase 
(http://www.freebase.com/), 
the Open Library (http://demo.openlibrary.org/) or all public US government 
data 
(http://www.cs.umd.edu/class/spring2006/cmsc838s/data_repositories/repository_us.html).

In such situations, I think it is more realistic from a practical point of 
view to use a two-step process:

1. Allow each data provider to assign his own URIs to resources (not much 
effort for him, just dump his database as Linked Data).
2. Use some equivalence mining algorithms afterwards to find out which URIs 
talk about the same things.

We do a lot of such equivalence mining in the Linking Open Data project and 
it works fine (good enough).
See: 
http://sites.wiwiss.fu-berlin.de/suhl/bizer/pub/LinkedDataTutorial/#autogenerateLinks 
and
http://esw.w3.org/topic/TaskForces/CommunityProjects/LinkingOpenData/EquivalenceMining
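
To make the two-step process above a bit more concrete, here is a minimal 
sketch in Python. It is not the algorithm used in the project. It assumes 
the simplest possible heuristic, matching resources whose normalized labels 
are equal, and the sample data, the example.org URI and all function names 
are invented for illustration.

```python
from collections import defaultdict

OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"


def normalize(label):
    """Lowercase and collapse whitespace so trivially different labels match."""
    return " ".join(label.lower().split())


def mine_equivalences(source_a, source_b):
    """Step 2: return (uri_a, uri_b) pairs whose normalized labels are equal.

    Each source is a dict mapping a provider-minted URI to a label.
    This label-equality heuristic is an assumption made for the sketch.
    """
    index = defaultdict(list)
    for uri_b, label in source_b.items():
        index[normalize(label)].append(uri_b)

    pairs = []
    for uri_a, label in source_a.items():
        for uri_b in index.get(normalize(label), []):
            pairs.append((uri_a, uri_b))
    return pairs


def to_ntriples(pairs):
    """Serialize the mined pairs as owl:sameAs statements in N-Triples."""
    return [f"<{a}> <{OWL_SAME_AS}> <{b}> ." for a, b in pairs]


# Step 1: each provider mints its own URIs (hypothetical sample data).
dbpedia = {
    "http://dbpedia.org/resource/Tim_Berners-Lee": "Tim Berners-Lee",
    "http://dbpedia.org/resource/Berlin": "Berlin",
}
other = {
    "http://www.w3.org/People/Berners-Lee/card#i": "Tim  Berners-Lee",
    "http://example.org/places/4711": "Hamburg",
}

links = mine_equivalences(dbpedia, other)
for line in to_ntriples(links):
    print(line)
```

A real matcher would need stronger evidence than a label (shared 
identifiers, several agreeing properties), because equal labels alone 
produce false links between different things with the same name.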

I agree with you that this approach has a "lower likelihood of successful 
joins", but I would rather mine useful information out of a pile of junk 
than wait until there is community agreement about ontologies and naming 
schemata.

Note that this approach is also taken by Google Base, and these guys are 
already rather successful with it.

I'm also not too concerned about the "agent's reasoner not being capable of 
correctly handling sameAs". I expect that agents and search engines will 
implement reasoners for specific sets of predicates (and owl:sameAs is very 
likely to be in this set). I'm sceptical about general RDF-S/OWL reasoners, 
because it will take a while until they are capable of handling hundreds of 
billions of triples, and this is the amount of data that we need in order 
to be relevant in the light of Web 2.0.
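
As a rough illustration of what a reasoner for one specific predicate could 
look like, here is a small Python sketch that handles only owl:sameAs. It 
groups aliases with a union-find structure and rewrites every triple to use 
one representative URI per group, so that joins across aliases succeed. The 
class and function names, the sample triples and the example.org predicate 
are assumptions made for the example, not part of any existing tool.

```python
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"


class SameAsIndex:
    """Union-find over URIs that are linked by owl:sameAs."""

    def __init__(self):
        self.parent = {}

    def find(self, uri):
        """Return the representative URI of the alias group containing uri."""
        self.parent.setdefault(uri, uri)
        root = uri
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every node on the path directly at the root.
        while self.parent[uri] != root:
            next_uri = self.parent[uri]
            self.parent[uri] = root
            uri = next_uri
        return root

    def union(self, a, b):
        """Merge the alias groups of a and b."""
        root_a, root_b = self.find(a), self.find(b)
        if root_a == root_b:
            return
        # Keep the lexicographically smaller URI so the result is deterministic.
        if root_b < root_a:
            root_a, root_b = root_b, root_a
        self.parent[root_b] = root_a

    def canonical(self, term):
        """Map a term to its representative; unknown terms are left unchanged."""
        return self.find(term) if term in self.parent else term


def smush(triples):
    """Rewrite triples so that all sameAs aliases share one representative."""
    index = SameAsIndex()
    for s, p, o in triples:
        if p == OWL_SAME_AS:
            index.union(s, o)
    return {
        (index.canonical(s), p, index.canonical(o))
        for s, p, o in triples
        if p != OWL_SAME_AS
    }


TIM_W3C = "http://www.w3.org/People/Berners-Lee/card#i"
TIM_DBP = "http://dbpedia.org/resource/Tim_Berners-Lee"

# Hypothetical sample triples; the birthPlace predicate is made up.
triples = [
    (TIM_W3C, OWL_SAME_AS, TIM_DBP),
    (TIM_W3C, "http://xmlns.com/foaf/0.1/name", "Tim Berners-Lee"),
    (TIM_DBP, "http://example.org/vocab/birthPlace", "London"),
]

merged = smush(triples)
```

After smushing, both statements about Tim have the same subject, so a query 
that joins on the subject no longer depends on which alias a data source 
happened to use. This covers only the identity aspect of owl:sameAs and 
nothing else from OWL, which is the point of a predicate-specific reasoner.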

Cheers

Chris


> -Alan
>

On Jul 20, 2007, at 9:02 AM, Chris Bizer wrote:

> Hi Alan,
>
>> However, I am curious to know what you were asking, so if you do, I will
>> be appreciative.
>
> My question was aiming more into the direction of how AWWW and OWL 
> terminology plays together.
>
> owl:sameAs is defined as "The built-in OWL property owl:sameAs links an
> individual to an individual. Such an owl:sameAs statement indicates that
> two URI references actually refer to the same thing: the individuals have
> the same "identity"." (http://www.w3.org/TR/owl-ref/#sameAs-def)
>
> There was a long discussion and a lot of confusion on the SemWeb list
> about two weeks ago about whether owl:sameAs is the right predicate to
> use to indicate that two URIs refer to the same "thing", with "thing"
> being an OWL term that does not exist in AWWW terminology.
>
> So, if the answer to my first question had been that the different
> URIs for Tim refer to different resources, there would have been a
> problem with "referring to the same thing". But as Dan's answer to my
> first question indicated that the different URIs refer to the same
> non-information resource, meaning that they are URI aliases, there is no
> problem and I see the issue as being closed.
>
> Building on this, we argue in section 1.1 of our Linked Data tutorial
> (http://sites.wiwiss.fu-berlin.de/suhl/bizer/pub/LinkedDataTutorial/#aliases)
> that URI aliases provide an important social function to the Web as they
> are dereferenced to different descriptions of the same non-information
> resource and thus allow different views and opinions to be expressed.
>
> Which is an interesting conclusion as it conflicts with the AWWW view
> that URI aliases are harmful.
> See http://www.w3.org/TR/webarch/#uri-aliases
>
> Cheers
>
> Chris
>
>
> --
> Chris Bizer
> Freie Universität Berlin
> Phone: +49 30 838 54057
> Mail: chris@bizer.de
> Web: www.bizer.de
>
> ----- Original Message ----- From: "Alan Ruttenberg"
> <alanruttenberg@gmail.com>
> To: "Chris Bizer" <chris@bizer.de>
> Cc: "Dan Connolly" <connolly@w3.org>; <www-tag@w3.org>; "SW-forum Web"
> <semantic-web@w3.org>; "Linking Open Data"
> <linking-open-data@simile.mit.edu>; "Jonathan A Rees" <jar@mumble.net>
> Sent: Friday, July 20, 2007 2:28 PM
> Subject: Re: Terminology Question concerning Web Architecture and Linked
> Data
>
>
> Hi Chris,
>
> Your assessment is perfectly reasonable. I was thrown off by the
> question you initially asked:
>
>> Question 3: Depending on the answer to question 1, is it correct to use
>> owl:sameAs [6] to state that http://www.w3.org/People/Berners-Lee/card#i
>> and http://dbpedia.org/resource/Tim_Berners-Lee refer to the same
>> thing as it is done in Tim's profile.
>
> Given that you didn't intend the sense of "correct" that I thought
> (recall that I was guessing, from context, which sense of correct you
> meant in your question), which sense of "correct" did you mean? Or to
> phrase it another way, if one were to answer the question "no", what
> sort of evidence would you accept to support that answer.
>
> This isn't a matter of philosophy, it's a matter of communication. I
> really don't know what you are asking. Another way to accomplish the
> communication would be to rephrase the question without using the
> word "correct".
>
> I don't mean to suggest you are obligated to clarify this for me.
> However, I am curious to know what you were asking, so if you do, I
> will be appreciative.
>
> -Alan
>
>
>
> On Jul 20, 2007, at 3:55 AM, Chris Bizer wrote:
>
>> Hi Alan,
>>
>> I'm not a philosopher, but I have the feeling that the concept "correct"
>> in the sense of matching reality does not really apply to the Semantic
>> Web setting.
>>
>> We are talking about machines that are supposed to process data from
>> different sources. There is no such thing as "reality" for a machine.
>> For the machine there is only data! (or knowledge if you prefer this
>> term)
>>
>> Therefore the question for the machine is: Should it trust a specific
>> piece of data or not? Or more precisely, how can it assess the quality of
>> the data to a point where it matches the quality requirements of the
>> user (human).
>>
>> There are lots of different heuristics that a machine can apply to
>> assess information quality, including content-based, context-based, and
>> rating-based heuristics.
>>
>> For more details than you ever wanted to hear, please refer to my PhD
>> thesis titled "Quality-driven Information Filtering in the Context of
>> Web-based Information System"
>> http://sites.wiwiss.fu-berlin.de/suhl/bizer/pub/DisertationChrisBizer.pdf
>>
>> Cheers
>>
>> Chris
>>
>> --
>> Chris Bizer
>> Freie Universität Berlin
>> +49 30 838 54057
>> chris@bizer.de
>> www.bizer.de
>> ----- Original Message ----- From: "Alan Ruttenberg"
>> <alanruttenberg@gmail.com>
>> To: "Dan Connolly" <connolly@w3.org>
>> Cc: "Chris Bizer" <chris@bizer.de>; <www-tag@w3.org>; "SW-forum Web"
>> <semantic-web@w3.org>; "Linking Open Data"
>> <linking-open-data@simile.mit.edu>; "Jonathan A Rees" <jar@mumble.net>
>> Sent: Friday, July 20, 2007 4:52 AM
>> Subject: Re: Terminology Question concerning Web Architecture and Linked
>> Data
>>
>>
>>> On Jul 10, 2007, at 1:08 PM, Dan Connolly wrote:
>>>> On Sat, 2007-07-07 at 14:43 +0200, Chris Bizer wrote:
>>>>
>>>>> Question 3: Depending on the answer to question 1, is it correct to
>>>>> use owl:sameAs [6] to state that
>>>>> http://www.w3.org/People/Berners-Lee/card#i and
>>>>> http://dbpedia.org/resource/Tim_Berners-Lee refer to the same thing
>>>>> as it is done in Tim's profile.
>>>>
>>>> Yes...
>>>>
>>>> That's sort of a circular question. It's correct because Tim says
>>>> it's correct, and he owns that name.
>>>
>>> That's not the usual sense of "correct". In this context, I believe
>>> that the wordnet sense of "correct" that is intended is
>>> "free from error; especially conforming to fact or truth"
>>>
>>> Or Wikipedia: "In everyday use, the correctness of a statement is
>>> determined by whether or not it matches reality. People can think a
>>> statement is correct and be wrong."
>>>
>>> If I had a profile that said, in effect, that I was president of the
>>> United States, then that would be incorrect regardless of whether I
>>> owned the name (I am taking the "owned name" that you are referring to
>>> to be http://www.w3.org/People/Berners-Lee/card#i since that's the
>>> only name in the vicinity that Tim could correctly claim to be owned
>>> by him).
>>>
>>> If I'm using the wrong sense of "correct", perhaps you could provide
>>> me a definition of "correct" by which I could understand your claim.
>>>
>>> Regards,
>>>
>>> -Alan
>>
>
Received on Sunday, 22 July 2007 20:29:46 UTC