Expliciting same-ness rules Re: [Linking-open-data] Terminology Question concerning Web Architecture and Linked Data from Bernard Vatant on 2007-07-09 (semantic-web@w3.org from July 2007)

From: Bernard Vatant <bernard.vatant@mondeca.com>
Date: Mon, 09 Jul 2007 10:17:40 +0200
To: Linking Open Data <linking-open-data@simile.mit.edu>
Cc: semantic-web@w3.org
Message-ID: <4691EF24.6030000@mondeca.com>
Hi Chris
> Here is the problem statement together with an example: Within the Linking 
> Open Data community project [2] different data sources (URI owners) publish 
> information about Tim Berners-Lee ...
There are strong implicit assumptions underlying this. Making them 
explicit might help to answer your questions.

1. Uniqueness of the thing/subject you speak about. You assume there is 
one Tim Berners-Lee. This includes, but goes beyond, simple homonymy 
issues. One might want to distinguish 
"Tim-Berners-Lee-the-private-person" from 
"Tim-Berners-Lee-the-public-person". So if the uniqueness of Tim 
Berners-Lee is taken for granted, you should be explicit about it.
2. To put together the following URIs, you have applied rules, or some 
heuristics, to discover that two URIs are "aliases" of the same 
non-information resource. It seems to me that those rules/heuristics 
should be exposed explicitly. We are in the Linking Open Data process. 
The data are open, and so should be the rules used to link them. Open 
means the rules are explicit and exposed, so that anyone can reproduce 
their behaviour and decide whether or not to play by those rules. The 
exact opposite is e.g. Google News, where resources are gathered and 
displayed as being "related to the same event", without any explicit 
statement about how this event is identified (let alone selected to 
appear or not), or how a resource is considered to be "about this 
event". Of course, Google does not expose its smart algorithms, but at 
least it's clear that they exist and are implemented somehow.

When you apply such rules to structured data, they could/should be 
expressed formally in whatever relevant language, e.g., as SPARQL 
CONSTRUCT queries if all data are RDF.
The conditions under which you consider you can safely declare a:foo 
owl:sameAs b:bar certainly rely on elements of the descriptions of a:foo 
and b:bar, e.g., equality of type (Person) + first name + family name + 
birth date + birth place. These elements may be present in the two 
descriptions using the same properties, or using different ones that 
your heuristics assume to be equivalent.
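
To make this concrete, here is a minimal Python sketch of such a rule, 
written so that anyone could read and rerun it. It is only an illustration 
under stated assumptions: the property names (ex:firstName, ex:surname, 
ex:born, ...), the mapping table and the function names are all 
hypothetical, and I read the name criteria as given name plus family name. 
The same rule could equally be written as a SPARQL CONSTRUCT query.

```python
# Hypothetical mapping from source properties to shared keys. This table is
# the "properties your heuristics assume to be equivalent", made explicit.
EQUIVALENT_PROPERTIES = {
    "rdf:type": "type",
    "foaf:givenName": "given_name",
    "ex:firstName": "given_name",
    "foaf:familyName": "family_name",
    "ex:surname": "family_name",
    "ex:birthDate": "birth_date",
    "ex:born": "birth_date",
    "ex:birthPlace": "birth_place",
}

# Every one of these keys must be present and equal on both sides.
REQUIRED_KEYS = ("type", "given_name", "family_name", "birth_date", "birth_place")


def normalise(description):
    """Rewrite a description onto the shared keys, dropping unmapped properties."""
    return {
        EQUIVALENT_PROPERTIES[prop]: value
        for prop, value in description.items()
        if prop in EQUIVALENT_PROPERTIES
    }


def same_as(uri_a, desc_a, uri_b, desc_b):
    """Return an owl:sameAs triple if all required keys match, else None."""
    a, b = normalise(desc_a), normalise(desc_b)
    for key in REQUIRED_KEYS:
        if key not in a or key not in b or a[key] != b[key]:
            return None
    return (uri_a, "owl:sameAs", uri_b)


# Invented descriptions of a:foo and b:bar using different vocabularies.
foo = {"rdf:type": "Person", "foaf:givenName": "Ada", "foaf:familyName": "Example",
       "ex:birthDate": "1900-01-01", "ex:birthPlace": "Nowhere"}
bar = {"rdf:type": "Person", "ex:firstName": "Ada", "ex:surname": "Example",
       "ex:born": "1900-01-01", "ex:birthPlace": "Nowhere"}
print(same_as("a:foo", foo, "b:bar", bar))
```

The point of the sketch is that both the required criteria and the assumed 
property equivalences sit in plain data, so a reader can inspect them and 
decide whether to accept the resulting assertions.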
When we set up, for Geonames, the owl:sameAs assertions between Geonames 
URIs and INSEE URIs for administrative entities in France, the heuristic 
was based on matching the typing properties on both sides (INSEE 
Class Region <=> Geonames fcode ADM1, INSEE Class Departement <=> Geonames 
fcode ADM2, etc.), then matching names (including dealing with case and 
special-character issues), and resolving homonymy cases based on the 
administrative hierarchy.
Granted, such rules are no more explicit on Geonames than the rules 
used to match the aliases of Tim Berners-Lee on DBpedia. But I think 
they could and should be in both cases.
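
As a rough illustration of what exposing the Geonames/INSEE heuristic 
might look like, here is a minimal Python sketch of the three steps: type 
correspondence, name matching, and homonym resolution by hierarchy. It is 
not the code actually used for Geonames. The record layout (class, fcode, 
name, parent), the function names, and the exact normalisation of case and 
special characters are assumptions made for the example.

```python
import unicodedata

# Type correspondence given in the message; its "etc" is left open here.
INSEE_CLASS_TO_FCODE = {"Region": "ADM1", "Departement": "ADM2"}


def name_key(name):
    """Fold case, strip accents, and treat punctuation as spaces."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    cleaned = "".join(c if c.isalnum() else " " for c in no_accents)
    return " ".join(cleaned.casefold().split())


def match(insee, geonames_entries):
    """Return the single Geonames record matching an INSEE record, or None."""
    fcode = INSEE_CLASS_TO_FCODE.get(insee["class"])
    # Steps 1 and 2: same type on both sides, then same normalised name.
    candidates = [
        g for g in geonames_entries
        if g["fcode"] == fcode and name_key(g["name"]) == name_key(insee["name"])
    ]
    if len(candidates) > 1:
        # Step 3: homonyms are separated by their parent in the hierarchy.
        candidates = [
            g for g in candidates
            if name_key(g.get("parent", "")) == name_key(insee.get("parent", ""))
        ]
    # Assert nothing unless exactly one candidate remains.
    return candidates[0] if len(candidates) == 1 else None


# Hypothetical records, for illustration only.
geonames = [{"id": "g1", "fcode": "ADM2", "name": "COTES D'ARMOR", "parent": "Bretagne"}]
insee = {"class": "Departement", "name": "Côtes-d'Armor", "parent": "Bretagne"}
print(match(insee, geonames))
```

Returning nothing when the candidates stay ambiguous is a design choice of 
this sketch: a missing owl:sameAs link seems less harmful than a wrong one.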
> using different HTTP URIs:
>
> 1. DBpedia: http://dbpedia.org/resource/Tim_Berners-Lee
> 2. Hannover DBLP Server: 
> http://dblp.l3s.de/d2r/resource/authors/Tim_Berners-Lee
> 3. Berlin DBLP Server: 
> http://www4.wiwiss.fu-berlin.de/dblp/resource/person/100007
> 4. RDF Book Mashup: 
> http://www4.wiwiss.fu-berlin.de/bookmashup/persons/Tim+Berners-Lee
>
> ...
>
> 5. Tim also publishes a FOAF profile in which he assigns the URI 
> http://www.w3.org/People/Berners-Lee/card#i to himself.
>
> Question 1: According to the terminology of the Architecture of the WWW 
> document [4] are all these URIs aliases for the same non-information 
> resource (our current view) or are they referring to different resources? 
>   
As said above, it's up to the publisher of the "owl:sameAs" assertions 
to make the rules explicit. If you consider those URIs to be aliases of 
the same resource, be bold :-) , say so, but say why.

> Does the TAG finding "On Linking Alternative Representations To Enable 
> Discovery And Publishing " [5] about generic and specific resources apply 
> here, meaning that the URIs 1,2,3,5 refer to different specific 
> non-information resources that are related to one generic non-information 
> resource?
>
> Question 2: When the URIs are dereferenced they provide quite different 
> information about Tim, which reflects the knowledge and the opinion of the 
> specific URI owner about him. Within our tutorial we need to talk about this 
> information and therefore need a term to refer to a concept that can be 
> described as "information provided by a specific URI owner about a 
> non-information resource", for example Tim. Depending on the answer to 
> question 1, what would be the correct Web Architecture term to refer to this 
> concept? Or is such a term missing?
>   
Isn't such information called a "Description", with a "D" as in RDF? 
Or am I missing something subtler?
> Question 3: Depending on the answer to question 1, is it correct to use 
> owl:sameAs [6] to state that http://www.w3.org/People/Berners-Lee/card#i and 
> http://dbpedia.org/resource/Tim_Berners-Lee refer to the same thing as it is 
> done in Tim's profile.
>   
See above ...

-- 

Bernard Vatant
Knowledge Engineering
----------------------------------------------------
Mondeca
3, cité Nollez 75018 Paris France
Web:    www.mondeca.com <http://www.mondeca.com>
----------------------------------------------------
Tel:       +33 (0) 871 488 459
Mail:     bernard.vatant@mondeca.com <mailto:bernard.vatant@mondeca.com>
Blog:    Leçons de Choses <http://mondeca.wordpress.com/>
Received on Monday, 9 July 2007 08:17:56 UTC