Synonym URIs Re: homonym URIs (Re: What if an URI also is a URL) from Bernard Vatant on 2007-06-12 (semantic-web@w3.org from June 2007)

From: Bernard Vatant <bernard.vatant@mondeca.com>
Date: Tue, 12 Jun 2007 12:32:51 +0200
To: Chris Bizer <chris@bizer.de>
Cc: Pat Hayes <phayes@ihmc.us>, Sandro Hawke <sandro@w3.org>, semantic-web@w3.org, Linking Open Data <linking-open-data@simile.mit.edu>
Message-ID: <466E7653.2020602@mondeca.com>
Hi all

Chris's pointer to what is currently going on in the Linking Open Data
project is very relevant to this thread. We have there a real test bed
to see how those things can fly and scale.

I would like to stress, with concrete examples from this project, an issue 
which has been less discussed than URI ambiguity or homonymy (on which 
everything and its contrary has been brilliantly argued already in many 
threads). When bringing together data, we meet URIs defined and 
maintained independently by different actors, and *apparently* denoting 
*more or less* the same thing. Since I agree with Pat et al. that there 
is no way to make sure what the referent of a URI is (being a complete 
agnostic, I'm not even sure if referents exist independently of the 
signs), it's conceptually and technically impossible to know for sure 
when two URIs have the same referent, unless it can be inferred from 
their descriptions.
Let me take an example from Linking Open Data. Harvesting Wikipedia 
articles, dbpedia is minting URIs and their descriptions (meaning?). 
Note that those URIs belong to the dbpedia domain, so their semantics is 
dbpedia's responsibility, as stressed by Tim. But since those URIs are 
created automatically from Wikipedia article URLs, this responsibility is 
somehow delegated, through the algorithm, to Wikipedia editors.
So if one asks what the referent of http://dbpedia.org/resource/Berlin is,
of course its meaning is specified in the many elements of description 
provided by this very URI, as explained by Chris. But all those elements 
were added in the first place by Wikipedia contributors, so the 
responsibility of the URI owner is limited to writing the smart 
RDF-isation algorithm. The actual definition of the referent is left 
to Wikipedia contributors, who prepared the structured content in such a 
way that the structure can be interpreted as implicit semantics. If 
tomorrow Wikipedia editors collectively decide that Berlin is something 
completely different, the semantics of the above URI will be completely 
and automatically changed at the next update, without control of its 
owner. Not likely to happen too much for Berlin? Just wait.

Now on the GeoNames side, we had defined in a completely different process 
the URI http://sws.geonames.org/2950159/
Since the definition processes were completely independent, there was no 
way to infer that the referents were the same, nor that they were 
different. I suggested that the linking data process should capture this 
agnosticism, and have a way to declare: those URIs seem to be synonyms, 
but they actually declare their semantics in somewhat parallel layers; 
it's up to your system to take them as having the same referent or not. I 
proposed technical ways to represent this loose linking, using one level 
of indirection. People from dbpedia, including Chris and Richard, told me: 
This is conceptually correct, but a technical burden. We need quick 
and dirty linking, so "Be bold", they said, and use "owl:sameAs". I 
surrendered, and they went the owl:sameAs way, and so did GeoNames. 
That's why you now have in the dbpedia description properties inferred from 
the GeoNames description (though for some reason not all of them: e.g. 
latitude and longitude attached to GeoNames are not in the dbpedia 
description).
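
To make the difference concrete, here is a minimal Python sketch. It is
not how dbpedia or GeoNames implement anything: triples are plain tuples,
the ex: predicates and their values are made-up placeholders, and
ex:seemsSynonymOf is a hypothetical stand-in for a loose link (it is not
the indirection mechanism mentioned above). It only illustrates that
owl:sameAs lets every statement about one URI be restated about the
other, while a weaker link entails nothing.

```python
OWL_SAME_AS = "owl:sameAs"
LOOSE_LINK = "ex:seemsSynonymOf"  # hypothetical predicate, illustration only

DBPEDIA = "http://dbpedia.org/resource/Berlin"
GEONAMES = "http://sws.geonames.org/2950159/"

# Made-up descriptions: the ex: predicate and its values are placeholders.
descriptions = {
    (DBPEDIA, "ex:describedVia", "Wikipedia article"),
    (GEONAMES, "ex:describedVia", "GeoNames record"),
}


def same_as_closure(triples):
    """Copy statements across owl:sameAs links (subject position only)."""
    result = set(triples)
    changed = True
    while changed:
        changed = False
        links = [(s, o) for (s, p, o) in result if p == OWL_SAME_AS]
        for a, b in links:
            for src, dst in ((a, b), (b, a)):  # owl:sameAs is symmetric
                for (s, p, o) in list(result):
                    if s == src and p != OWL_SAME_AS and (dst, p, o) not in result:
                        result.add((dst, p, o))
                        changed = True
    return result


def statements_about(triples, uri):
    """Return the (predicate, object) pairs asserted about uri."""
    return sorted((p, o) for (s, p, o) in triples if s == uri and p != OWL_SAME_AS)


strict = same_as_closure(descriptions | {(DBPEDIA, OWL_SAME_AS, GEONAMES)})
loose = same_as_closure(descriptions | {(DBPEDIA, LOOSE_LINK, GEONAMES)})

print(statements_about(strict, GEONAMES))  # both layers of description, merged
print(statements_about(loose, GEONAMES))   # only the GeoNames layer
```

With the strict link, a consumer can no longer tell which layer a
statement came from; with the loose one, that decision stays with the
consuming system.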

So far, so good. We have very rich descriptions. OTOH, no one is able at 
this point to know if they are consistent. Declaring synonymy using 
owl:sameAs will have unexpected consequences. So, at best, this quick 
and dirty road merges different layers of description of the same thing; 
at worst it has merged things that should not have been merged. In the 
latter case, we'll know it at some point, and then what shall we do?
This is not done for Germany, but in France GeoNames defines two 
different "features" (read: resources) for cities considered as ADM4 
(administrative entities) and as PPL (populated places). Which makes 
sense, because they convey definitely different semantics. See e.g. 
http://sws.geonames.org/3014258/ vs http://sws.geonames.org/6446645/ 
(that's my city :-) ). We had to decide, when matching with French INSEE 
URIs, which one was to be mapped to http://rdf.insee.fr/geo/COM_05065. 
We chose the ADM4 (the latter). And, being bold again, we used owl:sameAs ...

Wikipedia does not make any distinction so far between populated places 
and the administrative entities having their seat in them. There is no 
article for Guillestre in English Wikipedia so far (one in French, 
http://fr.wikipedia.org/wiki/Guillestre, and others in Italian, Spanish 
...) so no dbpedia entry. When there is one, to which GeoNames entry 
will it map? On what basis? And when dbpedia goes multilingual, will 
it consider different resources for each wiki, or consider them as 
synonyms?
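
The mapping question is not innocent, because owl:sameAs is transitive.
Here is a small Python sketch under stated assumptions: the INSEE to ADM4
link is the one described above; http://dbpedia.org/resource/Guillestre
is hypothetical (no such entry exists yet), and so are its two links. The
helper names are mine. Using a union-find over sameAs links, one can see
that a single extra link to the PPL feature would make ADM4 and PPL the
same referent for any system taking owl:sameAs at face value.

```python
PPL = "http://sws.geonames.org/3014258/"    # populated place
ADM4 = "http://sws.geonames.org/6446645/"   # administrative entity
INSEE = "http://rdf.insee.fr/geo/COM_05065"
DBPEDIA = "http://dbpedia.org/resource/Guillestre"  # hypothetical URI


def find(parent, x):
    """Return the representative of x's equivalence class."""
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path halving
        x = parent[x]
    return x


def same_referent(same_as_links, a, b):
    """True if a and b end up in one class under the given sameAs links."""
    parent = {}
    for x, y in same_as_links:
        root_x, root_y = find(parent, x), find(parent, y)
        if root_x != root_y:
            parent[root_y] = root_x
    return find(parent, a) == find(parent, b)


# The link described in this message.
existing = [(INSEE, ADM4)]

# Hypothetical future links: a dbpedia entry matched to INSEE by one
# party and to the populated place by another.
hypothetical = existing + [(DBPEDIA, INSEE), (DBPEDIA, PPL)]

print(same_referent(existing, PPL, ADM4))
print(same_referent(hypothetical, PPL, ADM4))
```

Neither of the hypothetical links looks wrong on its own; the merge only
appears once they are combined.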

Etc. Beyond theoretical arguments, Linking Open Data has made 
pragmatic choices. The project is building up and is considered a Semantic 
Web bootstrapping kernel. I think we all should look closely at such 
details, and monitor their consequences as it scales up.

Bernard
>
> Hi Sandro and Pat,
>
>> My advice here is, I confess, not widely followed.  But I hear more and
>> more people converging on the idea that this is both practical and
>> likely to be sufficiently effective.
>
> Sandro: Just to back your claim that more and more people are 
> converging with some hard facts:
>
> Within the W3C SWEO Linking Open Data project, people are 
> collaborating to publish and interlink huge amounts of RDF data on the 
> Web according to Tim's Linked Data principles
> http://www.w3.org/DesignIssues/LinkedData.html
>
> Currently, this collaborative effort has "specified the meaning" (if 
> you want to see it this way) of maybe 10 million URIs covering topics 
> like geographic locations, books, publications, music, .... The 
> descriptions altogether amount to a dataset of about one billion RDF 
> triples.
>
> Any of these 10 million URIs can be looked up over HTTP to 
> retrieve a description of its meaning.
>
> Some example URIs from DBpedia (http://dbpedia.org/docs/) which forms 
> part of the Linking Open Data project:
>
> URI denoting the concept of Berlin as a town in Germany:
>
> http://dbpedia.org/resource/Berlin
>
> RDF description about Berlin, which you get by dereferencing the URI 
> above with the mime type application/rdf+xml
>
> http://dbpedia.org/data/Berlin
>
> Human-readable HTML description about Berlin, which you get by 
> dereferencing the URI above with the mime type text/html
>
> http://dbpedia.org/page/Berlin
>
> As you can see, the meaning of the term is pretty clearly defined by 
> putting it into several SKOS categories, having several rdf:type 
> statements about it and describing it in 10 different natural languages.
> All other 1 600 000 DBpedia terms are described in a similar way.
>
> An overview about the other 8 million concepts with dereferencable 
> URIs that were created in the project is given in 
> http://linkeddata.org/documents/eswc2007-poster-linking-open-data.pdf
> and on the project website 
> http://esw.w3.org/topic/SweoIG/TaskForces/CommunityProjects/LinkingOpenData 
>
>
>> So when users paste that URI into their browser, they get the official
>> documentation about it.
>
> This behavior can be demonstrated with Semantic Web browsers like 
> Tabulator or DISCO or the OpenLink Data browser.
>
> Just click on a link below to start exploring the meaning of terms 
> using DISCO.
>
> The WWW 2006 conference
> http://www4.wiwiss.fu-berlin.de/rdf_browser/?browse_uri=http%3A%2F%2Fwww4.wiwiss.fu-berlin.de%2Fdblp%2Fresource%2Frecord%2Fconf%2Fwww%2F2006 
>
>
> The Tetris computer game
> http://www4.wiwiss.fu-berlin.de/rdf_browser/?browse_uri=http%3A%2F%2Fdbpedia.org%2Fresource%2FTetris 
>
>
> Tim Berners-Lee
> http://www4.wiwiss.fu-berlin.de/rdf_browser/?browse_uri=http%3A%2F%2Fwww.w3.org%2FPeople%2FBerners-Lee%2Fcard%23i 
>
>
> Concerning "practical and sufficiently effective", I liked a recent 
> paper by Google about their plans for the Web-of-Data.
>
> "Web-scale Data Integration: You can only afford to Pay As You Go"
> http://www-db.cs.wisc.edu/cidr/cidr2007/papers/cidr07p40.pdf
>
> The basic line of argument is that we don't need completely 
> unambiguous terms and schemata to provide useful services to the end 
> user. Even if the answers are only approximate they will be useful 
> for the user. Google seems to handle this by using uncertainty on all 
> levels of their architecture including information extraction, schema 
> matching and query routing. At the end, this uncertainty goes into 
> their ranking algorithm and as the experience from the Web shows, 
> users are very happy with ranked approximate results where high 
> quality stuff tends to show up at the beginning of the list.
>
> Cheers
>
> Chris
>
>
> -- 
> Chris Bizer
> Freie Universität Berlin
> +49 30 838 54057
> chris@bizer.de
> www.bizer.de
> ----- Original Message ----- From: "Sandro Hawke" <sandro@w3.org>
> To: "Pat Hayes" <phayes@ihmc.us>
> Cc: <semantic-web@w3.org>
> Sent: Tuesday, June 12, 2007 12:11 AM
> Subject: homonym URIs (Re: What if an URI also is a URL)
>
>
>>
>>
>> Pat Hayes <phayes@ihmc.us> writes:
>>> Tim, as this discussion gets to the heart of what
>>> I've been trying to argue for several years,
>>> please take the comments below as intended in a
>>> spirit of analysis rather than just pins and
>>> angels.
>>
>> Pat, I'm going to jump in here, if you don't mind.  I think my position
>> on these issues is pretty much the same as Tim's but I could be wrong.
>> I don't argue that John's "dance" isn't required, just that part of the
>> Semantic Web version of the dance is: don't make your URIs unnecessarily
>> ambiguous.  One might even say: don't pun.
>>
>>> And what about a URI
>>> that I own and wish it to denote, say, the planet
>>> Venus, or my pet cat? What do I do, to attach the
>>> URI to my intended referent for it?
>>
>> You publish a document (an ontology) so it's available through that URI.
>> If it's a hash URI, you publish the ontology at the non-hash version.
>> If it's a slash URI, you publish the ontology at the far end of a 303
>> redirect.  And you content-negotiate HTML and RDF.
>>
>> So when users paste that URI into their browser, they get the official
>> documentation about it.
>>
>> And when RDF software dereferences that URI, it gets some logical
>> formulas which should be understood (like the HTML) to be asserted by
>> the URI's owner/host/publisher.  Those formulas constrain the possible
>> meanings of that URI, relative to other URIs.  They can't nail a URI to
>> Venus, but they can use other ontologies to provide useful (and possibly
>> very constraining) information, like that it's an astronomical body with
>> a mass of about 5e+24kg.
>>
>> My advice here is, I confess, not widely followed.  But I hear more and
>> more people converging on the idea that this is both practical and
>> likely to be sufficiently effective.
>>
>>> The point surely is that URIs used to refer (not
>>> as in HTTP, but as in OWL) do *not* have a
>>> standardized meaning. Standards are certainly a
>>> chore to create, but they only go so far. OWL
>>> defines the meanings of the OWL namespace, but it
>>> does not define the meanings of the FOAF
>>> vocabulary,
>>
>> No, that's up to the owner(s) of the FOAF terms.
>>
>>> or the URIrefs used in, say,
>>> ontologies published by the NIH or by JPL.
>>
>> And that's up to the NIH and JPL, respectively.
>>
>>> The
>>> only way those meanings can be specified is by
>>> writing ontologies: and finite ontologies do not
>>> - cannot possibly - nail down referents
>>> *uniquely*.
>>
>> Ah -- there we go.  There must be a long history of this subject in
>> philosophy.  Can things ever be nailed down uniquely?  I haven't a clue.
>> But that's the wrong question.  In this thread, I don't think we're
>> talking about whether we can really be sure what we mean when we say
>> such a URI denotes Venus.  Instead, we're talking about whether it's a
>> good practice to use a single URI to denote clearly distinct things,
>> such as:
>>   (1) the second rock from the sun
>>   (2) the Roman goddess of love
>>   (3) a star tennis player
>>   (4) ... etc
>> The term "ambiguity" covers both these issues, but we don't need to
>> combine them.   The first is a kind of imprecision, a fuzziness, while
>> the second is the re-use of a word for a second meaning, a homonym.
>> (Homonyms seem to be called "overloading" in computer programming.)
>>
>> I think we know how to work with homonyms, but since we're engineering a
>> new system, it seems like a good design decision to forbid them, doesn't
>> it?
>>
>>    -- Sandro
>>
>
>
>

-- 

Bernard Vatant
Knowledge Engineering
----------------------------------------------------
Mondeca
*3, cité Nollez 75018 Paris France
Web:    www.mondeca.com <http://www.mondeca.com>
----------------------------------------------------
Tel:       +33 (0) 871 488 459
Mail:     bernard.vatant@mondeca.com <mailto:bernard.vatant@mondeca.com>
Blog:    Leçons de Choses <http://mondeca.wordpress.com/>
Received on Tuesday, 12 June 2007 10:34:13 UTC