Re: HTTP URIs for real world objects from Martin Hepp (UIBK) on 2008-01-18 (semantic-web@w3.org from January 2008)

From: Martin Hepp (UIBK) <martin.hepp@uibk.ac.at>
Date: Fri, 18 Jan 2008 09:30:55 +0100
To: Peter Ansell <ansell.peter@gmail.com>
CC: KANZAKI Masahide <mkanzaki@gmail.com>, Danny Ayers <danny.ayers@gmail.com>, Peter F Brown <peter@pensive.eu>, Bernard Vatant <bernard.vatant@mondeca.com>, Reto Bachmann-Gmür <reto@gmuer.ch>, Leo Sauermann <leo.sauermann@dfki.de>, public-sweo-ig@w3.org, semantic-web@w3.org, Paul Roe <p.roe@qut.edu.au>, James Michael Hogan <j.hogan@qut.edu.au>
Message-ID: <479063BF.7050607@uibk.ac.at>
Hi Peter:

 > I speak mainly because there are some editors on wikipedia who would
 > prefer not to have semantic markup on pages because it makes them ugly

First - using Wikipedia URIs as identifiers for concepts on the Semantic 
Web does not necessarily imply that anything is asserted about these 
URIs formally. (The Wikipedia page for John Lennon describes clearly 
enough which conceptual entity (i.e., John Lennon) it refers to, and the 
only ambiguity that may arise in this context is whether this URI refers 
to (1) the Wikipedia document as an information resource or (2) the 
dead person John Lennon as a non-information resource. This, however, 
can be resolved easily by opening up a new namespace reserved for the 
respective non-information resources and creating a derived URI for each 
Wikipedia URI in this space, e.g.

http://en.wikipedia.org/wiki/John_lennon ->

http://en.wikipedia.org/ontology/John_lennon)

DBpedia IDs may also serve this purpose.
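
The derivation described above can be sketched in a few lines of Python. 
This is only an illustration: the /ontology/ namespace is the 
hypothetical one proposed in this message and does not exist on 
Wikipedia, and the function name concept_uri is made up for the example.

```python
from urllib.parse import urlsplit, urlunsplit

ARTICLE_PREFIX = "/wiki/"      # path prefix of Wikipedia article pages
CONCEPT_PREFIX = "/ontology/"  # hypothetical namespace for non-information resources


def concept_uri(page_uri: str) -> str:
    """Derive a URI for the described entity from a Wikipedia page URI."""
    parts = urlsplit(page_uri)
    if not parts.path.startswith(ARTICLE_PREFIX):
        raise ValueError(f"not a Wikipedia article URI: {page_uri}")
    # Keep the page title unchanged; only the namespace part of the path differs.
    title = parts.path[len(ARTICLE_PREFIX):]
    return urlunsplit((parts.scheme, parts.netloc, CONCEPT_PREFIX + title, "", ""))


print(concept_uri("http://en.wikipedia.org/wiki/John_lennon"))
```

The page URI keeps identifying the document, and the derived URI 
identifies the person, so statements about one are never confused with 
statements about the other.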

So even without expanding Wikipedia, we can harvest the enormous amount 
of identifiers with a human-readable definition for weaving the Semantic 
Web. It is likely the largest set of consensual identifiers for 
conceptual entities in the world.

 > Wikipedia also does not create concepts until there is a sufficient
 > amount of "reliably published" information about them, and if they are
 > of no interest to people outside of the immediate community.

I disagree, and I have data to support my claim: We have shown in the 
IEEE Internet Computing paper [1] that the vast majority of Wikipedia 
URIs keep referring to the same meaning from the initial page to the 
most recent one. So they are not constantly changing. Sometimes the 
meaning broadens (e.g. if a page turns into a disambiguation page, which 
can be understood as a superconcept of the original one).

In short, we found the following:

- More than 92% of all 1.8 million URIs we analyzed showed a stable 
meaning, i.e., they kept on referring to the same meaning.

- About 6.7% had a slight but not dramatic change in meaning so that the 
current definition was broader than the original one. This would still 
not invalidate earlier annotations of Web content. Most of these turned 
into disambiguation pages. (Our paper contains more details on that.)

So the share of Wikipedia URIs that are not reliable as identifiers is 
extremely small - the population estimate is between 0.66% and 0.89% 
(depending on whether we are using the Laplace or Wilson method). I bet 
that even centrally administered vocabularies will show inconsistencies 
in this order of magnitude.
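
For readers unfamiliar with the two estimation methods, here is a rough 
Python sketch of one common reading of them: the Laplace point estimate 
(k + 1) / (n + 2) and the Wilson score interval. How exactly the paper 
applied them is not stated in this message, and the counts k and n below 
are hypothetical placeholders, not the figures from our study.

```python
import math


def laplace_estimate(k: int, n: int) -> float:
    """Laplace (rule of succession) estimate of a proportion."""
    return (k + 1) / (n + 2)


def wilson_interval(k: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a proportion (z = 1.96 gives about 95%)."""
    p = k / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Hypothetical sample: k unreliable URIs found among n inspected URIs.
k, n = 3, 400
low, high = wilson_interval(k, n)
print(f"Laplace estimate: {laplace_estimate(k, n):.4%}")
print(f"Wilson interval:  {low:.4%} to {high:.4%}")
```

Both methods are meant for small proportions observed in a sample, where 
the naive k / n estimate tends to be unstable.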

 > I would be inclined to keep the new and constantly changing
 > identifiers within an organisations intranet-wiki and then publish
 > their relationships to outside identifiers when they become
 > accepted/published/interesting to outsiders.
 >
Postponing the official use of new identifiers just means making our 
vocabulary lack identifiers for novel concepts. Also, there is no better 
way of getting identifiers "accepted" than by encouraging others to try 
to use them in their communication. (We find the same pattern in human 
language - new terms get established by usage, not by standardization.)

Also, it does not hurt if Wikipedia provides URIs for topics that are 
relevant for a small community only. Still, it is better if there is a 
single namespace and infrastructure for those (I don't see any gain in 
spreading those over numerous intranet-Wikis).


Best
Martin

[1] http://www.heppnetz.de/harvesting-wikipedia/

-----------------------------------------------
martin hepp, http://www.heppnetz.de


> On 17/01/2008, KANZAKI Masahide <mkanzaki@gmail.com> wrote:
>> yep, you can think, for example, an Wikipedia page as a Subject Indicator.
>>
>> :me a foaf:Person; foaf:interest wikipedia:Semantic_Web .
>> wikipedia:Semantic_Web foaf:primaryTopic concept:Semantic_Web .
>>
>> => :me foaf:topic_interest concept:Semantic_Web .
>>
>> In a sense, foaf:interest uses the object document as *an* indicator
>> of the subject(URI of such document is a Subject Identifier). And a
>> (P)SI can indicate the subject by using an IFP such as
>> foaf:primaryTopic.
>>
>> So we can almost think that an Wikipedia page is an PSI, except it
>> doesn't satisfy the last requirement of PSI: "A Published Subject
>> Indicator must explicitly state the unique URI that is to be used as
>> its Published Subject Identifier" (3.1.3 in spec).
> 
> This is a clean way to define the identifier without creating a new
> standard, other than the ontology. Of course, there is no need to
> intrude on wikipedia, as it has its own interests at heart and holds
> no claims to keep consistent URI's or to keep articles at any of their
> URI's. DBPedia seems like a better option for overlaying the knowledge
> in wikipedia with semantics.
> 
> I speak mainly because there are some editors on wikipedia who would
> prefer not to have semantic markup on pages because it makes them ugly
> (equating wikipedia's infoboxes to semantic content here), and is
> possibly incorrect (philosophy of not publishing anything till it is
> perfect and correct), and there is nothing a group of outsiders can do
> to change their point of view it seems.
> 
> Wikipedia also does not create concepts until there is a sufficient
> amount of "reliably published" information about them, and if they are
> of no interest to people outside of the immediate community. This
> leaves it closed to new information, so semantics can't grow within
> its vocabulary framework, and there can never be a proliferation of
> identifiers which are not going to be used outside of a small interest
> group. An equivalent wiki somewhere based on a specific interest area
> could go past the second restriction easily, but may still need to
> hold onto the first restriction otherwise it may be seen as
> unreliable.
> 
> I would be inclined to keep the new and constantly changing
> identifiers within an organisations intranet-wiki and then publish
> their relationships to outside identifiers when they become
> accepted/published/interesting to outsiders.
> 
> Peter Ansell
> 
>
Received on Friday, 18 January 2008 08:31:32 UTC