RE: [SE] Composite Identification Schemes on the Semantic Web

Dear all

I answered Phil privately, thinking what I had to say was a bit outside the scope of this
WG/TF, but he suggested that it might be of interest to all. So please find that answer
below (slightly reworded, and expanded in the last sections).

Phil Tetlow wrote :

> One minor point - The URI system is the foundation on which the Web is
> built. So (to use your words) I think that the time may well be 'right' to
> consider the validity of identification schemes that 'augment','complement'
> or 'extend' this system, rather than 'shift from it'. A subtle change in
> words - I hope you do not mind?

Well, actually, I do, and I think it is not minor :)
Though English is not my native language, when I write "shift" I mean "shift".

- As is well attested by the never-ending debates about the "meaning" of URIs (social or
otherwise), we are in a situation in which URIs share basically the same characteristics
as names in natural languages, or plain identifiers in various information systems, like
telephone numbers or credit card numbers, which have no meaning outside the context of the
telephone network or the bank network. URIs are used to identify resources, but there is
not, and most likely never will be, any universal agreement on what exactly a resource is,
either in general or for any particular identified resource, except through a very
recursive definition: "A resource is something identified by a URI" and "This particular
resource is what is identified by this particular URI" ... In short, having agreed to
use, say, a passport number and/or Family Name + First Name + Birth Date + Birth Place to
identify a person does not mean that you know what a person is in general, or what/who
this particular person, identified in such a way, is. You only agree on some
identification protocol when checking in at the airport. That's why I keep saying: there
is no (absolute) identity, there are only identification protocols.

- Like it or not, people will use the same URI in different contexts to identify different
things, whatever the strength of recommendations saying: "This is bad practice, you should
not do that". People will do it anyway, for various well-known reasons: because they are
not aware of the fact that the URI they use is already used, or they are aware of it but
they don't understand the semantics already declared, or they don't care, or they think
this very URI should mean something else, or they deliberately want to screw up the system
etc.

- People will keep creating new URIs even when plenty already exist for the concepts they
need (see the 399 "foo#Person" URIs on Swoogle), because they want them in their own
namespace, because they are lazy, because they have not discovered the existing URIs, or
they are not sure the existing one(s) mean exactly what they need, or they don't trust the
source, etc.

- In short, URI-based languages, so to speak, are bound to evolve like all natural
languages: homonymy, synonymy and ambiguity are the general rule, and it is inside
identification contexts, situations, protocols and conversations that ambiguity is
resolved and the names used hopefully identify the same thing for all the interlocutors
in the conversation (humans and machines). And, IMO, this reality is completely orthogonal
to whether URIs represent very formal elements in ontologies (say, a class in a
well-engineered OWL ontology) or loosely defined plain RDF resources.

- Outside URI-based identification, a lot of identification protocols are already at work
on the Web, based either on non-URI but unambiguous identifiers such as ISBN numbers (see
http://isbn.nu), airport codes, country codes, language codes, etc., or on composite
identification schemes, or on full-text entity recognition performed by NL tools (see
Google News). Some of those protocols are pretty effective, some generate noise and
silence (false matches and missed ones), and so far URI-based identification is just
another one of them, no more foolproof than any of the others, for the above reasons.

Seeking dynamic and seamless integration of all the various identification protocols,
existing, foreseen and unforeseen, is IMO the way to go, and yes, somehow it "augments"
the URI-based identity system if you like to see it like that. But in fact it is not as if
URIs were the only identification tools and the others were still to be invented and
added: they are already here, and what we need is integration. If we don't look for
integration, we will keep on having on one side the so-called semantic technologies, seen
as the academic, AI, KR and logic camp, and on the other side the full-text, linguistic,
heuristic, fuzzy-but-efficient algorithms of Google et al. We really need both, not on two
sides of a no man's land, but working seamlessly together. Wouldn't that be a "shift" from
the current state of things?
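
To make the idea of integrating identification protocols a bit more concrete, here is a
minimal sketch in Python. Everything in it is hypothetical and made up for illustration
(the names Protocol, passport_key, civil_key and identify_together, the record fields, the
sample data). It only shows two records being matched under whichever protocol both sides
can take part in, without ever claiming to know what a person "is".

```python
from dataclasses import dataclass
from typing import Callable, Hashable, Optional, Sequence


@dataclass(frozen=True)
class Protocol:
    """An identification protocol: valid in one context, yields a key or None."""
    context: str
    key_of: Callable[[dict], Optional[Hashable]]


def passport_key(record: dict) -> Optional[Hashable]:
    # Single-identifier protocol: only usable when a passport number is present.
    return record.get("passport") or None


CIVIL_FIELDS = ("family_name", "first_name", "birth_date", "birth_place")


def civil_key(record: dict) -> Optional[Hashable]:
    # Composite protocol: all four fields are required, compared case-insensitively.
    if not all(record.get(f) for f in CIVIL_FIELDS):
        return None
    return tuple(record[f].strip().lower() for f in CIVIL_FIELDS)


def identify_together(a: dict, b: dict, protocols: Sequence[Protocol]) -> Optional[str]:
    """Return the context of the first protocol under which a and b match, else None."""
    for protocol in protocols:
        key_a, key_b = protocol.key_of(a), protocol.key_of(b)
        if key_a is not None and key_a == key_b:
            return protocol.context
    return None


# Assumed ordering: try the single identifier first, then the composite scheme.
PROTOCOLS = [
    Protocol("airport check-in", passport_key),
    Protocol("civil status", civil_key),
]

# Made-up sample records.
checkin = {"passport": "00XX00000", "family_name": "Doe", "first_name": "Jane",
           "birth_date": "1970-01-01", "birth_place": "Paris"}
registry = {"family_name": "DOE", "first_name": "jane",
            "birth_date": "1970-01-01", "birth_place": "paris"}
stranger = {"family_name": "Doe", "first_name": "John",
            "birth_date": "1970-01-01", "birth_place": "Paris"}

print(identify_together(checkin, registry, PROTOCOLS))  # civil status
print(identify_together(checkin, stranger, PROTOCOLS))  # None
```

The function returns the context in which the match holds rather than a bare yes/no,
which is the point made above: the two records are identified together inside a given
protocol, not in any absolute sense.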

For the record, at Mondeca we've been working for a while with linguistic tools connected
to our semantic databases, both with Danish and Italian research groups in Computational
Linguistics in the framework of the European project MOSES, and with our partner Temis
[1], including, in customer projects, both indexing assistance and entity and relationship
extraction. Matching the settings of NL processing components with formal ontologies is a
challenging task, but the results we have obtained so far in domains like legal
documentation or economic intelligence are really exciting.

Cheers

Bernard

[1] http://www.temis-group.com/

**********************************************************************************

Bernard Vatant
Senior Consultant
Knowledge Engineering
bernard.vatant@mondeca.com

"Making Sense of Content" :  http://www.mondeca.com
"Everything is a Subject" :  http://universimmedia.blogspot.com

**********************************************************************************

Received on Wednesday, 26 January 2005 22:46:45 UTC