RE: URIs and Unique IDs from tim.glover@bt.com on 2008-11-03 (semantic-web@w3.org from November 2008)

From: <tim.glover@bt.com>
Date: Mon, 3 Nov 2008 11:27:10 -0000
To: <graybeal@mbari.org>, <uschold@gmail.com>
Cc: <semantic-web@w3.org>, <aldo.gangemi@gmail.com>, <cshankey@reinvent.com>, <pmika@yahoo-inc.com>, <ora.lassila@nokia.com>, <jeff.z.pan@abdn.ac.uk>, <timbl@csail.mit.edu>, <Frank.van.Harmelen@cs.vu.nl>, <sean.bechhofer@manchester.ac.uk>
Message-ID: <AEF15555D64C494CA393778177A3A171054E5EB3@E03MVC1-UKBR.domain1.systemhost.net>
 
I agree with Michael Lang, who says that the community (or architect)
should decide how words are used in an ontology, and should agree on
changes.  Common sense suggests to me that the reason for changing the
semantics of a word is to correct an error, in which case the same word
should be used, and existing systems will be freed from the error. I
cannot think of a good reason for changing the meaning of a word in the
context of an ontology otherwise. 
 
But I think it's important to recognise that in most real systems there
are different levels of semantics. 
 
- Firstly there are some "keywords" in OWL, whose semantics is defined
by W3C and implemented by reasoning engine builders. 
 
- Secondly there will be some words that are not defined as part of
"OWL" but which are recognised as "keywords" by particular software
systems  ("if  x is a member of VIRAL_INFECTIONS do y").  In these cases
the software and the ontology are strongly bound together. 
 
- Thirdly there will be aspects of the ontology which are "data driven",
in that they are handled in a general way by the software ("find the
broader terms of a term").
 
- Fourthly, there may also be a distinction between "A box" and "T box"
words.  
 
Moreover, a word may be significant to the software in one system, but
handled in a "data driven" way by a different system. 
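
As a rough illustration of the second and third levels, here is a minimal
Python sketch. The toy ontology, the function names and the rule itself are
all assumptions made up for illustration; they do not come from any real
system, and "member of" is loosely modelled here as "has this broader term".

```python
# Hypothetical toy ontology: each term maps to its directly broader terms.
BROADER = {
    "influenza": {"VIRAL_INFECTIONS"},
    "VIRAL_INFECTIONS": {"infections"},
    "infections": set(),
}


def broader_terms(term, broader=BROADER):
    """Data-driven: walks the graph for any term; no term name is hard-coded."""
    seen, stack = set(), [term]
    while stack:
        for parent in broader.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def needs_special_handling(term):
    """Keyword-bound: the software itself depends on the name VIRAL_INFECTIONS."""
    return "VIRAL_INFECTIONS" in broader_terms(term)
```

Renaming or redefining VIRAL_INFECTIONS would silently change what
needs_special_handling does, while broader_terms would keep working on
whatever data it is given. That is the sense in which the first kind of word
is strongly bound to the software.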
 
It seems to me it is not at all clear cut to decide the best way to
modify these ontologies. 


________________________________

From: semantic-web-request@w3.org [mailto:semantic-web-request@w3.org]
On Behalf Of John Graybeal
Sent: 03 November 2008 04:25
To: Michael F Uschold
Cc: semantic-web@w3.org; aldo.gangemi@gmail.com; Conor Shankey; Peter
Mika; Ora Lassila; Pan, Dr Jeff Z.; Tim Berners-Lee; Frank van Harmelen;
sean.bechhofer@manchester.ac.uk
Subject: Re: URIs and Unique IDs


Michael, 

I will try to be clearer -- your confusion was my fault, sorry.
Appreciate very much your comments (and charitable interpretations, n.b.
'strong intuition' :->!). 

On Nov 1, 2008, at 9:33 AM, Michael F Uschold wrote:


	Agreed.  I assume you mean the vocabulary is the ontology? Are
we assuming OWL ontologies here, if not then what do you mean by a
vocabulary?
	


yes, I used 'vocabulary' to reflect what our customers have, but an OWL
ontology is what we will generate.

		  B. A vocabulary contains all the terms within it, not
just the terms that changed in that version

	

Here is my folksy perspective behind the model (more justifications near
the end):  If I say to a user "Here is a vocabulary dated X", the user
will assume that all the terms come with that vocabulary, and the terms
of that vocabulary are also dated X. So can I build a working semantic
approach that accepts this assumption?


	

	So in the SKOS example, the new SKOS vocabulary/ontology would
contain the terms that do not change URIs as well as terms with new
versions with new URIs.


No, sorry, I was sloppy and used 'terms' and 'URIs' interchangeably.
Here is the easy part: You can assume that if anything in the
specification of a term changes while its string of characters remains
the same, I will insist on a new URI for that term (the version string
will do nicely to discriminate). And if the string of characters for a
term changes, that will be a new URI too.
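
A minimal Python sketch of that rule, for illustration only. The base URL,
the path layout and the function names are assumptions, not the actual scheme
referenced in [1]; the only point is that the version string and the term's
characters both end up in the URI, so a change to either yields a new URI.

```python
from urllib.parse import quote

BASE = "http://example.org/vocab"  # hypothetical base, not the real scheme


def mint_uri(vocabulary, version, term):
    """Build a URI embedding the vocabulary name, version string and term."""
    parts = [quote(p, safe="") for p in (vocabulary, version, term)]
    return "/".join([BASE] + parts)


def mint_release(vocabulary, version, terms):
    """Give every term in a release a URI for that release, changed or not."""
    return {term: mint_uri(vocabulary, version, term) for term in terms}


old = mint_release("sst", "20071101", ["sea surface temperature"])
new = mint_release("sst", "20081103", ["sea surface temperature"])
```

Here old and new hold different URIs for the same string of characters,
which is what lets the version string discriminate between them.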

I am using  'term' to mean 'a string of characters that likely, but not
necessarily, means something to a human'.  So codes and opaque terms are
OK.  For most ontologies we'll create, terms will be words and word
phrases.

Anticipating your later comments, we concluded (you won't like this at
all):
 1. a URI is a suitable UID
 2. a term can be part of a suitable URI
The first is argued elsewhere by others.

Re the second:  Since what I really wanted to do was give people a way
to say "here's what this string of characters means", it doesn't bother
me that the same string of characters may mean something else later -- I
need the UID not for the _concept_, but for the unique string of
characters. That will always be the string of characters I want that UID
to refer to. So making the UID a URL that embeds the string of
characters was acceptable.

I think I understand the concerns about non-opaque and non-persistent
URLs, and believe that those costs are relatively low compared to the
resulting early adopter benefits of this approach. [1]


	This issue arises because of the conflation of URIs, UIDs and
human-readable IDs. Until these are de-conflated, probably this
principle is the right one. It will be unnecessary after de-conflation.


I am unconvinced de-conflation can happen, at least in our lifetimes,
which is why I made some of those horrible linkages above.

		D. It must be possible to 'look up' the current meaning
of a term, as well as specifically request any past meanings by their
URI

	I read this that a term like 'broader' in SKOS could have
multiple URIs for multiple versions.  If this is what you mean, then I
absolutely agree with this. 


Yes, this is what I mean, but keep in mind my previous conflation.


	Agreed. You seem to be proposing the idea of some kind of object
(perhaps with a URI)  that corresponds to the core term, and that its
various meanings or related versions are linked to the core term. This
may be a workable idea. Can this be done with the current  semantic web
infrastructure?


Oh, I sure hope so. (Well, new relationships may be needed. Not an
expert here.)  We are doing it a shade outside of the 'strict
infrastructure', if there is such a thing -- our server will try to be
smart about the relationships between vocabulary versions (well, it has
to be, to make sure the version relationships are maintained).

		  bb. Any significant definitional or semantic change to
a term should really create a new term, not just evolve the word we were
already using (what was SKOS thinking?)

	This is an interesting question with more than one reasonable
position. I think there are at least two cases:
	1. there was a bona fide conceptual error, and everyone agrees
that the old meaning was the wrong one and it is not wanted. 
	2. there is a new alternative, that works in some cases, and
some may also wish to use the older versions.
	
	For 1. you do NOT want to change the name of the term, which was and is
the right term.  But you DO want to change its UID because it is a
different thing.


	For 2, you probably want to introduce a new term with a new name
and a new UID. You could have the name of the transitive version of
broader be called broaderT and the non-transitive one be called broader.



yes to all the above, well put.


	You should be able to change the name w/o changing the UID.
	


Well, OK, maybe.  Not for my own vocabularies, because those are trying
to define strings, not concepts. As you can see, I am hung up on which
thing someone has in mind when they say the name -- is it the concept
behind the name, or the name itself? I find it a lot easier to consider
the name the resource of interest, and if someday my 'inflammable' is
redefined to mean flammable, then my ontology will be exactly as wrong
as all the books that used the 'old' definition of the word.  (At least
until I redefine the word. Sure hope everyone is using timestamps. :->)


		  cc. Created relationships to 'most current' URIs
persist even as the semantics of that resource may change; this
potentially introduces a time quality to inferences done with these
resources (e.g., "Today's New York Times has an article on election
polls" may be true statement today, but false next week.)  Those who
choose to use the 'most current' term will get what they pay for.


	You might be able to have programmatic or infrastructural
capability which can return the 'most current' version of a given core
term. There might be a URI/UID for the core term, and that is what would
be accessed. There, a directive would be given that says please return
the most recent version of that item. This is a promising idea that
could probably keep everyone happy.
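
One way to picture the 'core term' idea is the following Python sketch. It is
only a sketch under assumptions: the class, its method names and the
example.org URIs are hypothetical, and it stands in for whatever a real
server or infrastructure would provide.

```python
from dataclasses import dataclass, field


@dataclass
class CoreTerm:
    """A core term with its versioned URIs, stored oldest first."""
    core_uri: str
    versions: list = field(default_factory=list)  # (version, versioned URI)

    def add_version(self, version, uri):
        self.versions.append((version, uri))

    def most_current(self):
        """Return the URI of the latest version; its meaning may change over time."""
        if not self.versions:
            raise LookupError("no versions registered for " + self.core_uri)
        return self.versions[-1][1]

    def as_of(self, version):
        """Return the URI pinned to one specific version."""
        for known_version, uri in self.versions:
            if known_version == version:
                return uri
        raise KeyError(version)


broader = CoreTerm("http://example.org/skos/broader")
broader.add_version("2005", "http://example.org/skos/2005/broader")
broader.add_version("2008", "http://example.org/skos/2008/broader")
```

A client that asks for broader.most_current() accepts that the answer can
change after the next release; a client that asks for broader.as_of("2005")
always gets the same resource.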
	

		  ee. Both the provided service, and ontology engines in
general, must be able to relate terms to their semantically identical
historical counterparts 

	When every version of every term has its own UID, then this
becomes feasible, though it may also be an expensive overhead.
	

		  ff. The service should be able to quickly
identify/present to its users each change in semantic meaning for a
term.

	Yes, and an application should also be able to subscribe to the
core UID for a concept to be notified of any changes so it can keep up
to date automatically in the case where the most uptodate version is
wanted, and otherwise people can look into new versions on a case by
case basis.


Yes to all the above, and to the 'timestamps may be expensive' also.  I
am worried about expense, but suspect I won't be able to tell for a
while how resource-intensive this will be, and whether optimization will
take care of it, and whether I still will be paid to "keep this problem
solved."  But I plan as if I will...

		There may be some clear cut cases where you can tell
which things are static vs. dynamic. However IMHO, it is likely that a
lot (perhaps most) cases will be dependent on the needs of the
application, and the same concept may be dynamic in some applications
and static in others. 

Maybe.  If I declare the static concept is forever unvarying by
definition, I don't think it would be strategic for an application to
assume otherwise.


		I am less sanguine about this for predicates -- it seems
like you're allowing replacing the engine while the car is running.

	I don't follow this analogy.

		I can imagine a future scenario where this is
advantageous for predicates, but it seems really inappropriate at this
stage.

	You have a strong intuition that I'm not able to grasp.  Can you
articulate why with an example?


OK, my example uses 'sea surface temperature' as a subject, and
'sameAs' as a predicate. If, over time, the concept associated with 'sea
surface temperature' evolves from "any measurement of any body of sea
water within a meter or so of the ocean's surface" to "an informal
reference to the concept of temperature near the ocean surface
(deprecated as a reference to a particular measurement)", the tools I
have written may produce some less-than-ideal inferences if they assume
the new definition applies to old data, or vice-versa.  Even if the new
definition in 100 years is "measurement of the temperature of the foam
we keep on top of the  ocean to keep it cool", some inferences could be
faulty, but the engine won't break down.

But if I've originally used 'sameAs' in mappings to mean that two
concepts are analogous in certain defined ways (maybe a faulty original
practice, but go with it), and then the term is redefined by general
consensus to mean "refers to the exact same resource", I have some
really broken results, because a key piece right in the middle of my
infrastructure has changed.  If you try to change important parts in a
car while it's moving, bad things can happen, even if the new part is
every bit as good as the old part. If we try to change the meaning of
core terms used in semantic inferencing, then all the tools and things
are likely to behave oddly during the change, if not afterwards as well.

		As to the multiple URIs for a single concept problem
that was introduced in (aa) above, I have both a justification and a
backup plan.  The justification is that the meaning of terms and their
definitions is inferred in a context, and changes to the context (the
rest of the vocabulary) can affect the implicit meaning, or usage, of a
term that nominally wasn't changed. 


	This is true, and the reason why terms/words in wordnet belong
to multiple synsets. Each synset has a unique meaning, and in the owl
dataset, each synset has its own URI. So I don't find your argument
convincing.  Multiple context shows different uses of a term, so each
use should get a different UID, not the same one.


This is  a different context. Example below.

		So even if I haven't changed the explicit definition of
a term in a new vocabulary release, it is meaningful to consider this
term a new resource, and give it a new URI, to reflect its new context. 

	Maybe the wordnet example is a red herring.  In any event, can
you provide a clear example of how an application would find it helpful
to have whole new sets of URIs minted for identical things?


Here's a simple example, before giving you a detailed domain-specific
example: Let's say I review a vocabulary and change 80% of the
definitions. But the remaining definitions are deemed good and remain
unchanged.  By virtue of being part of a heavily reviewed vocabulary,
these remaining original terms have gained credibility -- they are more
reviewed and more trusted than they were before that version was
created.

For a domain example, let's go back to sea surface temperature.  5 years
ago, it meant something like "any measurement of any body of sea water
within a meter or so of the ocean's surface".  More recently, data
managers realized that wasn't specific enough. So 5 new terms were
created to precisely delineate the different kinds of sea surface
temperature.

Now, if I get a set of data that uses some of these new terms to label
variables, and also has an item labelled 'sea surface temperature', I
can infer that the use of the broader variable meant that no more
specific description could be provided.  Whereas in data from 5 years
ago, I might replace the general term in many cases, by looking at other
metadata to learn the more specific term. With the existence of the new
terms, the old term has new connotations.


	Here is one example where it is clearly a bad thing. 
	The application is ontology-driven at a deep level. It makes use
of the resources in the coding/creation of application functionality. It
also loads and makes use of data using the ontology. 
	T1: application loads ontology using original terms. 
	T2: application loads data expressed using the original terms
	T3: all new URIs are minted, when only a few have changed
semantics, and there is no indication of which ones have new semantics
and which have the same semantics.  


Well, this is bad but not unmanageable. Presumably a query of the
'before' and 'after' resources for those two concepts would reveal
whether or not there are differences.  Or, presumably you can query the
ontologies to get that info, even if you can't query the terms
themselves. (Hmm, in today's semantic web a lot of times you don't have
the original ontology versions either, do you?  But that would be
another thing that breaks the system to some degree, you don't have any
ability to validate previous inferences or see what it was like when the
relationships were created, so you can't validate them independently.
Sigh....)

But in any case, I accept the challenge here and say again "it only
works if the new URIs can say whether they are the same semantics as a
previous version."  Otherwise, I agree it's a bad thing.  


	T4: A new dataset is created which uses the new URIs 
	T5: The application loads the new data
	T6: The application poses a query which uses the old URIs to
filter data.
	T7: The new URIs do not match the old ones, so the query only
returns data from the old URIs when it should return data from the new
dataset as well.
	
	This is clearly a bad thing.   Your proposal has to argue
advantages that offset the disadvantage here, in order for me to buy
into it.


One mitigation of disadvantages is obtained if most of the users map to
the 'most recent version' (core concept) of the term, not specific
versions. I suspect this will be likely.
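
The T1-T7 failure, and the effect of links that say "same explicit semantics
as a previous version", can be sketched in Python as follows. The URIs, the
record layout and the function names are hypothetical; the link list stands
in for whatever version-relationship primitive ends up being used.

```python
def expand(uri, same_semantics_as):
    """Return uri plus every URI linked to it, directly or transitively."""
    neighbours = {}
    for a, b in same_semantics_as:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    seen, stack = {uri}, [uri]
    while stack:
        for other in neighbours.get(stack.pop(), ()):
            if other not in seen:
                seen.add(other)
                stack.append(other)
    return seen


def filter_records(records, uri, same_semantics_as=()):
    """Select records labelled with uri or with any equivalent versioned URI."""
    wanted = expand(uri, same_semantics_as)
    return [r for r in records if r["variable"] in wanted]


OLD = "http://example.org/vocab/v1/sea_surface_temperature"
NEW = "http://example.org/vocab/v2/sea_surface_temperature"
records = [{"id": 1, "variable": OLD}, {"id": 2, "variable": NEW}]

missed = filter_records(records, OLD)               # T7: only the old data
found = filter_records(records, OLD, [(NEW, OLD)])  # with a version link
```

Without the link the query on the old URI returns only record 1; with it,
both records come back. Terms whose explicit definition did change would get
no such link, so they would stay separate.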


		Of course, it is also very important to say this new
resource has the same definition and semantics as another, previous
resource, preferably pointing back to the original instance with that
definition/semantics. 


	This creates an unnecessary burden and seems to contradict your
point that something in a different context will have different
semantics. If it has different semantics, then why point back to
something with identical semantics? 


An excellent point.  (I'm busted!)  Apparently I differentiate between
explicit meanings, which one finds in the term's resource description,
and implicit meanings, which one finds in larger context. The version
relationship primitives have to be understood to refer to the explicit
meanings only. When the definition changes explicitly, that's a URI
change that no longer can be considered exactly the same concept.  


	I still can't see any advantages for creating multiple copies of
exactly the same thing.
	Have I missed something?


The practical advantage is the one introduced at the top -- I can
consider and implement the vocabulary as a unit, carrying all of its
components along with it.  Conceptually/abstractly I suspect this may be
the right way to think of a vocabulary. 

But more practically, this gives me a trivial way to generate URIs for
those terms, a trivial way to capture the contents of each new version
of the ontology (otherwise I have to analyze every term to decide if it
is different, right?), a trivial way to explain to the user what the URI
for each term will look like, and a way to tell from the term URI which
vocabulary it's a part of (not that I'd ever do that to an opaque
URI...).

But of course, I realize I have to go do some of these things later, in
any reasonable version of the system...I just don't have to do them
*instantly*....

		I imagine we will have to create a relationship for our
own use that has this meaning for now.  

	We probably will need some new infrastructural primitives, to
relate versions to each other. 


Just so.


	This is a practical solution which would probably be pretty easy
when URIs are de-conflated with UIDs. Though proliferation of URIs for
the same thing should be reduced whenever possible. 
	
	See another thread I started on similar topic by googling 
	["proliferation of URIs" uschold]
	


Excellent, I looked at the summary post and I see things with your level
of concern, perhaps more than the responders. (Though I liked Tim's
quote: "So multiple URIs for the same thing is life, a constant
tradeoff, but life is, on balance good.")  I would be a relatively small
scale offender for a while, but a bad example.


I will leave it there, too long a post for sure.


John


[1] Our URI creation scheme is described at
http://marinemetadata.org/apguides/ontprovidersguide/ontguideconstructinguris ,
with other details in that web neighborhood.
Received on Monday, 3 November 2008 11:28:10 UTC