Re: URIs and Unique IDs from John Graybeal on 2008-11-03 (semantic-web@w3.org from November 2008)

From: John Graybeal <graybeal@mbari.org>
Date: Sun, 2 Nov 2008 20:24:41 -0800
To: Michael F Uschold <uschold@gmail.com>
Cc: semantic-web@w3.org, aldo.gangemi@gmail.com, "Conor Shankey" <cshankey@reinvent.com>, "Peter Mika" <pmika@yahoo-inc.com>, "Ora Lassila" <ora.lassila@nokia.com>, "Pan, Dr Jeff Z." <jeff.z.pan@abdn.ac.uk>, "Tim Berners-Lee" <timbl@csail.mit.edu>, "Frank van Harmelen" <Frank.van.Harmelen@cs.vu.nl>, sean.bechhofer@manchester.ac.uk
Message-Id: <110C173D-B0BD-4744-8737-610E5C28088A@mbari.org>

Michael,

I will try to be clearer -- your confusion was my fault, sorry.   
Appreciate very much your comments (and charitable interpretations,  
n.b. 'strong intuition' :->!).

On Nov 1, 2008, at 9:33 AM, Michael F Uschold wrote:

> Agreed.  I assume you mean the vocabulary is the ontology? Are we  
> assuming OWL ontologies here, if not then what do you mean by a  
> vocabulary?

yes, I used 'vocabulary' to reflect what our customers have, but an  
OWL ontology is what we will generate.

>   B. A vocabulary contains all the terms within it, not just the
> terms that changed in that version

Here is my folksy perspective behind the model (more justifications
near the end):  If I say to a user "Here is a vocabulary dated X", the  
user will assume that all the terms come with that vocabulary, and the  
terms of that vocabulary are also dated X. So can I build a working  
semantic approach that accepts this assumption?

>
> So in the SKOS example, the new SKOS vocabulary/ontology would  
> contain the terms that do not change URIs as well as terms with new  
> versions with new URIs.

No, sorry, I was sloppy and used 'terms' and 'URIs' interchangeably.  
Here is the easy part: You can assume that if anything in the  
specification of a term changes while its string of characters remains
the same, I will insist on a new URI for that term (the version string  
will do nicely to discriminate). And if the string of characters for a  
term changes, that will be a new URI too.

I am using  'term' to mean 'a string of characters that likely, but  
not necessarily, means something to a human'.  So codes and opaque  
terms are OK.  For most ontologies we'll create, terms will be words  
and word phrases.

Anticipating your later comments, we concluded (you won't like this at  
all):
  1. a URI is a suitable UID
  2. a term can be part of a suitable URI
The first is argued elsewhere by others.

Re the second:  Since what I really wanted to do was give people a way  
to say "here's what this string of characters means", it doesn't  
bother me that the same string of characters may mean something else  
later -- I need the UID not for the _concept_, but for the unique  
string of characters. That will always be the string of characters I  
want that UID to refer to. So making the UID a URL that embeds the  
string of characters was acceptable.
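As a rough illustration of conclusions 1 and 2 above, here is a minimal sketch of a URI that embeds the vocabulary name, the version string, and the term itself. The base address, the path layout, and the function names are all assumptions made for this sketch; they are not necessarily the scheme described in [1].

```python
from urllib.parse import quote

# Hypothetical base address, used only for illustration.
BASE = "http://example.org/ont"


def term_uri(vocabulary: str, version: str, term: str) -> str:
    """Build a versioned URI that embeds the term's string of characters."""
    # Percent-encode each piece so spaces or slashes in a term
    # stay inside that term's own path segment.
    parts = [quote(p, safe="") for p in (vocabulary, version, term)]
    return f"{BASE}/{parts[0]}/{parts[1]}/{parts[2]}"


def current_term_uri(vocabulary: str, term: str) -> str:
    """Unversioned form, standing for 'whatever the latest version says'."""
    return f"{BASE}/{quote(vocabulary, safe='')}/{quote(term, safe='')}"


print(term_uri("parameters", "20081102", "sea surface temperature"))
print(current_term_uri("parameters", "sea surface temperature"))
```

Because the version string is part of the path, the same string of characters in two vocabulary releases gets two distinct URIs, and the vocabulary a term belongs to can be read off the URI.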

I think I understand the concerns about non-opaque and non-persistent  
URLs, and believe that those costs are relatively low compared to the  
resulting early adopter benefits of this approach. [1]

> This issue arises because of the conflation of URIs, UIDs and human- 
> readable IDs. Until these are de-conflated, probably this principle  
> is the right one. It will be unnecessary after de-conflation.

I am unconvinced de-conflation can happen, at least in our lifetimes,  
which is why I made some of those horrible linkages above.

> D. It must be possible to 'look up' the current meaning of a term,
> as well as specifically request any past meanings by their URI
> I read this that a term like 'broader' in SKOS could have multiple  
> URIs for multiple versions.  If this is what you mean, then I  
> absolutely agree with this.

Yes, this is what I mean, but keep in mind my previous conflation.

> Agreed. You seem to be proposing the idea of some kind of object  
> (perhaps with a URI)  that corresponds to the core term, and that  
> its various meanings and related versions are linked to the core
> term. This may be a workable idea. Can this be done with the  
> current  semantic web infrastructure?

Oh, I sure hope so. (Well, new relationships may be needed. Not an  
expert here.)  We are doing it a shade outside of the 'strict  
infrastructure', if there is such a thing -- our server will try to be  
smart about the relationships between vocabulary versions (well, it  
has to be, to make sure the version relationships are maintained).
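To make the "core term plus linked versions" idea concrete, here is a minimal in-memory sketch of what such a server might track. The class names, fields, and example.org URIs are hypothetical; this is not a description of our server, only one way the bookkeeping could look.

```python
from dataclasses import dataclass, field


@dataclass
class TermVersion:
    uri: str         # versioned URI (hypothetical layout)
    version: str     # a string that sorts chronologically, e.g. "2008-05"
    definition: str


@dataclass
class CoreTerm:
    uri: str                                      # unversioned 'core' URI
    versions: list = field(default_factory=list)  # kept oldest-first

    def add(self, tv: TermVersion) -> None:
        self.versions.append(tv)
        self.versions.sort(key=lambda v: v.version)

    def current(self) -> TermVersion:
        """The most recent version of this term."""
        return self.versions[-1]

    def lookup(self, uri: str) -> TermVersion:
        """Core URI resolves to the current meaning; a versioned URI to that version."""
        if uri == self.uri:
            return self.current()
        for v in self.versions:
            if v.uri == uri:
                return v
        raise KeyError(uri)


broader = CoreTerm("http://example.org/skos/broader")
broader.add(TermVersion("http://example.org/skos/2008-05/broader", "2008-05", "newer definition"))
broader.add(TermVersion("http://example.org/skos/2005-11/broader", "2005-11", "older definition"))

print(broader.lookup("http://example.org/skos/broader").definition)
print(broader.lookup("http://example.org/skos/2005-11/broader").definition)
```

Asking for the core URI gives the current meaning, while any past meaning stays reachable through its own versioned URI.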
>   bb. Any significant definitional or semantic change to a term  
> should really create a new term, not just evolve the word we were  
> already using (what was SKOS thinking?)
> This is an interesting question with more than one reasonable  
> position. I think there are at least two cases:
> 1. there was a bona fide conceptual error, and everyone agrees that
> the old meaning was the wrong one and it is not wanted.
> 2. there is a new alternative, that works in some cases, and some  
> may also wish to use the older versions.
>
> For 1. you do NOT want to change the name of the term; it was and is the
> right term.  But you DO want to change its UID because it is a
> different thing.

>
> For 2, you probably want to introduce a new term with a new name and  
> a new UID. You could have the name of the transitive version of  
> broader be called broaderT and the non-transitive one be called  
> broader.

yes to all the above, well put.

> You should be able to change the name w/o changing the UID.

Well, OK, maybe.  Not for my own vocabularies, because those are  
trying to define strings, not concepts. As you can see, I am hung up  
on which thing someone has in mind when they say the name -- is it the  
concept behind the name, or the name itself? I find it a lot easier to  
consider the name the resource of interest, and if someday my  
'inflammable' is redefined to mean flammable, then my ontology will be  
exactly as wrong as all the books that used the 'old' definition of  
the word.  (At least until I redefine the word. Sure hope everyone is  
using timestamps. :->)

>   cc. Created relationships to 'most current' URIs persist even as
> the semantics of that resource may change; this potentially  
> introduces a time quality to inferences done with these resources  
> (e.g., "Today's New York Times has an article on election polls" may  
> be true statement today, but false next week.)  Those who choose to  
> use the 'most current' term will get what they pay for.
>
> You might be able to have programmatic or infrastructural capability  
> which can return the 'most current' version of a given core term.  
> There might be a URI/UID for the core term, and that is what would  
> be accessed. There, a directive would be given that says please  
> return the most recent version of that item. This is a promising
> idea that could probably keep everyone happy.
>   ee. Both the provided service, and ontology engines in general,  
> must be able to relate terms to their semantically identical  
> historical counterparts
> When every version of every term has its own UID, then this becomes  
> feasible, though it may also be an expensive overhead.
>   ff. The service should be able to quickly identify/present to its  
> users each change in semantic meaning for a term.
> Yes, and an application should also be able to subscribe to the core  
> UID for a concept to be notified of any changes so it can keep up to  
> date automatically in the case where the most up-to-date version is
> wanted, and otherwise people can look into new versions on a case by  
> case basis.

Yes to all the above, and to the 'timestamps may be expensive' also.   
I am worried about expense, but suspect I won't be able to tell for a  
while how resource-intensive this will be, and whether optimization  
will take care of it, and whether I still will be paid to "keep this  
problem solved."  But I plan as if I will...

> There may be some clear cut cases where you can tell which things
> are static vs. dynamic. However IMHO, it is likely that a lot  
> (perhaps most) cases will be dependent on the needs of the
> application, and the same concept may be dynamic in some  
> applications and static in others.
Maybe.  If I declare the static concept is forever unvarying by  
definition, I don't think it would be strategic for an application to  
assume otherwise.
>
> I am less sanguine about this for predicates -- it seems like you're  
> allowing replacing the engine while the car is running.
> I don't follow this analogy.
> I can imagine a future scenario where this is advantageous for  
> predicates, but it seems really inappropriate at this stage.
> You have a strong intuition that I'm not able to grasp.  Can you  
> articulate why with an example?

OK, my example uses 'sea surface temperature' as a subject, and
'sameAs' as a predicate. If, over time, the concept associated with  
'sea surface temperature' evolves from "any measurement of any body of  
sea water within a meter or so of the ocean's surface" to "an informal  
reference to the concept of temperature near the ocean surface  
(deprecated as a reference to a particular measurement)", the tools I  
have written may produce some less-than-ideal inferences if they  
assume the new definition applies to old data, or vice-versa.  Even if  
the new definition in 100 years is "measurement of the temperature of  
the foam we keep on top of the  ocean to keep it cool", some  
inferences could be faulty, but the engine won't break down.

But if I've originally used 'sameAs' in mappings to mean that two  
concepts are analogous in certain defined ways (maybe a faulty  
original practice, but go with it), and then the term is redefined by  
general consensus to mean "refers to the exact same resource", I have  
some really broken results, because a key piece right in the middle of  
my infrastructure has changed.  If you try to change important parts  
in a car while it's moving, bad things can happen, even if the new  
part is every bit as good as the old part. If we try to change the  
meaning of core terms used in semantic inferencing, then all the tools  
and things are likely to behave oddly during the change, if not  
afterwards as well.
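A toy sketch of the predicate case, with made-up resource names and a deliberately simplistic inference rule (both are assumptions for illustration only): the same stored mapping yields different conclusions depending on which reading of 'sameAs' the engine applies.

```python
def infer(facts, mappings, same_as_means_identity):
    """Return the facts after applying 'sameAs' mappings under one of two readings."""
    inferred = set(facts)
    if same_as_means_identity:
        # Strict identity: every statement about one resource
        # is also taken to hold for the other.
        for a, b in mappings:
            for s, p, o in facts:
                if s == a:
                    inferred.add((b, p, o))
                elif s == b:
                    inferred.add((a, p, o))
    # Under the looser 'analogous' reading nothing is copied across.
    return inferred


def units_of(resource, facts):
    return sorted(o for s, p, o in facts if s == resource and p == "unit")


facts = {
    ("mine:sst", "unit", "celsius"),
    ("theirs:sst", "unit", "kelvin"),
}
# Mapping created when 'sameAs' was taken to mean 'analogous in defined ways'.
mappings = [("mine:sst", "theirs:sst")]

print(units_of("mine:sst", infer(facts, mappings, same_as_means_identity=False)))
print(units_of("mine:sst", infer(facts, mappings, same_as_means_identity=True)))
```

Under the original reading 'mine:sst' has one unit; once the predicate is reinterpreted as strict identity, the unchanged mapping makes it appear to have two conflicting units.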
> As to the multiple URIs for a single concept problem that was  
> introduced in (aa) above, I have both a justification and a backup  
> plan.  The justification is that the meaning of terms and their  
> definitions is inferred in a context, and changes to the context  
> (the rest of the vocabulary) can affect the implicit meaning, or  
> usage, of a term that nominally wasn't changed.
>
> This is true, and the reason why terms/words in wordnet belong to  
> multiple synsets. Each synset has a unique meaning, and in the owl  
> dataset, each synset has its own URI. So I don't find your argument  
> convincing.  Multiple context shows different uses of a term, so  
> each use should get a different UID, not the same one.

This is a different context. Example below.

> So even if I haven't changed the explicit definition of a term in a
> new vocabulary release, it is meaningful to consider this term a new  
> resource, and give it a new URI, to reflect its new context.
> Maybe the wordnet example is a red herring.  In any event, can you
> provide a clear example of how an application would find it helpful
> to have whole new sets of URIs minted for identical things?

Here's a simple example, before giving you a detailed domain-specific  
example: Let's say I review a vocabulary and change 80% of the  
definitions. But the remaining definitions are deemed good and remain  
unchanged.  By virtue of being part of a heavily reviewed vocabulary,  
these remaining original terms have gained credibility -- they are  
more reviewed and more trusted than they were before that version was
created.

For a domain example, let's go back to sea surface temperature.  5  
years ago, it meant something like "any measurement of any body of sea  
water within a meter or so of the ocean's surface".  More recently,  
data managers realized that wasn't specific enough. So 5 new terms  
were created to precisely delineate the different kinds of sea
surface temperature.

Now, if I get a set of data that uses some of these new terms to label  
variables, and also has an item labelled 'sea surface temperature', I  
can infer that the use of the broader variable meant that no more  
specific description could be provided.  Whereas in data from 5 years  
ago, I might replace the general term in many cases, by looking at  
other metadata to learn the more specific term. With the existence of  
the new terms, the old term has new connotations.

> Here is one example where it is clearly a bad thing.
> The application is ontology-driven at a deep level. It makes use of  
> the resources in the coding/creation of application functionality.  
> It also loads and makes use of data using the ontology.
> T1: application loads ontology using original terms.
> T2: application loads data expressed using the original terms
> T3: all new URIs are minted, when only a few have changed semantics,  
> and there is no indication of which ones have new semantics and  
> which have the same semantics.

Well, this is bad but not unmanageable. Presumably a query of the  
'before' and 'after' resources for those two concepts would reveal  
whether or not there are differences.  Or, presumably you can query  
the ontologies to get that info, even if you can't query the terms  
themselves. (Hmm, in today's semantic web a lot of times you don't  
have the original ontology versions either, do you?  But that would be  
another thing that breaks the system to some degree, you don't have  
any ability to validate previous inferences or see what it was like  
when the relationships were created, so you can't validate them  
independently. Sigh....)

But in any case, I accept the challenge here and say again "it only  
works if the new URIs can say whether they are the same semantics as a  
previous version."  Otherwise, I agree it's a bad thing.
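One way the new URIs could "say" this, sketched under assumptions: when a new vocabulary version is loaded, compare each term's explicit definition with the previous version and record a link back. The relation names 'sameSemanticsAs' and 'revises', the URI layout, and the sample definitions are hypothetical, not existing primitives.

```python
def link_versions(old, new, old_version, new_version, make_uri):
    """Compare explicit definitions and return (new_uri, relation, old_uri) links."""
    links = []
    for term, definition in sorted(new.items()):
        if term not in old:
            continue  # brand-new term: nothing to point back to
        relation = "sameSemanticsAs" if definition == old[term] else "revises"
        links.append((make_uri(new_version, term), relation, make_uri(old_version, term)))
    return links


def make_uri(version, term):
    # Hypothetical URI layout, for illustration only.
    return f"http://example.org/vocab/{version}/{term}"


v1 = {
    "sst": "temperature within a meter or so of the surface",
    "salinity": "salt content",
}
v2 = {
    "sst": "informal reference to near-surface temperature",
    "salinity": "salt content",
    "sst_skin": "temperature of the surface skin layer",
}

for link in link_versions(v1, v2, "v1", "v2", make_uri):
    print(link)
```

This only compares explicit definitions; any shift in implicit meaning caused by the rest of the vocabulary changing is not captured by such links.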

> T4: A new dataset is created which uses the new URIs
> T5: The application loads the new data
> T6: The application poses a query which uses the old URIs to filter  
> data.
> T7: The new URIs do not match the old ones, so the query only
> returns data from the old URIs when it should return data from the  
> new dataset as well.
>
> This is clearly a bad thing.   Your proposal has to argue advantages  
> that offset the disadvantage here, in order for me to buy into it.

One mitigation of disadvantages is obtained if most of the users map  
to the 'most recent version' (core concept) of the term, not specific  
versions. I suspect this will be likely.
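A small sketch of how that mitigation might play out for the T6/T7 query, assuming a lookup table from each versioned URI to its core URI is available. The table, the URIs, and the record layout are invented for the example.

```python
# Hypothetical table mapping versioned URIs to their unversioned core URI.
CORE_OF = {
    "http://example.org/vocab/v1/sst": "http://example.org/vocab/sst",
    "http://example.org/vocab/v2/sst": "http://example.org/vocab/sst",
    "http://example.org/vocab/v2/salinity": "http://example.org/vocab/salinity",
}


def to_core(uri):
    """Map a versioned URI to its core URI; unknown URIs are left as they are."""
    return CORE_OF.get(uri, uri)


def select_exact(records, query_uri):
    """T6/T7 behaviour: only records labelled with exactly the query URI match."""
    return [r for r in records if r["variable"] == query_uri]


def select_by_core(records, query_uri):
    """Match any record whose label shares the query's core term."""
    target = to_core(query_uri)
    return [r for r in records if to_core(r["variable"]) == target]


records = [
    {"id": 1, "variable": "http://example.org/vocab/v1/sst"},       # old dataset
    {"id": 2, "variable": "http://example.org/vocab/v2/sst"},       # new dataset
    {"id": 3, "variable": "http://example.org/vocab/v2/salinity"},
]

old_query = "http://example.org/vocab/v1/sst"
print([r["id"] for r in select_exact(records, old_query)])
print([r["id"] for r in select_by_core(records, old_query)])
```

Matching on the core term returns data from both datasets, but it deliberately ignores whether the definition changed between versions, so those who query this way get what they pay for.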

> Of course, it is also very important to say this new resource has  
> the same definition and semantics as another, previous resource,  
> preferably pointing back to the original instance with that  
> definition/semantics.
>
> This creates an unnecessary burden and seems to contradict your  
> point that something in a different context will have different  
> semantics. If it has different semantics, then why point back to  
> something with identical semantics?

An excellent point.  (I'm busted!)  Apparently I differentiate between  
explicit meanings, which one finds in the term's resource description,  
and implicit meanings, which one finds in larger context. The version  
relationship primitives have to be understood to refer to the explicit  
meanings only. When the definition changes explicitly, that's a URI  
change that no longer can be considered exactly the same concept.

> I still can't see any advantages for creating multiple copies of  
> exactly the same thing.
> Have I missed something?

The practical advantage is the one introduced at the top -- I can  
consider and implement the vocabulary as a unit, carrying all of its  
components along with it.  Conceptually/abstractly I suspect this may  
be the right way to think of a vocabulary.

But more practically, this gives me a trivial way to generate URIs for  
those terms, a trivial way to capture the contents of each new version  
of the ontology (otherwise I have to analyze every term to decide if  
it is different, right?), a trivial way to explain to the user what  
the URI for each term will look like, and a way to tell from the term  
URI which vocabulary it's a part of (not that I'd ever do that to an  
opaque URI...).

But of course, I realize I have to go do some of these things later,
in any reasonable version of the system...I just don't have to do them
*instantly*....

> I imagine we will have to create a relationship for our own use that
> has this meaning for now.
> We probably will need some new infrastructural primitives, to relate  
> versions to each other.

Just so.

> This is a practical solution which would probably be pretty easy  
> when URIs are de-conflated with UIDs. Though proliferation of URIs  
> for the same thing should be reduced whenever possible.
>
> See another thread I started on similar topic by googling
> ["proliferation of URIs" uschold]

Excellent, I looked at the summary post and I see things with your  
level of concern, perhaps more than the responders. (Though I liked  
Tim's quote: "So multiple URIs for the same thing is life, a constant  
tradeoff, but life is, on balance good.")  I would be a relatively  
small scale offender for a while, but a bad example.

I will leave it there, too long a post for sure.

John

[1] Our URI creation scheme is described at
<http://marinemetadata.org/apguides/ontprovidersguide/ontguideconstructinguris>,
with other details in that web neighborhood.
Received on Monday, 3 November 2008 04:26:09 UTC