Re: URIs and Unique IDs from Michael F Uschold on 2008-11-03 (semantic-web@w3.org from November 2008)

From: Michael F Uschold <uschold@gmail.com>
Date: Mon, 3 Nov 2008 09:21:15 +0100
To: "John Graybeal" <graybeal@mbari.org>
Cc: semantic-web@w3.org, aldo.gangemi@gmail.com, "Conor Shankey" <cshankey@reinvent.com>, "Peter Mika" <pmika@yahoo-inc.com>, "Ora Lassila" <ora.lassila@nokia.com>, "Pan, Dr Jeff Z." <jeff.z.pan@abdn.ac.uk>, "Tim Berners-Lee" <timbl@csail.mit.edu>, "Frank van Harmelen" <Frank.van.Harmelen@cs.vu.nl>, sean.bechhofer@manchester.ac.uk
Message-ID: <406b38b50811030021u63dba14fud2ce9171dc8e32b0@mail.gmail.com>
A short reply to the main point.

Uschold said:

I still can't see any advantages for creating multiple copies of exactly the
same thing.
Have I missed something?

Graybeal said:

The practical advantage is the one introduced at the top -- I can consider
and implement the vocabulary as a unit, carrying all of its components along
with it.  Conceptually/abstractly I suspect this may be the right way to
think of a vocabulary.

If you de-conflate URIs and UIDs we can have our cake and eat it too.

The new ontology is a unit with a UID that is different than the original
one.
It is a bundle consisting of its component terms and definitions, and there
is an ontology-has-component link that points to the UIDs of the most recent
version. Done, perfect.  Humans don't create or read UIDs, machines do.
Tools and names can be used to have the user see whatever you want them to
see.  This scheme gives the advantage you want w/o minting new URIs for the
same thing.

Methinks that the conflation of URIs and UIDs makes it hard or impossible to
get this advantage unless you mint new synonym URIs.  Hence to de-conflate.

I'm convinced that something like this is the right thing to do, in
principle.  Finding out how it can be done in practice will be a lot of
work.
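
As a rough illustration of the bundle idea above, here is a minimal Python
sketch. Everything in it is an assumption made for illustration (the Term and
OntologyVersion classes, the uuid-based UIDs, the release() helper, and the
example definitions); it is not an existing Semantic Web API. It shows a new
ontology version getting its own UID while pointing at the unchanged UIDs of
terms whose definitions did not change.

```python
import uuid
from dataclasses import dataclass, field


def new_uid() -> str:
    """Mint an opaque, machine-only identifier (hypothetical scheme)."""
    return uuid.uuid4().hex


@dataclass(frozen=True)
class Term:
    uid: str         # opaque identifier, never shown to users
    name: str        # human-readable label, looked up by tools for display
    definition: str


@dataclass
class OntologyVersion:
    label: str
    components: frozenset  # UIDs of component terms (ontology-has-component)
    uid: str = field(default_factory=new_uid)


def release(previous, label, store, redefined):
    """Build a new version; only redefined terms receive new UIDs.

    `redefined` maps the UID of an existing term to its new definition.
    """
    components = set(previous.components)
    for old_uid, definition in redefined.items():
        old = store[old_uid]
        term = Term(uid=new_uid(), name=old.name, definition=definition)
        store[term.uid] = term
        components.discard(old_uid)
        components.add(term.uid)
    return OntologyVersion(label=label, components=frozenset(components))


# Usage: two terms in v1, only 'broader' is redefined in v2.
store = {}
broader = Term(new_uid(), "broader", "has a more general concept")
related = Term(new_uid(), "related", "is associated with")
for t in (broader, related):
    store[t.uid] = t

v1 = OntologyVersion("v1", frozenset({broader.uid, related.uid}))
v2 = release(v1, "v2", store,
             {broader.uid: "has a directly more general concept"})
```

In this sketch v2 is a distinct unit with its own UID, yet 'related' keeps the
UID it had in v1, so nothing new is minted for an unchanged term.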
---

BTW, this discussion has inspired me to write a paper on the topic. Probably
too late to submit to WWW conference, but I will make whatever I have
available in some manner when it is ready.

Thank you for helping me clarify my ideas on this.


Michael

On Mon, Nov 3, 2008 at 5:24 AM, John Graybeal <graybeal@mbari.org> wrote:

> Michael,
> I will try to be clearer -- your confusion was my fault, sorry.  Appreciate
> very much your comments (and charitable interpretations, n.b. 'strong
> intuition' :->!).
> On Nov 1, 2008, at 9:33 AM, Michael F Uschold wrote:
>
> Agreed.  I assume you mean the vocabulary is the ontology? Are we assuming
> OWL ontologies here, if not then what do you mean by a vocabulary?
>
>
> yes, I used 'vocabulary' to reflect what our customers have, but an OWL
> ontology is what we will generate.
>
>   B. A vocabulary contains all the terms within it, not just the terms that
>> changed in that version
>>
> Here is my folksy perspective behind the model (more justifications near
> the end):  If I say to a user "Here is a vocabulary dated X", the user will
> assume that all the terms come with that vocabulary, and the terms of that
> vocabulary are also dated X. So can I build a working semantic approach that
> accepts this assumption?
>
> So in the SKOS example, the new SKOS vocabulary/ontology would contain the
> terms that do not change URIs as well as terms with new versions with new
> URIs.
>
>
> No, sorry, I was sloppy and used 'terms' and 'URIs' interchangeably. Here
> is the easy part: You can assume that if anything in the specification of a
> term changes while its string of characters remains the same, I will insist
> on a new URI for that term (the version string will do nicely to
> discriminate). And if the string of characters for a term changes, that will
> be a new URI too.
>
> I am using  'term' to mean 'a string of characters that likely, but not
> necessarily, means something to a human'.  So codes and opaque terms are OK.
>  For most ontologies we'll create, terms will be words and word phrases.
>
> Anticipating your later comments, we concluded (you won't like this at
> all):
>  1. a URI is a suitable UID
>  2. a term can be part of a suitable URI
> The first is argued elsewhere by others.
>
> Re the second:  Since what I really wanted to do was give people a way to
> say "here's what this string of characters means", it doesn't bother me that
> the same string of characters may mean something else later -- I need the
> UID not for the _concept_, but for the unique string of characters. That
> will always be the string of characters I want that UID to refer to. So
> making the UID a URL that embeds the string of characters was acceptable.
>
> I think I understand the concerns about non-opaque and non-persistent URLs,
> and believe that those costs are relatively low compared to the resulting
> early adopter benefits of this approach. [1]
>
> This issue arises because of the conflation of URIs, UIDs and
> human-readable IDs. Until these are de-conflated, probably this principle is
> the right one. It will be unnecessary after de-conflation.
>
>
> I am unconvinced de-conflation can happen, at least in our lifetimes, which
> is why I made some of those horrible linkages above.
>
> D. It must be possible to 'look up' the current meaning of a term, as well
>> as specifically request any past meanings by their URI
>>
> I read this that a term like 'broader' in SKOS could have multiple URIs for
> multiple versions.  If this is what you mean, then I absolutely agree with
> this.
>
>
> Yes, this is what I mean, but keep in mind my previous conflation.
>
> Agreed. You seem to be proposing the idea of some kind of object (perhaps
> with a URI)  that corresponds to the core term, and that its various
> meanings and related versions are linked to the core term. This may be a
> workable idea. Can this be done with the current  semantic web
> infrastructure?
>
>
> Oh, I sure hope so. (Well, new relationships may be needed. Not an expert
> here.)  We are doing it a shade outside of the 'strict infrastructure', if
> there is such a thing -- our server will try to be smart about the
> relationships between vocabulary versions (well, it has to be, to make sure
> the version relationships are maintained).
>
>   bb. Any significant definitional or semantic change to a term should
>> really create a new term, not just evolve the word we were already using
>> (what was SKOS thinking?)
>>
> This is an interesting question with more than one reasonable position. I
> think there are at least two cases:
> 1. there was a bona fide conceptual error, and everyone agrees that the old
> meaning was the wrong one and it is not wanted.
> 2. there is a new alternative, that works in some cases, and some may also
> wish to use the older versions.
>
> For 1. you do NOT want to change the name of the term; it was and is the right
> term.  But you DO want to change its UID because it is a different thing.
>
>
> For 2, you probably want to introduce a new term with a new name and a new
> UID. You could have the name of the transitive version of broader be called
> broaderT and the non-transitive one be called broader.
>
>
> yes to all the above, well put.
>
> You should be able to change the name w/o changing the UID.
>
>
> Well, OK, maybe.  Not for my own vocabularies, because those are trying to
> define strings, not concepts. As you can see, I am hung up on which thing
> someone has in mind when they say the name -- is it the concept behind the
> name, or the name itself? I find it a lot easier to consider the name the
> resource of interest, and if someday my 'inflammable' is redefined to mean
> flammable, then my ontology will be exactly as wrong as all the books that
> used the 'old' definition of the word.  (At least until I redefine the word.
> Sure hope everyone is using timestamps. :->)
>
>   cc. Created relationships to 'most current' URIs persist even as the
>> semantics of that resource may change; this potentially introduces a time
>> quality to inferences done with these resources (e.g., "Today's New York
>> Times has an article on election polls" may be true statement today, but
>> false next week.)  Those who choose to use the 'most current' term will get
>> what they pay for.
>>
>
> You might be able to have programmatic or infrastructural capability which
> can return the 'most current' version of a given core term. There might be a
> URI/UID for the core term, and that is what would be accessed. There, a
> directive would be given that says please return the most recent version
> of that item. This is a promising idea that could probably keep everyone
> happy.
>
>>    ee. Both the provided service, and ontology engines in general, must
>> be able to relate terms to their semantically identical historical
>> counterparts
>>
> When every version of every term has its own UID, then this becomes
> feasible, though it may also be an expensive overhead.
>
>>   ff. The service should be able to quickly identify/present to its users
>> each change in semantic meaning for a term.
>>
> Yes, and an application should also be able to subscribe to the core UID
> for a concept to be notified of any changes so it can keep up to date
> automatically in the case where the most up-to-date version is wanted, and
> otherwise people can look into new versions on a case by case basis.
>
>
> Yes to all the above, and to the 'timestamps may be expensive' also.  I am
> worried about expense, but suspect I won't be able to tell for a while how
> resource-intensive this will be, and whether optimization will take care of
> it, and whether I still will be paid to "keep this problem solved."  But I
> plan as if I will...
>
> There may be some clear cut cases where you can tell which things are
>> static vs. dynamic. However IMHO, it is likely that a lot (perhaps most)
>> cases will be dependent on the needs of the application, and the same concept
>> may be dynamic in some applications and static in others.
>>
> Maybe.  If I declare the static concept is forever unvarying by definition,
> I don't think it would be strategic for an application to assume otherwise.
>
>
>> I am less sanguine about this for predicates -- it seems like you're
>> allowing replacing the engine while the car is running.
>>
> I don't follow this analogy.
>
> I can imagine a future scenario where this is advantageous for predicates,
>> but it seems really inappropriate at this stage.
>>
> You have a strong intuition that I'm not able to grasp.  Can you articulate
> why with an example?
>
>
> OK, my example uses 'sea surface temperature' as a subject, and 'sameAs'
> as a predicate. If, over time, the concept associated with 'sea surface
> temperature' evolves from "any measurement of any body of sea water within a
> meter or so of the ocean's surface" to "an informal reference to the concept
> of temperature near the ocean surface (deprecated as a reference to a
> particular measurement)", the tools I have written may produce some
> less-than-ideal inferences if they assume the new definition applies to old
> data, or vice-versa.  Even if the new definition in 100 years is
> "measurement of the temperature of the foam we keep on top of the  ocean to
> keep it cool", some inferences could be faulty, but the engine won't break
> down.
>
> But if I've originally used 'sameAs' in mappings to mean that two concepts
> are analogous in certain defined ways (maybe a faulty original practice, but
> go with it), and then the term is redefined by general consensus to mean
> "refers to the exact same resource", I have some really broken results,
> because a key piece right in the middle of my infrastructure has changed.
>  If you try to change important parts in a car while it's moving, bad things
> can happen, even if the new part is every bit as good as the old part. If we
> try to change the meaning of core terms used in semantic inferencing, then
> all the tools and things are likely to behave oddly during the change, if
> not afterwards as well.
>
> As to the multiple URIs for a single concept problem that was introduced in
>> (aa) above, I have both a justification and a backup plan.  The
>> justification is that the meaning of terms and their definitions is inferred
>> in a context, and changes to the context (the rest of the vocabulary) can
>> affect the implicit meaning, or usage, of a term that nominally wasn't
>> changed.
>>
>
> This is true, and the reason why terms/words in wordnet belong to multiple
> synsets. Each synset has a unique meaning, and in the owl dataset, each
> synset has its own URI. So I don't find your argument convincing.  Multiple
> contexts show different uses of a term, so each use should get a different
> UID, not the same one.
>
>
> This is  a different context. Example below.
>
> So even if I haven't changed the explicit definition of a term in a new
>> vocabulary release, it is meaningful to consider this term a new resource,
>> and give it a new URI, to reflect its new context.
>>
> Maybe the wordnet example is a red herring.  In any event, can you
> provide a clear example of how an application would find it helpful to have
> whole new sets of URIs minted for identical things?
>
>
> Here's a simple example, before giving you a detailed domain-specific
> example: Let's say I review a vocabulary and change 80% of the definitions.
> But the remaining definitions are deemed good and remain unchanged.  By
> virtue of being part of a heavily reviewed vocabulary, these remaining
> original terms have gained credibility -- they are more reviewed and more
> trusted than they were before that version was created.
>
> For a domain example, let's go back to sea surface temperature.  5 years
> ago, it meant something like "any measurement of any body of sea water
> within a meter or so of the ocean's surface".  More recently, data managers
> realized that wasn't specific enough. So 5 new terms were created to
> precisely delineate the different kinds of sea surface temperature.
>
> Now, if I get a set of data that uses some of these new terms to label
> variables, and also has an item labelled 'sea surface temperature', I can
> infer that the use of the broader variable meant that no more specific
> description could be provided.  Whereas in data from 5 years ago, I might
> replace the general term in many cases, by looking at other metadata to
> learn the more specific term. With the existence of the new terms, the old
> term has new connotations.
>
> Here is one example where it is clearly a bad thing.
> The application is ontology-driven at a deep level. It makes use of the
> resources in the coding/creation of application functionality. It also loads
> and makes use of data using the ontology.
> T1: application loads ontology using original terms.
> T2: application loads data expressed using the original terms
> T3: all new URIs are minted, when only a few have changed semantics, and
> there is no indication of which ones have new semantics and which have the
> same semantics.
>
>
> Well, this is bad but not unmanageable. Presumably a query of the 'before'
> and 'after' resources for those two concepts would reveal whether or not
> there are differences.  Or, presumably you can query the ontologies to get
> that info, even if you can't query the terms themselves. (Hmm, in today's
> semantic web a lot of times you don't have the original ontology versions
> either, do you?  But that would be another thing that breaks the system to
> some degree, you don't have any ability to validate previous inferences or
> see what it was like when the relationships were created, so you can't
> validate them independently. Sigh....)
>
> But in any case, I accept the challenge here and say again "it only works
> if the new URIs can say whether they are the same semantics as a previous
> version."  Otherwise, I agree it's a bad thing.
>
> T4: A new dataset is created which uses the new URIs
> T5: The application loads the new data
> T6: The application poses a query which uses the old URIs to filter data.
> T7: The new URIs do not match the old ones, so the query only returns data
> from the old URIs when it should return data from the new dataset as well.
>
> This is clearly a bad thing.   Your proposal has to argue advantages that
> offset the disadvantage here, in order for me to buy into it.
>
>
> One mitigation of disadvantages is obtained if most of the users map to the
> 'most recent version' (core concept) of the term, not specific versions. I
> suspect this will be likely.
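
A minimal Python sketch of the T1-T7 scenario and of the mitigation just
described. It assumes a hypothetical lookup table from each versioned URI to
a 'core' URI; the example.org URIs, the record layout, and the function names
are all invented for illustration.

```python
# Hypothetical URIs: one core term and two versioned URIs for it.
CORE = "http://example.org/vocab/sea_surface_temperature"
V1 = "http://example.org/vocab/v1/sea_surface_temperature"
V2 = "http://example.org/vocab/v2/sea_surface_temperature"

# Assumed registry saying which core term each versioned URI belongs to.
CORE_OF = {V1: CORE, V2: CORE}

# T2 and T5: data loaded under the old URIs and then under the new ones.
records = [
    {"id": "old-1", "variable": V1},
    {"id": "new-1", "variable": V2},
]


def filter_exact(records, uri):
    """T6/T7: match on the literal URI only."""
    return [r["id"] for r in records if r["variable"] == uri]


def filter_by_core(records, uri):
    """Mitigation: compare the core terms of both sides instead."""
    core = CORE_OF.get(uri, uri)
    return [r["id"] for r in records
            if CORE_OF.get(r["variable"], r["variable"]) == core]


exact_hits = filter_exact(records, V1)
core_hits = filter_by_core(records, V1)
```

Matching on the literal old URI finds only the old record, while matching on
the core term finds both. Note that the second filter ignores any change in
semantics between versions, so it is only appropriate where 'most recent
version' behaviour is what the user actually wants.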
>
> Of course, it is also very important to say this new resource has the same
>> definition and semantics as another, previous resource, preferably pointing
>> back to the original instance with that definition/semantics.
>>
>
> This creates an unnecessary burden and seems to contradict your point that
> something in a different context will have different semantics. If it has
> different semantics, then why point back to something with identical
> semantics?
>
>
> An excellent point.  (I'm busted!)  Apparently I differentiate between
> explicit meanings, which one finds in the term's resource description, and
> implicit meanings, which one finds in larger context. The version
> relationship primitives have to be understood to refer to the explicit
> meanings only. When the definition changes explicitly, that's a URI change
> that no longer can be considered exactly the same concept.
>
> I still can't see any advantages for creating multiple copies of exactly
> the same thing.
> Have I missed something?
>
>
> The practical advantage is the one introduced at the top -- I can consider
> and implement the vocabulary as a unit, carrying all of its components along
> with it.  Conceptually/abstractly I suspect this may be the right way to
> think of a vocabulary.
>
> But more practically, this gives me a trivial way to generate URIs for
> those terms, a trivial way to capture the contents of each new version of
> the ontology (otherwise I have to analyze every term to decide if it is
> different, right?), a trivial way to explain to the user what the URI for
> each term will look like, and a way to tell from the term URI which
> vocabulary it's a part of (not that I'd ever do that to an opaque URI...).
>
> but of course, I realize I have to go do some of these things later, in
> any reasonable version of the system...I just don't have to do them
> *instantly*....
>
> I imagine we will have to create a relationship for our own use that has
>> this meaning for now.
>>
> We probably will need some new infrastructural primitives, to relate
> versions to each other.
>
>
> Just so.
>
> This is a practical solution which would probably be pretty easy when URIs
> are de-conflated with UIDs. Though proliferation of URIs for the same thing
> should be reduced whenever possible.
>
> See another thread I started on similar topic by googling
> ["proliferation of URIs" uschold]
>
>
> Excellent, I looked at the summary post and I see things with your level of
> concern, perhaps more than the responders. (Though I liked Tim's quote: "So
> multiple URIs for the same thing is life, a constant tradeoff, but life is,
> on balance good.")  I would be a relatively small scale offender for a
> while, but a bad example.
>
> I will leave it there, too long a post for sure.
>
> John
>
> [1] Our URI creation scheme is described at
> http://marinemetadata.org/apguides/ontprovidersguide/ontguideconstructinguris ,
> with other details in that web neighborhood.
>
>
Received on Monday, 3 November 2008 08:21:57 UTC