Why IDs need to be Opaque Tokens - Long from Yaron Goland (Exchange) on 1999-09-13 (w3c-dist-auth@w3.org from July to September 1999)

From: Yaron Goland (Exchange) <yarong@Exchange.Microsoft.com>
Date: Mon, 13 Sep 1999 15:50:20 -0700
To: "'Greg Stein'" <gstein@lyra.org>, w3c-dist-auth@w3.org
Cc: "Jim Whitehead (E-mail)" <ejw@ics.uci.edu>, Henrik Frystyk Nielsen <henrikn@microsoft.com>
Message-ID: <078292D50C98D2118D090008C7E9C6A603C965EE@STAY.platinum.corp.microsoft.com>
[Ed. Note: This one is long but I think it provides a good description of
why putting useful information into IDs is almost always a bad idea.]

">" comments by Greg Stein:
> Database tables use column values all the time for primary keys. You do
> *not* have to always create an "id" column. Yes, you usually do when you
> have various foreign-key relationships (as part of the process of
> normalizing your database design), but there is *nothing* in database
> relational theory that requires the addition of an extra column simply
> to refer to rows in a database.

You are correct in that one can successfully perform relational manipulation
on systems without a dedicated row ID. In fact, one can do relational
manipulation without even having to have unique IDs.

In my experience there are two general ways in which database parsing code
is written:

Required Structure - This type of code assumes that it knows exactly what
the structure of the database is, what all of its columns are,
etc. For example, imagine one has a car database. The database has only
three columns, car color, engine type and model year. The code is written to
assume this structure. Therefore the code assumes that a car can be uniquely
identified by car color, engine type and model year. To this end it
generates an ID made up of these three values. It then puts this ID on the
car's ID plates on various parts of the car so that if the car is stolen the
parts can be potentially identified.

Minimal Structure - This type of code only assumes that a certain set of
columns will be present and makes no assumptions about what columns may be
added later. Given the same database as before, this code doesn't know what
columns may be added in the future or what new ways of identifying cars may
emerge. As such this code requires that an
additional column be added to the data, an ID column which contains a unique
opaque ID. Each car is then identified by that ID rather than by some
combination of values. Thus the code does not necessarily assume that car
color/engine type/model year is sufficient to identify the car. The only
assumptions it makes are:
	1) A car is uniquely identified by the opaque ID
	2) The database, at minimum, will contain car color, engine type and
	   model year
Therefore the IDs generated by this system are just opaque tokens. They have
no inherent meaning, and they will be stuck all over the car for the same
reasons as stated before.
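
As a minimal sketch of the two approaches, assuming a hypothetical Python
model of the car table (the names Car, required_structure_id and
minimal_structure_id are illustrative only, and a random UUID is just one
possible way to produce an opaque ID):

```python
import uuid
from dataclasses import dataclass


@dataclass(frozen=True)
class Car:
    color: str
    engine: str
    year: int


def required_structure_id(car: Car) -> str:
    # "Required structure": the ID is derived from the column values,
    # so it is only unique while those three columns identify a car.
    return f"{car.color}-{car.engine}-{car.year}"


def minimal_structure_id() -> str:
    # "Minimal structure": the ID is an opaque token that carries no
    # information about the car it is attached to.
    return uuid.uuid4().hex


car = Car("red", "V6", 1999)
derived = required_structure_id(car)
opaque = minimal_structure_id()
```

Here derived can be read back into its parts, which is exactly the
temptation; opaque can only be compared for equality or used as a lookup key.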

Now let's look at what happens when the world changes. For example, it turns
out that the car company, for the first time, will be producing cars which
have the same color, engine type and model year but different body types. So
the database needs to have a new column added to it for body type.

The "required structure" code will be completely broken. It has assumptions
all over the place that a car can be uniquely identified by the three
original values. It will no longer be able to generate useful IDs (since
they won't be unique in the new system) and the entire database system and
all systems which depend on that database system will have to be completely
retooled.

The "minimal structure" code won't have any problems at all. It never
assumed anything about what it takes to uniquely identify a car other than
some opaque ID. So one can add in a new column without problem. All the
existing code will continue to function and the supporting systems will
continue to work just fine. This isn't to say that changes won't have to be
made. Obviously the database screens will all have to be updated to include
the new data. However even if you didn't do this, for example in a search
screen, the result will just be that you will get a lot more results than
you were expecting.
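
A small sketch of that schema change, under the assumption that rows are
plain Python dicts and that the hypothetical composite_id function stands in
for the "required structure" code:

```python
import uuid


def composite_id(row):
    # Written before the "body" column existed, so it ignores it.
    return (row["color"], row["engine"], row["year"])


# Two distinct cars that differ only in the newly added column.
rows = [
    {"color": "red", "engine": "V6", "year": 1999, "body": "sedan"},
    {"color": "red", "engine": "V6", "year": 1999, "body": "wagon"},
]

# Indexing by the derived ID silently merges the two cars into one entry.
by_composite = {composite_id(r): r for r in rows}

# Indexing by an opaque ID keeps them apart without knowing about "body".
by_opaque = {uuid.uuid4().hex: r for r in rows}

composite_count = len(by_composite)
opaque_count = len(by_opaque)
```

With the derived ID the second car overwrites the first, so composite_count
is 1, while opaque_count stays at 2.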

This is why one should always keep one's row identifiers opaque.

One can argue that the real problem with the required structure is that its
original database was badly designed. What car could possibly be uniquely
identified by just car color, engine type and model year? The problem is
that we absolutely never get it right. No matter how smart we are, no matter
how hard we work, time always makes fools of us. That is the power of
systems like HTTP. It included versioning from day one (o.k., day two, but
close enough =). It assumed that it would get it wrong and left itself a way
out. 

Opaque IDs are the moral equivalent of versioning for database systems. One
writes one's code to only assume that a minimal set of values will be
present and make no assumptions regarding what may be added in the future.
For the minimal assumption to work one needs to have a unique ID. That way
if the combination of values which uniquely identify an entry should change
in the future one's code will continue to work.

Unique IDs aren't about getting V1 to work. They are about making sure that
V2, V3, etc. will work as well.

> 
> A URI is a unique value which contains useful information. If I have no
> foreign key relationships or I don't want to normalize my tables, then
> why do I have to munge that into something else? That's bunk. Useful
> info can definitely be part of an identifier. I can cite examples :-)

It is very common to put useful information into URIs. The design principle
in HTTP has been that only servers are allowed to generate URIs. Clients may
ask in a PUT or similar request for a particular URI but it is up to the
server to decide if it wants to give out that URI. As we used to say in
WebDAV, the client proposes and the server disposes. Since the server
generated the URI and only the server could actually consume the URI it
seemed reasonable to allow the server to put useful state information into
the URI so as to simplify the implementation.

I think there are many scenarios where this is true. The reason this works
is that while the server may make assumptions about what the values in the
URI mean the clients do not. The client treats the URI as being completely
opaque. This means that if the server has to later change its policy about
the format or data inside the URI no one cares but the server. This strategy
isn't perfect of course. For example, if two servers decide to coordinate in
a farm to handle resources then they will need to be running the exact same
code so they can understand the information that has been put into the URI.
Another problem is that if the server does want to change its policy in
formatting the URIs then generally any existing URIs will be broken.

In practice neither of these problems has been a big enough issue to
overcome the processing simplicity offered by encoding data in the URI. This
is the old simplicity v. extensibility tradeoff. One almost always has to
trade one for the other.

In general dynamically generated URIs are not stored by the clients and if
multiple servers are cooperating they are all running the same software, so
the versioning issues don't really matter. 

The cross-server lock case is a good counterexample. In this case two
servers running different software need to communicate and the clients are
most certainly recording the identifiers. That is one reason why it is such
a bad idea to put useful information into lock tokens. It is also a bad idea
for the reasons discussed in the previous section. The assumptions the code
will make about what is in the lock tokens will tend to "spill out" into the
rest of the server code and make the entire code base more fragile. By just
using an opaque token you don't have to worry about this fragility.

> I disagree with your assertion. Let's say that I have a repository
> (table) with two columns: a database-assigned unique serial number
> ("id"), and a binary blob. I use this table to store the resources on my
> server. When I generate a locktoken, I want that token to specify the
> resource's "id" for efficiency purposes.

The problem is that if a server wants to move its lock over to a server
which encodes IDs into its lock tokens then the transfer will fail because
the first server's lock token format won't have the same information as the
second server's format. This is a great example of why it is a bad idea to
encode information. Just keep your identifiers opaque.
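
To make the failure concrete, here is a rough sketch. The
"lock:<resource id>:<nonce>" format and all function names are made up for
illustration; the opaquelocktoken: prefix follows the UUID-based scheme in
RFC 2518, but nothing here is taken from a real server.

```python
import uuid


def issue_opaque_token():
    # First server: the token carries nothing but a unique identifier.
    return "opaquelocktoken:" + str(uuid.uuid4())


def issue_encoded_token(resource_id):
    # Second server: hypothetical format with the row ID embedded.
    return f"lock:{resource_id}:{uuid.uuid4().hex}"


def resource_id_from_token(token):
    # Second server's code assumes every token has its own format.
    scheme, _, rest = token.partition(":")
    if scheme != "lock":
        raise ValueError("unrecognized lock token format")
    resource_id, _, _nonce = rest.partition(":")
    return int(resource_id)


def accept_transferred_lock(token):
    # The transfer only succeeds if the token can be parsed.
    try:
        resource_id_from_token(token)
        return True
    except ValueError:
        return False


own_ok = accept_transferred_lock(issue_encoded_token(42))
foreign_ok = accept_transferred_lock(issue_opaque_token())
```

The second server accepts its own tokens (own_ok is True) but rejects the
first server's (foreign_ok is False), purely because of the parsing
assumption.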

 
> If the WebDAV spec does not allow that, then it is broken. As it is
> currently written, it *does*, but it seems like you're trying to say
> that all locktokens should be uniform, regardless of repository. NFW.

WebDAV does allow you to do anything you want with a lock token.
Additionally I am not suggesting we have a uniform lock token format. I am
suggesting that we always treat lock tokens as opaque tokens with no
meaningful data other than a globally unique identifier. That way we can
share lock tokens.
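
One way this could look in practice, sketched with a hypothetical in-memory
LockTable class: the server keeps whatever it needs (such as the resource's
row ID) next to the token in its own table instead of inside the token, so a
token issued elsewhere can be adopted without being parsed.

```python
import uuid


class LockTable:
    """Maps opaque lock tokens to server-private lock state."""

    def __init__(self):
        self._locks = {}

    def lock(self, resource_id):
        # The token is only a globally unique identifier.
        token = "opaquelocktoken:" + str(uuid.uuid4())
        self._locks[token] = {"resource_id": resource_id}
        return token

    def resource_for(self, token):
        # A lookup by token replaces parsing the token.
        return self._locks[token]["resource_id"]

    def adopt(self, token, resource_id):
        # Accept a lock issued by another server; the token is stored
        # as-is and never inspected.
        self._locks[token] = {"resource_id": resource_id}


server_a = LockTable()
server_b = LockTable()
token = server_a.lock(7)
server_b.adopt(token, 99)
```

Both servers now resolve the same token to their own local resource ID
(7 on server_a, 99 on server_b), at the cost of one table lookup.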
 
> I don't see how you can assert that my server would be "badly written"
> because of this feature. What makes you believe that? I see no support
> for your position.

Hopefully the preceding discussion answers your question.
 
> Coordinating with other servers is beyond the scope of WebDAV. Further,
> when/if this server-server communication begins to be designed, then
> you're going to need to deal with different locktoken designs simply
> because servers are going to want to have opaque/private locktokens. I
> know that I want custom locktokens.

I think it would be a real shame if Mod_DAV, which has shown great promise
as one of the key WebDAV platforms, would make a design decision whose
ramifications were to weaken its ability to be extended in the future and
effectively prevent it from being able to implement server-to-server
communication protocols. I hope, in light of the reasons I have listed
above, that you reconsider your decision and stick with an opaque model.

> Excellent point! I agree with you here. That leads me to believe that
> locks MUST NOT move with the resource since the Destination may not
> support the transfer of the locktoken (heck, what happens if the
> Destination doesn't support locking!).

In the short and probably medium term I agree with you that locks cannot be
moved, although for very different reasons. Long term, however, if we are to
create a robust object manipulation system we must provide for the movement
of locks. We can't have a situation where resources are bound based on their
physical presence on a particular server. We need to make the place where
the bits reside irrelevant. People who are writing WebDAV servers today who
wish to be able to support this bright new future of inter-server
communication can prepare for that day by not making the mistake of encoding
useful information into their lock tokens. People who make the decision to
use opaque tokens can take comfort that their design choice will most likely
have also made their code more robust and easier to extend in the future.

> 
> However, I still disagree on your notion that locktokens cannot be
> defined by the server (to include useful info). And I *really* disagree
> with your (unsupported) view that this implies the server is badly
> written.

The server can include whatever it wants in a lock token; I am simply
suggesting it shouldn't include anything but an opaque ID. Either way, I
hope the preceding discussion provides support for my view that a server that uses
non-opaque tokens will almost certainly not be able to interoperate with
other servers in the future and most likely has a very fragile code base
that will not be easily extensible.

> 
> Cheers,
> -g
> 
> --
> Greg Stein, http://www.lyra.org/
> 

		Yaron
Received on Monday, 13 September 1999 18:51:40 UTC