- From: Yaron Goland (Exchange) <yarong@Exchange.Microsoft.com>
- Date: Mon, 13 Sep 1999 15:50:20 -0700
- To: "'Greg Stein'" <gstein@lyra.org>, w3c-dist-auth@w3.org
- Cc: "Jim Whitehead (E-mail)" <ejw@ics.uci.edu>, Henrik Frystyk Nielsen <henrikn@microsoft.com>
[Ed. Note: This one is long but I think it provides a good description of why putting useful information into IDs is almost always a bad idea.] ">" comments by Greg Stein: > Database tables use column values all the time for primary > keys. You do > *not* have to always create an "id" column. Yes, you usually > do when you > have various foreign-key relationships (as part of the process of > normalizing your database design), but there is *nothing* in database > relational theory that requires the addition of an extra column simply > to refer to rows in a database. You are correct in that one can successfully perform relational manipulation on systems without a dedicated row ID. In fact, one can do relational manipulation without even having to have unique IDs. In my experience there are two general ways in which database parsing code is written: Required Structure - In this type of code the code assumes that it knows exactly what the structure of the database is, what all of its columns are, etc. For example, imagine one has a car database. The database has only three columns, car color, engine type and model year. The code is written to assume this structure. Therefore the code assumes that a car can be uniquely identified by car color, engine type and model year. To this end it generates an ID made up of these three values. It then puts this ID on the car's ID plates on various parts of the car so that if the car is stolen the parts can be potentially identified. Minimal Structure - In this type of code the code only assumes that a certain set of columns will be present but makes no assumptions about what columns may be added later. Given the same database as before this code doesn't know what columns may be added in the future or what sorts of new ways in which cars may be identified. As such this code requires that an additional column be added to the data, an ID column which contains a unique opaque ID. Each car is then identified by that ID rather than by some combination of values. Thus the code does not necessarily assume that car color/engine type/model year is sufficient to identify the car. The only assumptions it makes are: 1) A car is uniquely identified by the opaque ID 2) The database, at minimum, will contain car color, engine type and model year Therefore the IDs generated by this system are just opaque tokens, they have no inherent meaning and will be stuck all over the car for the same reasons as stated before. Now lets look at what happens when the world changes. For example, it turns out that the car company, for the first time, will be producing cars which have the same color, engine type and model year but different body types. So the database needs to have a new column added to it for body type. The "required structure" code will be completely broken. It has assumptions all over the place that a car can be uniquely identified by the three original values. It will no longer be able to generate useful IDs (since they won't be unique in the new system) and the entire database system and all systems which depend on that database system will have to be completely retooled. The "minimal structure" code won't have any problems at all. It never assumed anything about what it takes to uniquely identify a car other than some opaque ID. So one can add in a new column without problem. All the existing code will continue to function and the supporting systems will continue to work just fine. This isn't to say that changes won't have to be made. Obviously the database screens will all have to be updated to include the new data. However even if you didn't do this, for example in a search screen, the result will just be that you will get a lot more results than you were expecting. This is why one should always keep one's row identifiers opaque. One can argue that the real problem with the required structure is that its original database was badly designed. What car could possibly be uniquely identified by just car color, engine type and model year? The problem is that we absolutely never get it right. No matter how smart we are, no matter how hard we work, time always makes fools of us. That is the power of systems like HTTP. It included versioning from day one (o.k., day two, but close enough =). It assumed that it would get it wrong and left itself a way out. Opaque IDs are the moral equivalent of versioning for database systems. One writes one's code to only assume that a minimal set of values will be present and make no assumptions regarding what may be added in the future. For the minimal assumption to work one needs to have a unique ID. That way if the combination of values which uniquely identify an entry should change in the future one's code will continue to work. Unique IDs aren't about getting V1 to work. They are about making sure that V2, V3, etc. will work as well. > > A URI is a unique value which contains useful information. If > I have no > foreign key relationships or I don't want to normalize my tables, then > why do I have to munge that into something else? That's bunk. Useful > info can definitely be part of an identifier. I can cite examples :-) It is very common to put useful information into URIs. The design principal in HTTP has been that only servers are allowed to generate URIs. Clients may ask in a PUT or similar request for a particular URI but it is up to the server to decide if it wants to give out that URI. As we used to say in WebDAV, the client proposes and the server disposes. Since the server generated the URI and only the server could actually consume the URI it seemed reasonable to allow the server to put useful state information into the URI so as to simplify the implementation. I think there are many scenarios where this is true. The reason this works is that while the server may make assumptions about what the values in the URI mean the clients do not. The client treats the URI as being completely opaque. This means that if the server has to later change its policy about the format or data inside the URI no one cares but the server. This strategy isn't perfect of course. For example, if two servers decide to coordinate in a farm to handle resources then they will need to be running the exact same code so they can understand the information that has been put into the URI. Another problem is that if the server does want to change its policy in formatting the URIs then generally any existing URIs will be broken. In practice neither of these problems have been a big enough issue to overcome the processing simplicity offered by encoding data in the URI. This is the old simplicity v. extensibility tradeoff. One almost always has to trade one for the other. In general dynamically generated URIs are not stored by the clients and if multiple servers are cooperating they are all running the same software, so the versioning issues don't really matter. The cross-server lock case is a good counter example. In this case two servers running different software need to communicate and the clients are most certainly recording the identifiers. That is one reason why it is such a bad idea to put useful information into lock tokens. It is also a bad idea for the reasons discussed in the previous section. The assumptions the code will make about what is in the lock tokens will tend to "spill out" into the rest of the server code and make the entire code base more fragile. By just using an opaque token you don't have to worry about this fragility. > I disagree with your assertion. Let's say that I have a repository > (table) with two columns: a database-assigned unique serial number > ("id"), and a binary blob. I use this table to store the > resources on my > server. When I generate a locktoken, I want that token to specify the > resource's "id" for efficiency purposes. The problem is that if a server wants to move its lock over to a server which encodes IDs into its lock tokens then the transfer will fail because the first server's lock token format won't have the same information as the second server's format. This is a great example of why it is a bad idea to encode information. Just keep your identifiers opaque. > If the WebDAV spec does not allow that, then it is broken. As it is > currently written, it *does*, but it seems like you're trying to say > that all locktokens should be uniform, regardless of repository. NFW. WebDAV does allow you to do anything you want with a lock token. Additionally I am not suggesting we have a uniform lock token format. I am suggesting that we always treat lock tokens as opaque tokens with no meaningful data other than a globally unique identifier. That way we can share lock tokens. > I don't see how you can assert that my server would be "badly written" > because of this feature. What makes you believe that? I see no support > for your position. Hopefully the previous answers your question. > Coordinating with other servers is beyond the scope of > WebDAV. Further, > when/if this server-server communication begins to be designed, then > you're going to need to deal with different locktoken designs simply > because servers are going to want to have opaque/private locktokens. I > know that I want custom locktokens. I think it would be a real shame if Mod_DAV, which has shown great promise as one of the key WebDAV platforms, would make a design decision whose ramifications were to weaken its ability to be extended in the future and effectively prevent it from being able to implement server-to-server communication protocols. I hope, in light of the reasons I have listed above, that you reconsider your decision and stick with an opaque model. > Excellent point! I agree with you here. That leads me to believe that > locks MUST NOT move with the resource since the Destination may not > support the transfer of the locktoken (heck, what happens if the > Destination doesn't support locking!). In the short and probably medium term I agree with you that locks can not be moved, although for very different reasons. Long term, however, if we are to create a robust object manipulation system we must provide for the movement of locks. We can't have a situation where resources are bound based on their physical presence on a particular server. We need to make the place where the bits reside irrelevant. People who are writing WebDAV servers today who wish to be able to support this bright new future of inter-server communication can prepare for that day by not making the mistake of encoding useful information into their lock tokens. People who make the decision to use opaque tokens can take comfort that their design choice will most likely have also made their code more robust and easier to extend in the future. > > However, I still disagree on your notion that locktokens cannot be > defined by the server (to include useful info). And I > *really* disagree > with your (unsupported) view that this implies the server is badly > written. The server can include whatever it wants in a lock token, I am simply suggesting it shouldn't include anything but an opaque ID. Either way, I hope the previous provides support for my view that a server that uses non-opaque tokens will almost certainly not be able to interoperate with other servers in the future and most likely has a very fragile code base that will not be easily extensible. > > Cheers, > -g > > -- > Greg Stein, http://www.lyra.org/ > Yaron
Received on Monday, 13 September 1999 18:51:40 UTC