- From: Larry Masinter <masinter@parc.xerox.com>
- Date: Fri, 16 May 1997 17:02:03 PDT
- To: Koen Holtman <koen@win.tue.nl>
- Cc: http-wg@cuckoo.hpl.hp.com

# You need to spell out why decoding is worse than the other
# alternatives.

Koen, spelling this out is painful. It's part of the fundamentals of how network protocols are designed and implemented. I don't claim that network protocol design is my specialization, but rather that the reasons why decoding is worse than the other alternatives are so commonplace that they *should* go without saying.

---

It is common in many network protocols to have a value which is taken from an enumerated set of alternatives. There is a fixed set of choices, the sender designates a choice, and the recipient recognizes the choice as one established by the protocol. It is also common to have the enumerated set be extensible, either by revision of the protocol, a registration authority for new values, or a distributed registration method.

In classic network protocols, enumerated values are often represented by incrementing bit patterns (e.g., "0" means 'turn device on' and "1" means 'turn device off'), or hierarchical ones (such as ISO object identifiers). However, in many Internet application protocols, enumerated values are written out as a sequence of ASCII characters, in order to simplify debugging (watching packet traces) and programming, e.g., printf("GET %s HTTP/1.1\n", url).

Feature tags and PEP extension tags are instances of the general class of "extensible set of enumerated values". The idea that we might use URI space as a way of generating new elements of the extensible set of enumerated values, in order to distribute the name space assignment, is cute, but it doesn't change the fundamental nature of the tags as enumerated values and not general strings.

The primary role of enumerated values in the implementation of recipients is to compare and dispatch.

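As a minimal sketch of what "compare and dispatch" looks like in a recipient, assuming hypothetical handler names and a simplified request line (none of this comes from a particular implementation): the received token is compared octet for octet against the known set, with no decoding step in between.

```python
# Hypothetical handlers; the names are illustrative only.
def handle_get(target):
    return ("GET", target)

def handle_put(target):
    return ("PUT", target)

def handle_post(target):
    return ("POST", target)

# The enumerated set: a fixed table of tokens the protocol defines.
DISPATCH = {"GET": handle_get, "PUT": handle_put, "POST": handle_post}

def dispatch(request_line):
    """Split off the method token and dispatch on an exact match."""
    method, target, _version = request_line.split(" ", 2)
    handler = DISPATCH.get(method)  # exact comparison, no %XX decoding, no case folding
    if handler is None:
        return ("UNKNOWN", method)
    return handler(target)

print(dispatch("GET /index.html HTTP/1.1"))
print(dispatch("%47ET /index.html HTTP/1.1"))
```

In this sketch "%47ET" is simply an unknown token, even though %47 would decode to "G"; the recipient never has to ask what a token "really" spells.
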
In some Internet protocols where strings are used for enumerated values, it is sometimes dictated that the strings will be case-normalized before they are actually compared; having to do so is an unfortunate drawback of using strings, with the ease of debugging and programming being the compensation.

Merely because an enumerated value is represented in some protocols as an ASCII string does not mean that it should get the general treatment of "text". We do not allow hex-encoding inside "GET" and "PUT" and "POST", and we don't change them to other tokens, even if the client and servers are both configured for Chinese. We don't allow mail headers to be rewritten as "Nach:" or "Från:", but keep them "To:" and "From:".

Note that sometimes enumerated values are used along with associated values to form a set of "name/value pairs", although the word "name" has led many who are unfamiliar with network protocol design to believe that it denotes a personal name, such as "Larry" or "Dürst". In the case of feature tags, the feature tag itself is chosen from an enumerated set, but the associated value may (for some feature tags) indeed be text, and indeed require some amount of text normalization. This is similar to email headers, where the values of "To:" and "From:" might contain textual representations.

The actual equivalence relationship of URLs (i.e., the decision as to whether two URLs actually locate the same resource) is scheme specific; admittedly, there are several heuristic subsets of the equivalence (e.g., de-encoding %XX hex for hex bytes that are known not to be 'reserved' for the scheme in which they appear, or case-folding the host name in ftp URLs), but those are clearly not the definitive equivalence relationship.

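To make the difference concrete, here is a rough sketch of one such heuristic next to plain string comparison. The set of "unreserved" characters, the helper names, and the example.org URLs are all assumptions chosen for illustration; a different scheme or a different heuristic would give a different answer, which is the point.

```python
import re
import string
from urllib.parse import urlsplit, urlunsplit

# Assumed set of characters that are safe to decode; a real scheme may differ.
UNRESERVED = set(string.ascii_letters + string.digits + "-._~")

def decode_unreserved(text):
    """Decode %XX only when it stands for a character in UNRESERVED."""
    def repl(match):
        ch = chr(int(match.group(1), 16))
        # Leave reserved or non-ASCII escapes encoded, upper-casing the hex digits.
        return ch if ch in UNRESERVED else match.group(0).upper()
    return re.sub(r"%([0-9A-Fa-f]{2})", repl, text)

def heuristic_key(url):
    """One possible heuristic: fold host case, decode unreserved path escapes."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                       decode_unreserved(parts.path), parts.query, parts.fragment))

a = "http://example.org/tags/color"
b = "http://EXAMPLE.org/tags/c%6Flor"
c = "http://example.org/tags%2Fcolor"

print(a == b)                                # exact comparison
print(heuristic_key(a) == heuristic_key(b))  # heuristic comparison
print(heuristic_key(a) == heuristic_key(c))  # %2F is reserved, so it stays encoded
```

A recipient using exact comparison treats `a` and `b` as two different tags; one using this heuristic treats them as the same tag. Two implementations that pick different heuristics will disagree about which tags match.
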
That URLs (or, more specifically, URLs with a heuristic equivalence relationship) are not very good candidates for use in the role of a protocol's extensible set of enumerated values is not a particular criticism of URLs in general; URLs will not shine your shoes, and URLs are not a good way to encode arbitrary programming constructs (even though it might be possible to write APL programs using %xx encoded UTF-8 representation of APL characters).

I hope you consider this an adequate "spelling out" of why applying decoding of %XX hex encodings to URLs is "bad"; if it isn't, I give up.

Regards,

Larry
--
http://www.parc.xerox.com/masinter

Received on Friday, 16 May 1997 18:40:05 UTC