Re: Rewrite of feature tag syntax rules from Larry Masinter on 1997-05-17 (ietf-http-wg@w3.org from April to June 1997)

From: Larry Masinter <masinter@parc.xerox.com>
Date: Fri, 16 May 1997 17:02:03 PDT
To: Koen Holtman <koen@win.tue.nl>
Cc: http-wg@cuckoo.hpl.hp.com
Message-Id: <337CF57B.5473@parc.xerox.com>
# You need to spell out why decoding is worse than the other
# alternatives.

Koen, spelling this out is painful. It's part of the fundamentals
of how network protocols are designed and implemented. I
don't claim that network protocol design is my specialization,
but rather that the reasons why decoding is worse than
other alternatives are so commonplace that they *should* go
without saying.
---
It is common in many network protocols to have a value which
is taken from an enumerated set of alternatives. There
is a fixed set of choices: the sender designates a choice,
and the recipient recognizes the choice as one established
by the protocol. It is also common to have the enumerated
set be extensible, either by revision of the protocol,
a registration authority for new values, or a distributed
registration method.

In classic network protocols, enumerated values are often
represented by incrementing bit patterns (e.g., "0" means
'turn device on' and "1" means 'turn device off'), or hierarchical
ones (such as ISO object identifiers). However, in many Internet
application protocols, enumerated values are written out as a 
sequence of ASCII characters, in order to simplify debugging
(watching packet traces) and programming, e.g.,
printf("GET %s HTTP/1.1\n", url).

Feature tags and PEP extension tags are instances of the
general class of "extensible set of enumerated values". The
idea that we might use URI space as a way of generating
new elements of the extensible set of enumerated values
in order to distribute the name space assignment is cute,
but it doesn't change the fundamental nature of the tags
as enumerated values and not general strings.

The primary role of enumerated values in the implementation
of recipients is to compare and dispatch. In some Internet
protocols where strings are used for enumerated values,
the specification dictates that the strings be case-normalized
before they are actually compared; having to do so is
an unfortunate drawback of using strings, with the ease
of debugging and programming being the compensation.
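
As a rough illustration of "compare and dispatch", here is a minimal
sketch in C. The enum, the table, and the function name lookup_method
are hypothetical, invented for this example and not taken from any
HTTP implementation. It assumes the recipient matches the received
token byte for byte, with no decoding or normalization step.

```c
#include <string.h>

/* Hypothetical enumeration of the values a recipient knows about. */
enum method { METHOD_GET, METHOD_PUT, METHOD_POST, METHOD_UNKNOWN };

/* Map a received token to an enumerated value by exact comparison.
   The token is not decoded, case-folded, or otherwise normalized. */
enum method lookup_method(const char *token)
{
    static const struct {
        const char *name;
        enum method value;
    } table[] = {
        { "GET",  METHOD_GET  },
        { "PUT",  METHOD_PUT  },
        { "POST", METHOD_POST },
    };
    size_t i;

    for (i = 0; i < sizeof table / sizeof table[0]; i++) {
        if (strcmp(token, table[i].name) == 0)
            return table[i].value;
    }
    return METHOD_UNKNOWN;
}
```

Under these assumptions, "GET" is recognized, while "get" and
"%47ET" are simply unknown tokens; the recipient never has to decide
how much normalization is enough before it can dispatch.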

Merely because an enumerated value is represented in
some protocols as an ASCII string does not mean that
it should get the general treatment of "text".
We do not allow hex-encoding inside "GET" and "PUT" and "POST"
and we don't change them to other tokens, even if the client
and servers are both configured for Chinese.  We don't
allow mail headers to be rewritten as "Nach:" or "Frå:",
but keep them "To:" and "From:".

Note that sometimes enumerated values are used along
with associated values to form a set of "name/value pairs",
although the word "name" has led many who are unfamiliar
with network protocol design to believe that it denotes
a personal name, such as "Larry" or "Dürst". In the
case of feature tags, the feature tag itself is
chosen from an enumerated set, but the associated value
may (for some feature tags) indeed be text, and indeed
require some amount of text normalization. This is
similar to email headers, where the values of "To:"
and "From:" might contain textual representations.

The actual equivalence relationship of URLs (i.e., the 
decision as to whether two URLs actually locate the
same resource) is scheme specific; admittedly, there
are several heuristic subsets of the equivalence (e.g.,
de-encoding %XX hex for hex bytes that are known not
to be 'reserved' for the scheme in which they appear,
or case-folding the host name in ftp URLs), but those
are clearly not the definitive equivalence relationship.
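
To make the heuristic nature of this concrete, here is a hedged
sketch in C of the %XX decoding heuristic. The function names
decode_unreserved and same_after_decoding, the fixed 256-byte
buffers, and the idea of passing the reserved set as a string are
all assumptions made for this example; no specification defines
them. The point is only that the answer changes with the reserved
set the caller assumes for the scheme.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Copy `in` to `out`, decoding each %XX escape unless the decoded byte
   is NUL or appears in `reserved` (the caller's assumption about the
   scheme). Returns 0 on success, -1 if `out` is too small. */
int decode_unreserved(const char *in, char *out, size_t outsize,
                      const char *reserved)
{
    size_t n = 0;

    if (outsize == 0)
        return -1;
    while (*in != '\0') {
        char c = *in;
        size_t step = 1;

        if (c == '%' && isxdigit((unsigned char)in[1])
                     && isxdigit((unsigned char)in[2])) {
            char hex[3] = { in[1], in[2], '\0' };
            char decoded = (char)strtol(hex, NULL, 16);

            /* Leave the escape untouched for NUL and reserved bytes. */
            if (decoded != '\0' && strchr(reserved, decoded) == NULL) {
                c = decoded;
                step = 3;
            }
        }
        if (n + 1 >= outsize)
            return -1;
        out[n++] = c;
        in += step;
    }
    out[n] = '\0';
    return 0;
}

/* Heuristic equivalence: compare two strings after decoding.
   Returns 1 if they match, 0 if they differ or do not fit. */
int same_after_decoding(const char *a, const char *b, const char *reserved)
{
    char da[256], db[256];

    if (decode_unreserved(a, da, sizeof da, reserved) != 0 ||
        decode_unreserved(b, db, sizeof db, reserved) != 0)
        return 0;
    return strcmp(da, db) == 0;
}
```

With a reserved set of "/?#", the strings "http://a.example/%7Euser"
and "http://a.example/~user" compare equal after decoding, while
"http://a.example/a%2Fb" and "http://a.example/a/b" do not; with an
empty reserved set the second pair compares equal too. A plain
strcmp() on the raw strings treats every one of these as distinct,
and that is the only comparison that needs no knowledge of the scheme.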

That URLs (or, more specifically, URLs with
a heuristic equivalence relationship) are not very
good candidates for use in the role of a protocol's
extensible set of enumerated values is not a particular
criticism of URLs in general; URLs will not shine your
shoes, URLs are not a good way to encode arbitrary
programming constructs (even though it might
be possible to write APL programs using a %XX-encoded
UTF-8 representation of APL characters).

I hope you consider this an adequate "spelling out"
of why applying decoding of %XX hex encodings to
URLs is "bad"; if it isn't, I give up.

Regards,

Larry
--
http://www.parc.xerox.com/masinter
Received on Friday, 16 May 1997 18:40:05 UTC