- From: Mike Brown <mike@skew.org>
- Date: Tue, 22 Feb 2005 20:15:17 -0700 (MST)
- To: www-forms@w3.org
Proposal:
Issue the following erratum to XForms 1.0 section 11.6:
In serialization step 2, replace "non-ASCII and reserved" with
"non-unreserved", and replace "amended" with "amended or superseded".
Alternatively...
In serialization step 2, replace "non-ASCII and reserved" with
"non-unreserved", replace "RFC 2396" with "RFC 3986", and replace
"amended" with "amended or superseded".
Either change would alter the behavior of conforming processors.
Rationale:
The definition of the application/x-www-form-urlencoded media type is found in
the HTML specs. The original definition was written for an ASCII-centric world
in which bytes & characters are interchangeable concepts, and it
mischaracterizes the intent of the RFC it is based on. However, it has
obviously provided a good enough foundation that implementers have been able
to make good use of it.
So far, the only update to the underspecified
application/x-www-form-urlencoded media type since the publication of HTML 4
is found in XForms 1.0 sec. 11.6 (2003-10-14 W3C Rec.). The new definition is
the same as the old, except in these two ways:
(1) Instead of "non-alphanumeric" characters being percent-encoded,
it is now "non-ASCII" and "reserved (as defined by RFC 2396 as
amended by subsequent documents in the IETF track)"; and...
(2) UTF-8, rather than ASCII, is now the basis for percent-encoding.
(This officially opens up the possibility of percent-encoding any
character.)
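To make point (2) concrete, here is a rough sketch in Python (my choice of
language for illustration; nothing in XForms or HTML prescribes it). The
helper name percent_encode is hypothetical. It relies on the standard
library's urllib.parse.quote, which percent-encodes the bytes of whatever
character encoding it is told to use:

```python
from urllib.parse import quote


def percent_encode(text, encoding="utf-8"):
    """Percent-encode every byte that quote() does not consider always safe.

    safe="" stops quote() from leaving "/" unescaped, which is its default.
    """
    return quote(text, safe="", encoding=encoding)


# "é" is U+00E9. In UTF-8 it is the two bytes C3 A9, so two escapes appear.
print(percent_encode("é"))             # %C3%A9
# The same character in ISO-8859-1 is the single byte E9.
print(percent_encode("é", "latin-1"))  # %E9
```

The two results differ, which is why the definition has to name the encoding
that percent-encoding is based on.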
There is no harm in specifying the use of UTF-8; in fact, it's generally a
good thing. But there are two problems with the change to which characters
need to be percent-encoded.
One is the phrasing "as amended by". A strict reading of this would be that
only "amendments" count, which means RFC 2732 (which adds "[" and "]" to the
reserved characters) is the only RFC 2396 update that qualifies. RFC 3986,
which renders RFC 2396 obsolete (it supersedes, not amends), would not be
applicable. I think the intent was to track whatever "reserved
character"-defining spec is current. If that's the case, then the phrasing
should be changed.
The other problem is more serious. RFC 2396 + RFC 2732 define these reserved
characters:
; / ? : @ & = + $ , [ ]
...and RFC 3986 defines these:
! * ' ( ) ; : @ & = + $ , / ? # [ ]
If the set of characters to percent-encode is left defined as just "reserved +
non-ASCII", then a significant number of ASCII-range characters will remain
unescaped: for example, control characters, space, various non-alphanumerics,
and, most deadly, "%"! Thus the new definition of which characters to
percent-encode is too narrow and needs to be expanded so that it is closer to
the original definition, which said that all "non-alphanumeric" characters
were to be escaped. My suggestion is to change it to say that all characters
except those defined as "unreserved" by RFC 2396 or its successors must be
percent-encoded. Unreserved characters are those that never need to be
percent-encoded, and that have "equivalent" semantics when they are
percent-encoded.
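The difference between the two rules can be sketched as follows. This is an
illustration only, under stated assumptions: it uses the RFC 3986 character
sets, the function names (encode, reserved_or_non_ascii, non_unreserved) are
made up for this message, and it deliberately ignores the form-encoding
convention of turning spaces into "+".

```python
import string

# RFC 3986 sec. 2.3: ALPHA / DIGIT / "-" / "." / "_" / "~"
UNRESERVED = set(string.ascii_letters + string.digits + "-._~")
# RFC 3986 sec. 2.2: gen-delims plus sub-delims
RESERVED = set(":/?#[]@" + "!$&'()*+,;=")


def encode(text, must_escape):
    """Percent-encode the UTF-8 bytes of each character selected by must_escape."""
    out = []
    for ch in text:
        if must_escape(ch):
            out.append("".join(f"%{b:02X}" for b in ch.encode("utf-8")))
        else:
            out.append(ch)
    return "".join(out)


def reserved_or_non_ascii(ch):
    """The rule as currently worded: escape reserved and non-ASCII only."""
    return ord(ch) > 127 or ch in RESERVED


def non_unreserved(ch):
    """The proposed rule: escape everything that is not unreserved."""
    return ch not in UNRESERVED


sample = "100% of a&b\n"
print(repr(encode(sample, reserved_or_non_ascii)))  # '100% of a%26b\n'
print(repr(encode(sample, non_unreserved)))         # '100%25%20of%20a%26b%0A'
```

Under the first rule the literal "%", the spaces, and the newline all pass
through untouched, so a receiver cannot tell a literal "%" from the start of
an escape sequence. Under the second rule every such character is escaped.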
Since RFCs 2396 and 2732 are now obsolete, perhaps RFC 3986 should provide
the definition of "unreserved" instead.
-Mike
Received on Wednesday, 23 February 2005 03:15:17 UTC