- From: Mike Brown <mike@skew.org>
- Date: Tue, 22 Feb 2005 20:15:17 -0700 (MST)
- To: www-forms@w3.org
Proposal:

Issue the following erratum to XForms 1.0 section 11.6:

  In serialization step 2, replace "non-ASCII and reserved" with
  "non-unreserved", and replace "amended" with "amended or superseded".

Alternatively...

  In serialization step 2, replace "non-ASCII and reserved" with
  "non-unreserved", replace "RFC 2396" with "RFC 3986", and replace
  "amended" with "amended or superseded".

The proposed changes would change the behavior of conforming processors.

Rationale:

The definition of the application/x-www-form-urlencoded media type is found in the HTML specs. The original definition was written for an ASCII-centric world in which bytes and characters are interchangeable concepts, and it mischaracterizes the intent of the RFC it is based on. However, it has obviously provided a good enough foundation that implementers have been able to make good use of it.

So far, the only update to the underspecified application/x-www-form-urlencoded media type since the publication of HTML 4 is found in XForms 1.0 sec. 11.6 (2003-10-14 W3C Rec.). The new definition is the same as the old, except in these two ways:

(1) Instead of "non-alphanumeric" characters being percent-encoded, it is now "non-ASCII" and "reserved (as defined by RFC 2396 as amended by subsequent documents in the IETF track)"; and...

(2) Instead of ASCII, UTF-8 is used as the basis for percent-encoding. (This officially opens up the possibility of percent-encoding any character.)

There is no harm in specifying the use of UTF-8; in fact, it's generally a good thing. But there are two problems with the change to which characters need to be percent-encoded.

One is the phrasing "as amended by". A strict reading of this would be that only "amendments" count, which means RFC 2732 (which adds "[" and "]" to the reserved characters) is the only RFC 2396 update that qualifies. RFC 3986, which renders RFC 2396 obsolete (it supersedes, not amends), would not be applicable.
I think the intent was to track whatever "reserved character"-defining spec is current. If that's the case, then the phrasing should be changed.

The other problem is more serious. RFC 2396 + RFC 2732 define these reserved characters:

  ; / ? : @ & = + $ , [ ]

...and RFC 3986 defines these:

  ! * ' ( ) ; : @ & = + $ , / ? # [ ]

If the set of characters to percent-encode is left defined as just "reserved + non-ASCII", then a significant number of ASCII-range characters will remain unescaped. For example, control characters, space, various non-alphanums, and, most deadly, "%"!

Thus the new definition of which characters to percent-encode is too narrow and needs to be expanded so that it is closer to the original definition, which said that all "non-alphanumeric" characters were to be escaped.

My suggestion is to change it to say that all characters except those defined as "unreserved" by RFC 2396 or its successors must be percent-encoded. Unreserved characters are those that never need to be percent-encoded, and that have "equivalent" semantics when they are percent-encoded.

Since RFCs 2396 and 2732 are now obsolete, perhaps RFC 3986 should provide the definition of "unreserved" instead.

-Mike
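To make the difference between the two wordings concrete, here is a minimal Python sketch. It is an illustration only: the function names are hypothetical, the reserved set is assumed to be the RFC 2396 + RFC 2732 one listed above, the unreserved set is assumed to be the RFC 3986 one, and the other serialization steps of section 11.6 (such as the special handling of spaces and separators) are deliberately left out.

```python
# Sketch only: all names here are hypothetical and are not taken from
# XForms or from any RFC. Other serialization steps are omitted.

# RFC 3986 "unreserved": ALPHA / DIGIT / "-" / "." / "_" / "~"
UNRESERVED = frozenset(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789-._~"
)

# RFC 2396 reserved characters, plus "[" and "]" from RFC 2732
RESERVED_2396 = frozenset(";/?:@&=+$,[]")


def _escape(ch: str) -> str:
    """Percent-encode one character via its UTF-8 bytes."""
    return "".join(f"%{b:02X}" for b in ch.encode("utf-8"))


def encode_reserved_or_non_ascii(text: str) -> str:
    """Current wording: escape only reserved and non-ASCII characters."""
    return "".join(
        _escape(ch) if ch in RESERVED_2396 or ord(ch) > 127 else ch
        for ch in text
    )


def encode_non_unreserved(text: str) -> str:
    """Proposed wording: escape every character that is not unreserved."""
    return "".join(ch if ch in UNRESERVED else _escape(ch) for ch in text)


if __name__ == "__main__":
    sample = "100% café & tea"
    print(encode_reserved_or_non_ascii(sample))  # 100% caf%C3%A9 %26 tea
    print(encode_non_unreserved(sample))  # 100%25%20caf%C3%A9%20%26%20tea
```

Under the "reserved + non-ASCII" reading, the literal "%" and the spaces pass through untouched, so a receiver cannot tell a literal "%" from the start of an escape sequence. Under the "non-unreserved" reading they become "%25" and "%20".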
Received on Wednesday, 23 February 2005 03:15:17 UTC