Proposed XForms 1.0 erratum for application/x-www-form-urlencoded from Mike Brown on 2005-02-23 (www-forms@w3.org from February 2005)

From: Mike Brown <mike@skew.org>
Date: Tue, 22 Feb 2005 20:15:17 -0700 (MST)
To: www-forms@w3.org
Message-Id: <200502230315.j1N3FHGh017716@chilled.skew.org>
Proposal:

Issue the following erratum to XForms 1.0 section 11.6:

  In serialization step 2, replace "non-ASCII and reserved" with
  "non-unreserved", and replace "amended" with "amended or superseded".

  Alternatively...

  In serialization step 2, replace "non-ASCII and reserved" with
  "non-unreserved", replace "RFC 2396" with "RFC 3986", and replace
  "amended" with "amended or superseded".


The proposed changes would change the behavior of conforming processors.


Rationale:

The definition of the application/x-www-form-urlencoded media type is found in 
the HTML specs. The original definition was written for an ASCII-centric world 
in which bytes and characters are interchangeable concepts, and it 
mischaracterizes the intent of the RFC it is based on. However, it has 
obviously provided a good enough foundation that implementers have been able 
to make good use of it.

So far, the only update to the underspecified 
application/x-www-form-urlencoded media type since the publication of HTML 4 
is found in XForms 1.0 sec. 11.6 (2003-10-14 W3C Rec.). The new definition is 
the same as the old, except in these two ways:

  (1) Instead of "non-alphanumeric" characters being percent-encoded,
      it is now "non-ASCII" and "reserved (as defined by RFC 2396 as
      amended by subsequent documents in the IETF track)"; and...

  (2) UTF-8, rather than ASCII, is now the basis for percent-encoding.
      (This officially opens up the possibility of percent-encoding
      any character.)
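
To make point (2) concrete, here is a minimal Python sketch of UTF-8-based
percent-encoding of a single character. The function name is my own and is
not taken from any spec; it only shows the byte-level mechanics, not which
characters ought to be encoded.

```python
def percent_encode_utf8(ch: str) -> str:
    """Percent-encode every UTF-8 byte of `ch` (illustrative helper only)."""
    # Encode the character to its UTF-8 byte sequence, then write each
    # byte as "%" followed by two uppercase hex digits.
    return "".join(f"%{byte:02X}" for byte in ch.encode("utf-8"))


print(percent_encode_utf8("\u00e9"))  # U+00E9, two bytes in UTF-8
print(percent_encode_utf8("\u20ac"))  # U+20AC, three bytes in UTF-8
```

With ASCII as the basis, neither of those characters could be represented
at all; with UTF-8, each one becomes a run of two or three escapes.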

There is no harm in specifying the use of UTF-8; in fact, it's generally a 
good thing. But there are two problems with the change to which characters 
need to be percent-encoded.

One is the phrasing "as amended by". A strict reading of this would be that 
only "amendments" count, which means RFC 2732 (which adds "[" and "]" to the 
reserved characters) is the only RFC 2396 update that qualifies. RFC 3986, 
which renders RFC 2396 obsolete (it supersedes, not amends), would not be 
applicable. I think the intent was to track whatever "reserved 
character"-defining spec is current. If that's the case, then the phrasing 
should be changed.

The other problem is more serious. RFC 2396 + RFC 2732 define these reserved 
characters:

  ; / ? : @ & = + $ , [ ]

...and RFC 3986 defines these:

  ! * ' ( ) ; : @ & = + $ , / ? # [ ]
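
For anyone who wants to compare the two sets mechanically, here is a small
Python sketch. The constant names are mine, and the sets are transcribed by
hand from my reading of the RFC grammars (gen-delims plus sub-delims in
RFC 3986 sec. 2.2), so treat them as an assumption to check against the
RFCs themselves.

```python
# "reserved" per RFC 2396 sec. 2.2, plus "[" and "]" added by RFC 2732.
RESERVED_2396_2732 = set(";/?:@&=+$,[]")

# "reserved" per RFC 3986 sec. 2.2: gen-delims followed by sub-delims.
RESERVED_3986 = set(":/?#[]@") | set("!$&'()*+,;=")

# Characters that only the newer RFC treats as reserved.
newly_reserved = "".join(sorted(RESERVED_3986 - RESERVED_2396_2732))
print(newly_reserved)

# The older set is wholly contained in the newer one.
print(RESERVED_2396_2732 <= RESERVED_3986)
```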
  
If the set of characters to percent-encode is left defined as just "reserved + 
non-ASCII", then a significant number of ASCII-range characters will remain 
unescaped: control characters, space, various non-alphanumerics, and, most 
deadly, "%"! The new definition of which characters to percent-encode is thus 
too narrow and needs to be expanded so that it is closer to the original 
definition, which said that all "non-alphanumeric" characters were to be 
escaped. My suggestion is to change it to say that all characters except those 
defined as "unreserved" by RFC 2396 or its successors must be percent-encoded. 
Unreserved characters are those that never need to be percent-encoded, and 
that have "equivalent" semantics when they are percent-encoded.
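
The difference between the two rules can be sketched in Python as follows.
This is only an illustration under stated assumptions: the function and
constant names are mine, the "unreserved" set is the one from RFC 3986
sec. 2.3 (RFC 2396 has a slightly larger one), and the sketch deliberately
leaves out the other serialization steps of the media type, such as the
replacement of spaces by "+".

```python
import string

# Assumption: "unreserved" as defined by RFC 3986 sec. 2.3.
UNRESERVED = set(string.ascii_letters + string.digits + "-._~")
# Assumption: "reserved" as defined by RFC 3986 sec. 2.2.
RESERVED = set(":/?#[]@!$&'()*+,;=")


def narrow_rule(ch: str) -> bool:
    """Current wording: escape only reserved and non-ASCII characters."""
    return ch in RESERVED or ord(ch) > 127


def proposed_rule(ch: str) -> bool:
    """Proposed wording: escape everything that is not unreserved."""
    return ch not in UNRESERVED


def encode(text: str, must_escape) -> str:
    """Percent-encode, via UTF-8, each character selected by must_escape."""
    out = []
    for ch in text:
        if must_escape(ch):
            out.append("".join(f"%{b:02X}" for b in ch.encode("utf-8")))
        else:
            out.append(ch)
    return "".join(out)


sample = "100% sure"
print(encode(sample, narrow_rule))    # "%" and space pass through untouched
print(encode(sample, proposed_rule))  # "%" and space are both escaped
```

Under the narrow rule a literal "%" in the data is indistinguishable from
the start of an escape sequence once the string is serialized, which is why
I call it the most dangerous omission.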

Since RFCs 2396 and 2732 are now obsolete, perhaps RFC 3986 should provide 
the definition of "unreserved" instead.

-Mike
Received on Wednesday, 23 February 2005 03:15:17 UTC