strangeness in definition of x-www-form-urlencoded from Adam M. Costello on 2003-09-22 (www-html-editor@w3.org from July to September 2003)

From: Adam M. Costello <amc+e9hp4g@nicemice.net>
Date: Mon, 22 Sep 2003 00:34:01 +0000
To: www-html-editor@w3.org
Message-ID: <20030922003401.GA26757@nicemice.net>

I am concerned about the definition of
application/x-www-form-urlencoded.  HTML 2.0 and HTML 4.01 both say:

    space characters are replaced by `+', and then reserved characters
    are escaped as described in RFC 1738: non-alphanumeric characters
    are replaced by `%HH'...

Which is it, reserved characters or non-alphanumeric characters?  Either
way, the specified process is not reversible, because it performs %HH
escaping *after* changing spaces to plus-signs.  For example, the values
"foo+bar" and "foo bar" map to the same thing, either "foo+bar" (if
plus-sign is not escaped), or "foo%2Bbar" (if plus-sign is escaped).
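
The collision can be reproduced with a minimal Python sketch of the
procedure as literally worded. The function name spec_encode and the
escape_plus flag are illustrative only and appear in no spec; input is
assumed to be ASCII, and "non-alphanumeric" is read as anything outside
A-Z, a-z, 0-9.

```python
import re

def spec_encode(value, escape_plus):
    """Encode in the order the spec text gives (hypothetical helper)."""
    # Step 1: space characters are replaced by '+'.
    value = value.replace(" ", "+")
    # Step 2: %HH-escape every character outside the kept set.
    # The two readings differ only in whether '+' is kept or escaped.
    keep = "A-Za-z0-9" if escape_plus else "A-Za-z0-9+"
    return re.sub("[^" + keep + "]",
                  lambda m: "%%%02X" % ord(m.group()),
                  value)

for flag in (False, True):
    # Both inputs yield the same output under either reading.
    print(flag, spec_encode("foo bar", flag), spec_encode("foo+bar", flag))
```

Under either reading the two inputs are indistinguishable after
encoding, so no decoder can recover which one was submitted.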

As far as I know, browsers always violate the spec and do something
reversible instead: they do the %HH escaping *before* changing spaces
to plus-signs, and they include plus-sign in the set of characters to
be escaped.  That way, the server can distinguish between "foo%2Bbar"
(which means "foo+bar") versus "foo+bar" (which means "foo bar").
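
A minimal sketch of that reversible order, in Python. The names
browser_encode and browser_decode are hypothetical, input is assumed to
be ASCII, and the kept set here is just alphanumerics plus space, which
is narrower than what real browsers leave unescaped.

```python
import re

def browser_encode(value):
    """Escape first, then turn spaces into '+' (hypothetical helper)."""
    # Step 1: %HH-escape everything except alphanumerics and space,
    # so literal '+' and '%' are escaped here.
    escaped = re.sub("[^A-Za-z0-9 ]",
                     lambda m: "%%%02X" % ord(m.group()),
                     value)
    # Step 2: any space still present was a real space in the input.
    return escaped.replace(" ", "+")

def browser_decode(encoded):
    """Undo browser_encode: '+' back to space, then undo %HH."""
    spaced = encoded.replace("+", " ")
    return re.sub("%([0-9A-Fa-f]{2})",
                  lambda m: chr(int(m.group(1), 16)),
                  spaced)

print(browser_encode("foo+bar"))  # foo%2Bbar
print(browser_encode("foo bar"))  # foo+bar
```

Because a '+' in the output can only have come from a space, decoding
in the opposite order recovers the original value.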

Am I correctly understanding the spec, that the specified encoding is
non-reversible?  Is my observation about browsers accurate, that in
practice they always use a reversible encoding?  Should this discrepancy
be addressed in some W3C note?

The XForms draft resolves the reserved/non-alphanumeric question, but
retains the non-reversibility:

    space characters are replaced by +, and then non-ASCII and reserved
    characters (as defined by [RFC 2396] as amended by subsequent
    documents in the IETF track) are escaped by replacing the character
    with one or more octets of the UTF-8 representation of the
    character, with each octet in turn replaced by %HH...
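
A rough Python sketch of the quoted XForms wording shows the same
collision. It is only a sketch under stated assumptions: the name
xforms_encode is hypothetical, the reserved set is the one listed in
RFC 2396 without later amendments, and characters that are neither
reserved nor non-ASCII are passed through unchanged because the quoted
text does not say what happens to them.

```python
# Reserved characters as listed in RFC 2396 (assumption: no amendments).
RESERVED_2396 = ";/?:@&=+$,"

def xforms_encode(value):
    """Follow the quoted order: spaces to '+', then escape (hypothetical)."""
    value = value.replace(" ", "+")
    out = []
    for ch in value:
        if ch in RESERVED_2396 or ord(ch) > 127:
            # Escape each octet of the character's UTF-8 form as %HH.
            out.append("".join("%%%02X" % b for b in ch.encode("utf-8")))
        else:
            out.append(ch)
    return "".join(out)

print(xforms_encode("foo bar"))   # foo%2Bbar
print(xforms_encode("foo+bar"))   # foo%2Bbar
print(xforms_encode("caf\u00e9")) # caf%C3%A9
```

The UTF-8 step settles how non-ASCII characters are escaped, but the
space and the plus-sign still collapse to the same output.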

AMC

Received on Sunday, 21 September 2003 20:37:55 UTC