application/x-www-form-urlencoded from Øyvind B. Fredriksen on 2001-01-31 (www-html-editor@w3.org from January to March 2001)

From: Øyvind B. Fredriksen <OyvindBF@softinn.no>
Date: Wed, 31 Jan 2001 21:44:12 +0100
To: "'www-html-editor@w3.org'" <www-html-editor@w3.org>
Cc: "'ij@w3.org'" <ij@w3.org>
Message-ID: <85CD4C9325C6D311840E0004AC9301C61ADC7B@pat.softinn.no>

I would like to draw your attention to the HTML 4.01 specification, section
17.13.3 Processing form data, the subsection entitled
"application/x-www-form-urlencoded", the item numbered 1. I have some
problems with the passage "... then reserved characters are escaped as
described in [RFC1738], section 2.2: Non-alphanumeric characters are
replaced by `%HH', a percent sign and two hexadecimal digits representing
the ASCII code of the character."

At first, I interpreted the sentence following the colon as an explication
of "reserved characters are escaped as described in [RFC1738], section 2.2",
allowing the reader not to consult RFC1738 while still getting a correct and
complete specification. With this interpretation I read "Non-alphanumeric
characters are replaced by `%HH' ..." as "All non-alphanumeric characters
are replaced by `%HH' ...". In particular, I had expected the `+' characters
that replace the space characters to be escaped in this way as well.

However, I discovered that Microsoft Internet Explorer 5.5 did not escape
`+' characters. (Of course, this might be a bug ...) And then I consulted
RFC1738, section 2.2, which distinguishes between 4 sets of
characters/octets:
1. Those with no corresponding graphic US-ASCII character (octets 00-1F and
7F-FF hexadecimal)
2. The unsafe ones (" ", "<", ">", '"', "#", "%", "{", "}", "|", "\", "^",
"~", "[", "]", and "`")
3. Those reserved for some other interpretation within the particular scheme
4. All other characters.
Set 3 (the reserved characters) is a subset of ";", "/", "?", ":", "@", "=",
"&". In this case, the "scheme" is HTTP, and all of these characters are
reserved (at least within queries). This means that set 4 (the "other"
characters) comprises the alphanumeric ones and "$", "-", "_", ".", "+",
"!", "*", "'", "(", ")" and ",".

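The four sets above can be sketched in Python as follows. This is only an
illustration of the classification as read here from RFC1738, section 2.2;
the names UNSAFE, RESERVED and classify are hypothetical and come from
neither the RFC nor the HTML specification.

```python
# Sets 2 and 3 as listed above (assumed reading of RFC1738, section 2.2).
UNSAFE = set(' <>"#%{}|\\^~[]`')
RESERVED = set(";/?:@=&")

def classify(octet: int) -> int:
    """Return the set number (1-4) for a single octet value 0x00-0xFF."""
    if octet <= 0x1F or octet >= 0x7F:
        return 1  # no corresponding graphic US-ASCII character
    ch = chr(octet)
    if ch in UNSAFE:
        return 2
    if ch in RESERVED:
        return 3
    return 4  # everything else

# Non-alphanumeric members of set 4, in ASCII order.
OTHER_NON_ALNUM = "".join(
    chr(o) for o in range(0x100)
    if classify(o) == 4 and not chr(o).isalnum()
)
print(OTHER_NON_ALNUM)  # prints !$'()*+,-._
```

The printed characters are the same eleven listed for set 4 above, only in
ASCII order.
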
RFC1738 states that
- The unsafe characters (2) and those with no corresponding graphic US-ASCII
(1) must be escaped.
- The reserved characters (3) must not be escaped when used for their
reserved purpose.
- The reserved characters (3) must be escaped otherwise.
- The other characters (4) may be escaped, but need not be.
(We may assume that reserved characters in form data are not used for their
reserved purpose (in HTTP URIs), so they must always be escaped.)
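
A minimal Python sketch of these rules, assuming that reserved characters in
form data are never used for their reserved purpose and are therefore always
escaped. The function name escape_rfc1738 is hypothetical, and the sketch
deliberately leaves out the earlier HTML step in which spaces are replaced
by `+'.

```python
UNSAFE = set(' <>"#%{}|\\^~[]`')   # set 2
RESERVED = set(";/?:@=&")          # set 3

def escape_rfc1738(data: bytes) -> str:
    """Escape sets 1-3 as %HH; leave the "other" characters (set 4) alone."""
    out = []
    for octet in data:
        ch = chr(octet)
        must_escape = (
            octet <= 0x1F or octet >= 0x7F   # set 1
            or ch in UNSAFE                  # set 2
            or ch in RESERVED                # set 3, never used as reserved here
        )
        out.append(f"%{octet:02X}" if must_escape else ch)
    return "".join(out)

print(escape_rfc1738(b"a+b=c&d"))  # prints a+b%3Dc%26d
```

Under this reading a `+' in the data passes through unescaped, since it
belongs to set 4.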

So what is then meant by "reserved characters are escaped as described in
[RFC1738], section 2.2"? As the concept of a "reserved character" is not
defined in the HTML specification, it must be interpreted as defined in
RFC1738, section 2.2. But it is clearly not sufficient to escape just those
characters. At least the unsafe character "%" should be escaped. (Otherwise,
"%HH" could not be decoded uniquely when HH are hex digits.)

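The decoding problem with an unescaped "%" can be shown with a short Python
sketch. It uses urllib.parse.unquote from the Python standard library as a
stand-in for a generic %HH decoder; the variable names are illustrative only.

```python
from urllib.parse import unquote

literal = "%41"  # the user typed a percent sign followed by "41"

sent_unescaped = literal                      # "%" left as it is
sent_escaped = literal.replace("%", "%25")    # "%" escaped as %25

print(unquote(sent_unescaped))  # prints A    (the original text is lost)
print(unquote(sent_escaped))    # prints %41  (the original text is recovered)
```
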
So, was my original interpretation correct after all (and Microsoft in
error), or should the HTML specification have read something like "... then
all non-alphanumeric characters except "$", "-", "_", ".", "+", "!", "*",
"'", "(", ")" and "," are escaped as described in [RFC1738], section 2.2:
They are replaced by ..."?
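
The two readings can be compared side by side with a small Python sketch.
The helper escape and its keep parameter are hypothetical, the input is
assumed to be pure ASCII, and the step that turns spaces into `+' is again
left out.

```python
# Characters left unescaped under the proposed wording.
KEEP_PROPOSED = set("$-_.+!*'(),")

def escape(text: str, keep: set) -> str:
    """Escape every non-alphanumeric ASCII character not listed in keep."""
    out = []
    for octet in text.encode("ascii"):  # assumption: ASCII-only input
        ch = chr(octet)
        if ch.isalnum() or ch in keep:
            out.append(ch)
        else:
            out.append(f"%{octet:02X}")
    return "".join(out)

sample = "1+1=2 (ok)"
strict = escape(sample, keep=set())            # "all non-alphanumeric" reading
proposed = escape(sample, keep=KEEP_PROPOSED)  # reading with the exceptions

print(strict)    # prints 1%2B1%3D2%20%28ok%29
print(proposed)  # prints 1+1%3D2%20(ok)
```

Only the first reading escapes `+'. If spaces have already been replaced by
`+' before this step, the second reading would leave a `+' from the data
indistinguishable from one that stands for a space.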

 - Øyvind

Øyvind Bolme Fredriksen
Software Innovation ASA
P.O.Box 100 Kokstad
N-5863  BERGEN  
NORWAY
Tel. (+47) 55 98 74 20
Fax (+47) 55 98 74 40

Received on Wednesday, 31 January 2001 15:46:08 UTC