- From: Øyvind B. Fredriksen <OyvindBF@softinn.no>
- Date: Wed, 31 Jan 2001 21:44:12 +0100
- To: "'www-html-editor@w3.org'" <www-html-editor@w3.org>
- Cc: "'ij@w3.org'" <ij@w3.org>
I would like to draw your attention to the HTML 4.01 specification, section 17.13.3 Processing form data, the subsection entitled "application/x-www-form-urlencoded", item 1. I have some problems with the passage

  "... then reserved characters are escaped as described in [RFC1738], section 2.2: Non-alphanumeric characters are replaced by `%HH', a percent sign and two hexadecimal digits representing the ASCII code of the character."

At first, I read the sentence following the colon as an explanation of "reserved characters are escaped as described in [RFC1738], section 2.2", one that lets the reader skip RFC1738 and still get a correct and complete specification. Under this interpretation, "Non-alphanumeric characters are replaced by `%HH' ..." means "All non-alphanumeric characters are replaced by `%HH' ...". In particular, I expected the `+' characters that replace the space characters to be escaped in this way as well.

However, I discovered that Microsoft Internet Explorer 5.5 does not escape `+' characters. (Of course, this might be a bug ...)

I then consulted RFC1738, section 2.2, which distinguishes between 4 sets of characters/octets:

1. Those with no corresponding graphic US-ASCII (00-1F and 7F-FF)
2. The unsafe ones (" ", "<", ">", """, "#", "%", "{", "}", "|", "\", "^", "~", "[", "]", and "`")
3. Those reserved for some other interpretation within the particular scheme
4. All other characters.

Set 3 (the reserved characters) is a subset of ";", "/", "?", ":", "@", "=", "&". In this case, the "scheme" is HTTP, and all of these characters are reserved (at least within queries). This means that set 4 (the "other" characters) comprises the alphanumeric ones and "$", "-", "_", ".", "+", "!", "*", "'", "(", ")" and ",".

RFC1738 states that

- The characters with no corresponding graphic US-ASCII (1) and the unsafe characters (2) must be escaped.
- The reserved characters (3) must not be escaped when used for their reserved purpose.
- The reserved characters (3) must be escaped otherwise.
- The other characters (4) may be escaped, but need not be.

(We may assume that reserved characters in form data are not used for their reserved purpose (in HTTP URIs), so they must always be escaped.)

So what is meant by "reserved characters are escaped as described in [RFC1738], section 2.2"? As the concept of a "reserved character" is not defined in the HTML specification, it must be interpreted as defined in RFC1738, section 2.2. But escaping just those characters is clearly not sufficient. At the very least, the unsafe character "%" must be escaped. (Otherwise, "%HH" could not be decoded unambiguously when HH are hex digits.)

So, was my original interpretation correct after all (and Microsoft in error), or should the HTML specification have read something like

  "... then all non-alphanumeric characters except "$", "-", "_", ".", "+", "!", "*", "'", "(", ")" and "," are escaped as described in [RFC1738], section 2.2: They are replaced by ..."?

- Øyvind

Øyvind Bolme Fredriksen
Software Innovation ASA
P.O.Box 100 Kokstad
N-5863 BERGEN
NORWAY
Tel. (+47) 55 98 74 20
Fax (+47) 55 98 74 40
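The two readings discussed in this message can be compared with a small Python sketch. It is only an illustration, not taken from either specification: the helper name `escape`, the sample string, and the choice to send non-ASCII input as UTF-8 bytes are all assumptions, and the earlier space-to-`+' step is deliberately left out so that only the `%HH' escaping differs.

```python
import string

ALNUM = string.ascii_letters + string.digits
# The "other" characters (set 4) of RFC 1738, section 2.2, as listed in the message.
RFC1738_OTHER = "$-_.+!*'(),"


def escape(value: str, keep: str = "") -> str:
    """Hypothetical helper: replace every byte that is neither
    alphanumeric nor listed in `keep` by %HH."""
    allowed = set((ALNUM + keep).encode("ascii"))
    return "".join(
        chr(b) if b in allowed else f"%{b:02X}"
        # Assumption: non-ASCII characters are sent as their UTF-8 bytes.
        for b in value.encode("utf-8")
    )


sample = "a+b=100%(x)"

# Reading 1: all non-alphanumeric characters are escaped.
print(escape(sample))                 # a%2Bb%3D100%25%28x%29

# Reading 2: the RFC 1738 "other" characters are left as they are.
print(escape(sample, RFC1738_OTHER))  # a+b%3D100%25(x)
```

In both readings "%" and "=" are escaped; the outputs differ only in how the set 4 characters ("+", "(" and ")" in this sample) are treated.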
Received on Wednesday, 31 January 2001 15:46:08 UTC