Re: URLencoding.

>Dave J Woolley wrote:
[snip]
> "Space characters are replaced by `+', and
> then reserved characters are escaped as described in [RFC1738],
       ^^^^^^^^ ^^^^^^^^^^
> section 2.2:
>        Non-alphanumeric characters are replaced by `%HH', a percent
> sign and two hexadecimal digits representing the ASCII code of the
> character. Line breaks are represented as "CR LF" pairs (i.e.,
> `%0D%0A').
>
> -- 17.13.4 Form content types
> http://www.w3.org/TR/1999/REC-html401-19991224/interact/forms.html#h-17.13.4.1



>That's clear enough, no?
> 0. convert mac/unix/whatever linebreak conventions to internet CRLF
> if necessary
> 1. replace all ' ' by +
> 2. replace everything but alphanumerics [a-zA-Z0-9] by %HH


Ahh...But therein lies the confusion...

Set aside the question of whether or not you % escape the + with which
you replaced the spaces (first point of confusion).
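
To show why that first point matters, here is a rough Python sketch of
the three-step recipe quoted above. The function name encode_literal and
the escape_plus flag are my own inventions for illustration; neither the
HTML spec nor the RFC defines them, and the sketch handles ASCII input
only.

```python
import re


def encode_literal(value: str, escape_plus: bool) -> str:
    """Apply steps 0-2 of the quoted recipe to one ASCII form value.

    escape_plus is a hypothetical switch: it selects whether the "+"
    produced by step 1 is itself %-escaped by step 2.
    """
    if not value.isascii():
        raise ValueError("this sketch only covers ASCII input")

    # Step 0: turn CRLF, lone CR, and lone LF into CRLF.
    value = re.sub(r"\r\n|\r|\n", "\r\n", value)

    out = []
    for ch in value:
        if ch == " ":
            # Step 1, plus the ambiguity: does step 2 see this "+"?
            out.append("%2B" if escape_plus else "+")
        elif ch.isalnum():
            # Alphanumerics [a-zA-Z0-9] pass through unchanged.
            out.append(ch)
        else:
            # Step 2: everything else becomes %HH (a literal "+" included).
            out.append("%%%02X" % ord(ch))
    return "".join(out)
```

With escape_plus=False, "a b+c" comes out as a+b%2Bc, so a receiver can
still tell a space from a literal plus. With escape_plus=True it comes
out as a%2Bb%2Bc, and the two are indistinguishable.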

If you read the referenced RFC1738, you discover that the term "reserved
characters" has a special meaning. Incidentally, the RFC1738 reference
is out of date since it has been superseded by RFC2396.

From RFC2396, section 2.2. Reserved Characters:
...
  reserved    = ";" | "/" | "?" | ":" | "@" | "&" | "=" | "+" |
                "$" | ","
...

From section 2.3. Unreserved Characters:
...
  unreserved  = alphanum | mark

  mark        = "-" | "_" | "." | "!" | "~" | "*" | "'" | "(" | ")"

  Unreserved characters can be escaped without changing the semantics
  of the URI, but this should not be done unless the URI is being used
                  ^^^^^^^^^^^^^^^^^^^^^^^
  in a context that does not allow the unescaped character to appear.
...

So now we have the second point of confusion: should the "mark"
characters be escaped or not? If yes, eliminate the reference to the RFC
and simply escape all non-alphanumeric characters. If no, reword the
text to make clear what is meant by reserved characters. Since the RFC
permits escaping unreserved characters, the first choice causes no
problem, but in either case section 17.13.4 should be clarified.
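
To make the two readings concrete, here is a small Python sketch. The
character sets are copied from the RFC2396 excerpts above; the function
names are hypothetical labels for the two choices, not anything the HTML
spec or the RFC defines. The space-to-plus step is deliberately left
out, and the input is assumed to be ASCII.

```python
import string

ALPHANUM = string.ascii_letters + string.digits
MARK = "-_.!~*'()"            # "mark" production, RFC2396 section 2.3
UNRESERVED = ALPHANUM + MARK  # "unreserved" production, same section


def _escape(value: str, keep: str) -> str:
    # Replace every character not listed in `keep` with %HH.
    # Assumes ASCII input, so each code fits in two hex digits.
    return "".join(ch if ch in keep else "%%%02X" % ord(ch) for ch in value)


def escape_all_non_alnum(value: str) -> str:
    """First reading: escape everything except [a-zA-Z0-9]."""
    return _escape(value, ALPHANUM)


def escape_except_unreserved(value: str) -> str:
    """Second reading: also leave the RFC2396 "mark" characters alone."""
    return _escape(value, UNRESERVED)
```

For the value it's_ok.txt the first reading gives it%27s%5Fok%2Etxt,
while the second leaves it unchanged. Both readings agree on a value
such as a&b=c, which becomes a%26b%3Dc either way.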


--Dave Bridger

Received on Sunday, 9 April 2000 22:45:42 UTC