Re: [Need advice] When to decode '+' to ' '? from Mike Brown on 2007-07-21 (uri@w3.org from July 2007)

From: Mike Brown <mike@skew.org>
Date: Sat, 21 Jul 2007 13:40:00 -0600 (MDT)
To: Sebastian Pipping <webmaster@hartwork.org>
CC: uri@w3.org
Message-Id: <200707211940.l6LJe23E011479@chilled.skew.org>

I think you're confusing general percent-encoding in URIs with the rules for 
producing application/x-www-form-urlencoded data. They're related, but 
distinct.

In any case, sender and receiver must agree; if you (the receiver) know the 
data is of the application/x-www-form-urlencoded media type, then you should 
not be blindly applying the modern, general RFC 3986 percent-encoding rules to 
it to interpret it. You must decode it using the reverse of the encoding 
process.
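
To make the distinction concrete, here is a small sketch using Python's 
standard urllib.parse module. The sample string is made up for illustration; 
the point is only that the two decoders disagree about what "+" means.

```python
from urllib.parse import unquote, unquote_plus

# Hypothetical sample input, chosen to contain "+", "%2B" and "%20".
raw = "a+b%2Bc%20d"

# General percent-decoding: "+" has no special meaning and is left alone.
generic = unquote(raw)

# Form decoding: "+" becomes a space first, then %XX sequences are decoded.
form = unquote_plus(raw)

print(repr(generic))  # 'a+b+c d'
print(repr(form))     # 'a b+c d'
```

Which of the two is right depends entirely on what the sender produced, not 
on anything visible in the string itself.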

As described in the HTML specs, such data is divided into "&"-separated 
"name=value" pairs; it uses "+" instead of "%20" for space, has had newlines 
normalized to "%0D%0A", and has had "non-alphanumeric"/"reserved" characters 
percent-encoded. This section of the specs predates HTML becoming 
Unicode-friendly, so there is a great deal of ambiguity about exactly which 
characters are percent-encoded and how, but in practice, implementations 
generally align with RFC 3986 when deciding which characters to encode.

So, to encode a set of name-value pairs (character data from an HTML form):

1. In each name and value, normalize each newline (a lone CR, a lone LF, or a 
CR+LF pair) to CR+LF, which ends up percent-encoded as "%0D%0A".

2. In each name and value, encode each space as "+", and percent-encode any 
other character that would be ambiguous in a URI, especially "+", "&", and 
"=".

3. Insert "=" between each name and value, and "&" between each pair.
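
The three encoding steps above could be sketched in Python roughly as 
follows. The function name encode_form and its charset parameter are my own 
inventions for illustration, and letting quote_plus decide which characters 
to escape is an assumption about what "generally aligns with RFC 3986"; it is 
not taken from the HTML specs.

```python
import re
from urllib.parse import quote_plus

def encode_form(pairs, charset="utf-8"):
    """Encode (name, value) string pairs as form-urlencoded data.

    Hypothetical helper; charset is the assumed basis for percent-encoding
    non-ASCII characters.
    """
    def enc(text):
        # Step 1: normalize a lone CR, a lone LF, or CR+LF to CR+LF.
        text = re.sub(r"\r\n|\r|\n", "\r\n", text)
        # Step 2: space becomes "+"; "+", "&", "=", CR, LF and other
        # unsafe characters are percent-encoded.
        return quote_plus(text, encoding=charset)

    # Step 3: "=" between name and value, "&" between pairs.
    return "&".join(f"{enc(name)}={enc(value)}" for name, value in pairs)

encoded = encode_form([("q", "1+1 = 2"), ("note", "a\nb&c")])
print(encoded)  # q=1%2B1+%3D+2&note=a%0D%0Ab%26c
```

Note that the literal "+" in the first value comes out as "%2B", which is 
what keeps it distinguishable from an encoded space.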

To decode:

1. Obtain name-value pairs by splitting the data on "&" to get the pairs, and
then splitting each pair on "=" to separate the name from the value.

2. In each name and value, first replace each "+" with a space, and then 
replace each percent-encoded sequence with its corresponding character. The 
order matters: done the other way around, a "+" decoded from "%2B" would be 
turned into a space. This step may require the decoder to know what 
character encoding (e.g. ISO-8859-1 or UTF-8) was used as the basis for 
percent-encoding non-ASCII-range characters, which is often (in Web 
applications) just a guess.

3. If necessary, normalize CR+LF to the local newline convention.

After step 2 of the decode, any "+" characters you end up with are actual 
plus signs (they arrived as "%2B"), not spaces.
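
The decoding steps could be sketched in Python like this. Again, decode_form 
and its charset and newline parameters are hypothetical names of mine; the 
default of UTF-8 is exactly the kind of guess mentioned in step 2, and 
splitting on only the first "=" is a leniency I have assumed for input with 
a stray unencoded "=" in a value.

```python
from urllib.parse import unquote

def decode_form(data, charset="utf-8", newline="\n"):
    """Decode form-urlencoded data into a list of (name, value) pairs.

    Hypothetical helper; charset is a guess at what the sender used.
    """
    pairs = []
    # Step 1: split on "&" before decoding anything, so that an encoded
    # "%26" or "%3D" inside a value cannot be mistaken for a separator.
    for chunk in data.split("&"):
        if not chunk:
            continue  # tolerate empty segments such as a trailing "&"
        name, _, value = chunk.partition("=")

        decoded = []
        for part in (name, value):
            # Step 2: "+" to space first, then percent-decoding.
            text = unquote(part.replace("+", " "), encoding=charset)
            # Step 3: CR+LF to the local newline convention.
            decoded.append(text.replace("\r\n", newline))
        pairs.append(tuple(decoded))
    return pairs

decoded = decode_form("q=1%2B1+%3D+2&note=a%0D%0Ab%26c")
print(decoded)  # [('q', '1+1 = 2'), ('note', 'a\nb&c')]
```

The "+" in the decoded first value came from "%2B" and survives as a real 
plus sign, while the unencoded "+" characters became spaces.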

Sebastian Pipping wrote:
> Say I have a percent-encoded URI and decode
> all the percent blocks. The result will
> still carry '+'s representing spaces sometimes.
> What do I do with that? How do I fully decode
> the URI without hurting '+'s that do not represent
> spaces. Also for the other way around: Should
> I always percent-encode '+'s to save them?
> I didn't find anything about it in RFC 3986.
> 
> Any help appreciated. Thanks in advance!
> 
> 
> 
> Sebastian
>

Received on Saturday, 21 July 2007 19:40:38 UTC