- From: Mike Brown <mike@skew.org>
- Date: Sat, 21 Jul 2007 13:40:00 -0600 (MDT)
- To: Sebastian Pipping <webmaster@hartwork.org>
- CC: uri@w3.org
I think you're confusing general percent-encoding in URIs with the rules for producing application/x-www-form-urlencoded data. They're related, but distinct.

In any case, sender and receiver must agree; if you (the receiver) know the data is of the application/x-www-form-urlencoded media type, then you should not be blindly applying the modern, general RFC 3986 percent-encoding rules to it to interpret it. You must decode it using the reverse of the encoding process.

As described in the HTML specs, such data is divided into "&"-separated "name=value" pairs, it uses "+" instead of "%20" for space, has had newlines normalized to "%0D%0A", and has had "non-alphanumeric"/"reserved" characters percent-encoded. This section of the specs predates HTML becoming Unicode-friendly, so there is a great deal of ambiguity in exactly which characters are percent-encoded and how, but in practice, implementations generally align with RFC 3986 when deciding which characters to encode.

So, to encode a set of name-value pairs (character data from an HTML form):

1. In each name and value, encode each CR, LF, or CR+LF to "%0D%0A".
2. In each name and value, encode each space as "+", and percent-encode any other character that won't be unambiguous in a URI, especially "+", "&", and "=".
3. Insert "=" between each name and value, and "&" between each pair.

To decode:

1. Obtain name-value pairs by splitting on each "&" to get a pair, and then on "=" to separate the name from the value.
2. Replace each "+" with space and replace each percent-encoded sequence with its corresponding character. This may require the decoder to know what character encoding (e.g. ISO-8859-1 or UTF-8) was used as the basis for percent-encoding non-ASCII-range characters, which is often (in Web applications) just a guess.
3. If necessary, normalize CR+LF to the local newline convention.

After step 2 of the decode, any "+"s you end up with are actual plus signs, not spaces.
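
The encode and decode steps above can be sketched in Python roughly as follows. This is only an illustration, not a reference implementation: the function names encode_form and decode_form and the charset parameter are made up for this example, and defaulting to UTF-8 is an assumption (as noted above, the real encoding is often a guess).

```python
import re
from urllib.parse import quote, unquote


def encode_form(pairs, charset="utf-8"):
    """Encode (name, value) pairs as application/x-www-form-urlencoded."""
    def enc(text):
        # Encode step 1: normalize CR, LF, or CR+LF to CR+LF.
        text = re.sub(r"\r\n|\r|\n", "\r\n", text)
        # Encode step 2: percent-encode everything except unreserved
        # characters (this covers "+", "&", "=" and the CR+LF above),
        # keep spaces for now, then turn each space into "+".
        return quote(text, safe=" ", encoding=charset).replace(" ", "+")

    # Encode step 3: "=" between name and value, "&" between pairs.
    return "&".join(enc(name) + "=" + enc(value) for name, value in pairs)


def decode_form(data, charset="utf-8"):
    """Decode application/x-www-form-urlencoded data into (name, value) pairs."""
    def dec(text):
        # Decode step 2: "+" becomes space BEFORE percent-decoding, so a
        # "%2B" in the input survives as a literal plus sign.
        return unquote(text.replace("+", " "), encoding=charset)

    pairs = []
    # Decode step 1: split on "&", then on the first "=".
    for pair in data.split("&"):
        if not pair:
            continue
        name, _, value = pair.partition("=")
        pairs.append((dec(name), dec(value)))
    # Decode step 3 (newline normalization) is left to the caller.
    return pairs


encoded = encode_form([("a b", "1+1=2"), ("note", "x\ny&z")])
print(encoded)
print(decode_form(encoded))
```

The order inside dec is the important part: replacing "+" first and percent-decoding second is what keeps a real plus sign (sent as "%2B") distinct from a space. Python's standard library also provides urllib.parse.urlencode and urllib.parse.parse_qsl for this media type, which would normally be preferable to hand-rolled code.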

Sebastian Pipping wrote:
> Say I have a percent-encoded URI and decode
> all the percent blocks. The result will
> still carry '+'s representing spaces sometimes.
> What do I do with that? How do I fully decode
> the URI without hurting '+'s that do not represent
> spaces. Also for the other way around: Should
> I always percent-encode '+'s to save them?
> I didn't find anything about it in RFC 3986.
>
> Any help appreciated. Thanks in advance!
>
>
>
> Sebastian
>

Received on Saturday, 21 July 2007 19:40:38 UTC