- From: Mike Brown <mike@skew.org>
- Date: Tue, 20 Apr 2004 00:03:56 -0600 (MDT)
- To: uri@w3.org
I hate to post and then run off on vacation for 3 weeks (I'm going to
western Japan, seeing as much as I can from Ibusuki to Tokyo, if anyone
wants to show me around...), and I am sorry to rehash this same old
stuff, but...

In order to implement and document RFC 2396bis-related functions in
Python, I am trying to write up an idiot-proof description of the
process of creating URI components from arbitrary data. This is proving
to be quite difficult, for even in draft 05, percent-encoding is still
being explained in bits & pieces across multiple sections of the spec,
and seems to be incomplete in terms of describing existing practice,
prescribing recommendations, and accounting for all possible
interpretations of the data.

Here is how I would describe the process to somebody today, based on my
understanding of draft 04, my attempts to implement it, and my
recollection of what I've seen in practice. Please let me know where I
got it wrong. I'm probably misunderstanding something again.

---------------

Each URI component must ultimately be constructed as a sequence of
Unicode characters. Therefore, in order to represent arbitrary data in a
URI component, one must first determine whether the data is character
based, and if so, whether it is a Unicode string or an encoded string.
Then, the following guidelines apply:

A. Non-character data
=====================

When the data is not character based, it must be converted to either
characters or octets, unless it is in octet form already. The spec
governing the URI scheme or data format should mandate how the data is
to be converted, but it may also be an implementation-dependent
decision.

If the data is converted to characters, then proceed according to (B)
or (C), below.

If the data is converted to or was already in octet form, then each
octet becomes a percent-encoded sequence representing that octet.
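A minimal Python sketch of the two conversions just described (octets
to percent-encoded sequences, or octets to characters). The function
names are hypothetical, and Base64 is only one possible choice of
character conversion; the governing spec may mandate something else.

```python
import base64


def percent_encode_octets(data: bytes) -> str:
    """Turn every octet into a percent-encoded sequence, with no exceptions."""
    return "".join(f"%{octet:02X}" for octet in data)


def to_base64_text(data: bytes) -> str:
    """Convert octets to characters using standard Base64 (an assumed choice)."""
    return base64.b64encode(data).decode("ascii")


# First 16 bytes of a GIF entity, in hex.
gif_header = bytes.fromhex("47 49 46 38 39 61 07 00 07 00 A2 00 00 00 00 00")

print(percent_encode_octets(gif_header))
# %47%49%46%38%39%61%07%00%07%00%A2%00%00%00%00%00

print(to_base64_text(gif_header))
# R0lGODlhBwAHAKIAAAAAAA==
```

The Base64 result is character data, so it still has to go through (B)
or (C). Note that the standard padding character "=" is in the reserved
set, so it would either be stripped or percent-encoded at that stage.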
Example: here are the first 16 bytes of a GIF entity, in hex:

  47 49 46 38 39 61 07 00 07 00 A2 00 00 00 00 00

Option 1: Convert the octets to percent-encoded sequences:

  %47%49%46%38%39%61%07%00%07%00%A2%00%00%00%00%00

Option 2: Convert to characters, such as with Base64:

  R0lGODlhBwAHAKIAAAAAAA

This can then be treated as character data, according to (B) or (C),
below.

B. ISO/IEC 10646 (Unicode) character data
=========================================

When the data consists of ISO/IEC 10646 (Unicode) characters, each
character is handled as follows:

1. Each character that is in the reserved set but that is not being
   used for its reserved purpose becomes a percent-encoded sequence
   representing that character's octet in US-ASCII.

2. Each character between U+0000 and U+007F that is in the unreserved
   set either:
   a. does not change (the preferred outcome), or
   b. becomes a percent-encoded sequence representing the character's
      octet in the US-ASCII encoding.

3. Each character between U+0000 and U+007F that is in neither the
   reserved nor unreserved sets becomes a percent-encoded sequence
   representing that character's octet in US-ASCII.

4. Each character above U+007F becomes one or more percent-encoded
   sequences representing that character's octet(s) in UTF-8, unless a
   different encoding is mandated by the spec governing the URI scheme.
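Rules 1 through 4 can be sketched in Python roughly as follows. This is
only an illustration under stated assumptions: the function name and the
keep_reserved parameter are hypothetical, the reserved and unreserved
sets are the ones from RFC 3986 as eventually published (the drafts
discussed here may differ slightly), and UTF-8 is assumed for rule 4.

```python
# Character sets as defined in RFC 3986 (an assumption; drafts may differ).
UNRESERVED = set(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789-._~"
)
RESERVED = set(":/?#[]@!$&'()*+,;=")


def encode_component(text: str, keep_reserved: str = "") -> str:
    """Percent-encode a Unicode string for use in a URI component.

    keep_reserved lists the reserved characters that the caller is using
    for their reserved purpose and that must therefore stay literal.
    """
    out = []
    for ch in text:
        if ch in UNRESERVED:
            # Rule 2a: unreserved characters are left unchanged.
            out.append(ch)
        elif ch in RESERVED and ch in keep_reserved:
            # Reserved character used for its reserved purpose.
            out.append(ch)
        else:
            # Rules 1, 3 and 4: everything else becomes percent-encoded
            # UTF-8 octets (identical to US-ASCII below U+0080).
            out.extend(f"%{octet:02X}" for octet in ch.encode("utf-8"))
    return "".join(out)


print(encode_component("name=J\u00fcrgen", keep_reserved="="))
# name=J%C3%BCrgen
print(encode_component("100%"))
# 100%25
```

The caller, not the function, decides which reserved characters are
delimiters, since only the caller knows the intent. A literal "%" falls
under rule 3 and comes out as "%25" with no special case.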
For example, here's the Unicode string "greeting=" followed by an actual
greeting in Japanese:

  U+0067 U+0072 U+0065 U+0065 U+0074 U+0069 U+006E U+0067 U+003D
  U+4ECA U+65E5 U+306F

If the destination is the query component of a URI, and the intent is to
use the "=" for its reserved purpose -- that is, to express, for
example, in an 'http'-schemed URI a query argument consisting of the
name "greeting" and the 3 Japanese characters U+4ECA U+65E5 U+306F as
the argument's value, then one must first convert all but the "=" to
UTF-8 octets:

  g  r  e  e  t  i  n  g       =       U+4ECA    U+65E5    U+306F
  67 72 65 65 74 69 6E 67  (unchanged)  E4 BB 8A  E6 97 A5  E3 81 AF

Then convert the octets to unreserved characters and percent-encoded
sequences:

  greeting=%E4%BB%8A%E6%97%A5%E3%81%AF

If, however, the "=" were not being used for its reserved purpose, then
it would be percent-encoded:

  greeting%3D%E4%BB%8A%E6%97%A5%E3%81%AF

In either case, the unreserved characters ("greeting") could become
percent-encoded sequences without changing their meaning. For example,

  %67%72%65%65%74%69%6E%67=%E4%BB%8A%E6%97%A5%E3%81%AF

is equivalent to

  greeting=%E4%BB%8A%E6%97%A5%E3%81%AF

C. Encoded character data
=========================

When the data consists of characters that have already been encoded as
octets (i.e., any encoded character string, such as an iso-8859-1 or
windows-1252 byte string), one of the following courses of action is
chosen. The spec governing a URI scheme may mandate one course of action
over the other, but in practice, it is generally left up to the
implementation, a situation which leaves much room for misinterpretation
of the URI component down the line.

1. The octets are treated opaquely:

   Each octet that, in US-ASCII, represents an unreserved character
   becomes that character (this is the preferred outcome) or becomes a
   percent-encoded sequence representing that octet. Any other octet
   becomes a percent-encoded sequence representing that octet.

or...

2. The octets are treated as representing characters:

   Each octet in the data is decoded back into Unicode characters
   according to the known encoding of the data. If the encoding is not
   known, then UTF-8 should be assumed, but in practice, a platform
   default encoding is the common assumption. Such assumptions are only
   as reliable as the URI producer's knowledge of how the encoded string
   was generated. The Unicode sequence is then treated as in (B), above.

Example: A directory named "4 ÷ 3" on a non-Unicode filesystem might
manifest as an encoded string like this (in hex):

  34 20 F7 20 33

In the opaque scenario, the octets are converted to unreserved
characters and percent-encoded sequences directly, resulting in

  4%20%F7%203

In the decode-first scenario, the octets are converted to Unicode
characters (possibly incorrectly, if the encoding isn't known). Then,
following the procedure in (B), some of the characters become
percent-encoded sequences, resulting in this sequence, assuming UTF-8
was the basis for the percent-encoding of the division sign (U+00F7,
which is C3 B7 in UTF-8):

  4%20%C3%B7%203

------------------

(Also note that the way I have described things, there is no need to
specify that "%" needs to become "%25".)

-Mike
Received on Tuesday, 20 April 2004 02:03:54 UTC