- From: Mike Brown <mike@skew.org>
- Date: Sun, 27 Mar 2005 13:09:40 -0700 (MST)
- To: James Cerra <jfcst24_public@yahoo.com>
- CC: Mike Brown <mike@skew.org>, uri@w3.org

James Cerra wrote:
> I think I understand.  So say the program got the U+4F5B HAN IDEOGRAPH
> character, and the user wants to use UTF-16 as the character encoding for the
> bytes.  Then the program should:
>
> 1) Convert input character to bytes 0x4F 0x5B (step 1).

Yes. Really, when representing character data in a URI, it is preferable to
use UTF-8 at this stage. Unfortunately this advice comes a bit late; web
browsers submitting character-based HTML form data as
application/x-www-form-urlencoded use the encoding of the page containing the
form, and use numeric character references for unencodable characters. Binary
data (e.g. file upload) is, I believe, handled with the file bytes unchanged.

> 2a) Write the characters "%" "4" "F" for the first byte (step 2).
>   or
> 2b) Write character "O" (letter "Oh") for first input byte (step 2).

Right; "O" is preferred, because %4F = "O" in ASCII, and that is in the
unreserved set.

> 3) Write characters "%" "5" "B" for second input byte (step 2).

Right, because 0x5B = "[" in ASCII, and that's not in the unreserved set.

> 4) Output string of characters (step 3).
>
> So the function maps bytes to characters.  From there, the user can apply
> whatever character encoding they wish to the character string.  Right?

Yes, whatever encoding is appropriate for transmission.

> The optional (2b) step confuses me.  The octet %4F is in the ALPHA range,
> which the spec says should be converted to its ALPHA character [2].

The spec just means that since "O" is the normalized form of "%4F", they are
interchangeable and you can use either one, but since many applications out
there don't perform normalization, it is preferable to use "O".

> Take the U+4F5B HAN IDEOGRAPH unicode character again, and the transcription
> encoding UTF-16BE, for another example.  To get a URI fragment:
>
> 1) Character is not unreserved so don't pass through.
> 2) Convert to bytes in UTF-16: 0x4F 0x5B.
> 3) Percent encode bytes: %4F %5B.

No, you don't want to start out with #1 there, because then you're only
converting certain characters to bytes first. You'd run into the ambiguity
where U+004F = "O" or "%4F", and U+4F5B = "O%5B" or "%4F%5B", and the receiver
won't know whether a given "O" or "%4F" is supposed to be U+004F or part of a
UTF-16 sequence. I gave reasons for that in the previous email.

The first algorithm is the correct one.

  Characters -> bytes (UTF-8 preferred) -> unreserved chars and %xx sequences
or
  Bytes (not representing characters) -> unreserved chars and %xx sequences

> Finally, is encoding "X" what RFC 3986 (in section 2.5) calls
> "data format encoding"?

No, the spec does explain -- "(e.g., a document charset)". Like, if you put
your constructed URI into any kind of text document, that document will likely
have some encoding of its own, so you need to make sure you're aware of that
fact and aren't going to do something silly like (in Python):

  # 2 Unicode strings
  #
  starttag = u'<location>'
  endtag = u'</location>'

  # an ASCII byte string
  #
  uri = 'http://host/path/to/some%20file'

  # write out bytes
  #
  print starttag.encode('utf-16')
  print uri    ## argh! mixing encodings!
  print endtag.encode('utf-16')
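
For illustration, the first algorithm (characters -> bytes -> unreserved chars and %xx sequences) might be sketched roughly as follows in Python 3. This is not code from the thread: the names UNRESERVED, percent_encode_bytes and percent_encode_text are made up for this example, and the unreserved set is assumed to be the one listed in RFC 3986, section 2.3.

```python
# Unreserved characters (assumed here to be the set from RFC 3986, section 2.3).
# Iterating a bytes object yields ints, so this is a set of byte values.
UNRESERVED = frozenset(
    b"ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    b"abcdefghijklmnopqrstuvwxyz"
    b"0123456789-._~"
)


def percent_encode_bytes(data):
    """Map each byte to an unreserved character or a %XX triplet."""
    return "".join(
        chr(b) if b in UNRESERVED else "%{:02X}".format(b)
        for b in data
    )


def percent_encode_text(text, encoding="utf-8"):
    """Convert ALL characters to bytes first, then escape the bytes."""
    return percent_encode_bytes(text.encode(encoding))


print(percent_encode_text("\u4f5b", "utf-16-be"))  # O%5B
print(percent_encode_text("\u4f5b"))               # %E4%BD%9B
print(percent_encode_text("O", "utf-16-be"))       # %00O
```

Because every character goes through the byte conversion before escaping, U+004F in UTF-16BE comes out as "%00O" rather than a bare "O", so it cannot be confused with the first byte of U+4F5B.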
Received on Sunday, 27 March 2005 20:09:42 UTC