Re: Encoding URI From/To UTF-16 Questions from Mike Brown on 2005-03-27 (uri@w3.org from March 2005)

From: Mike Brown <mike@skew.org>
Date: Sun, 27 Mar 2005 13:09:40 -0700 (MST)
To: James Cerra <jfcst24_public@yahoo.com>
CC: Mike Brown <mike@skew.org>, uri@w3.org
Message-Id: <200503272009.j2RK9eme099278@chilled.skew.org>

James Cerra wrote:
> I think I understand.  So say the program got the U+4F5B HAN IDEOGRAPH 
> character, and the user wants to use UTF-16 as the character encoding for the 
> bytes.  Then the program should:
> 
> 1) Convert input character to bytes 0x4F 0x5B (step 1).

Yes. Really, when representing character data in a URI, it is preferable to 
use UTF-8 at this stage.  Unfortunately this advice comes a bit late; web 
browsers submitting character-based HTML form data as 
application/x-www-form-urlencoded use the encoding of the page containing the 
form, and use numeric character references for unencodable characters. Binary
data (e.g. file upload) is, I believe, handled with the file bytes unchanged.

> 2a) Write the characters "%" "4" "F" for the first byte (step 2).
> or
> 2b) Write character "O" (letter "Oh") for first input byte (step 2).

Right; "O" is preferred, because %4F = "O" in ASCII, and that is in the 
unreserved set.

> 3) Write characters "%" "5" "B" for second input byte (step 2).

Right, because 0x5B = "[" in ASCII, and that's not in the unreserved set.

> 4) Output string of characters (step 3).
> 
> So the function maps bytes to characters.  From there, the user can apply 
> whatever character encoding they wish to the character string.  Right?

Yes, whatever encoding is appropriate for transmission.

> The optional (2b) step confuses me.  The octet %4F is in the ALPHA range, 
> which the spec says should be converted to its ALPHA character [2].

The spec just means that since "O" is the normalized form of "%4F", they are 
interchangeable and you can use either one, but since many applications out 
there don't perform normalization, it is preferable to use "O".

> Take the U+4F5B HAN IDEOGRAPH unicode character again, and the transcription 
> encoding UTF-16BE, for another example.  To get a URI fragment:
> 
> 1) Character is not unreserved so don't pass through.
> 2) Convert to bytes in UTF-16: 0x4F 0x5B.
> 3) Percent encode bytes: %4F %5B.

No, you don't want to start out with #1 there because then you're only 
converting certain characters to bytes first. You'd run into the ambiguity 
where U+004F = "O" or "%4F", and U+4F5B = "O%5B" or "%4F%5B", and the receiver 
won't know whether a given "O" or "%4F" is supposed to be U+004F or part of
a UTF-16 sequence. I gave reasons for that in the previous email.
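
To make the ambiguity concrete, here is a rough Python 3 sketch of that second 
approach. The names flawed_encode and normalize are made up for illustration, 
and U+4F4F is just a code point I picked because both of its UTF-16BE bytes 
are 0x4F. It only assumes that a normalizer is allowed to decode %xx escapes 
of unreserved characters.

```python
import re

# Unreserved characters: ALPHA / DIGIT / "-" / "." / "_" / "~"
UNRESERVED_CHARS = (
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789-._~"
)

def flawed_encode(text):
    """Hypothetical helper: pass unreserved characters through untouched,
    and percent-encode the UTF-16BE bytes of every other character."""
    out = []
    for ch in text:
        if ch in UNRESERVED_CHARS:
            out.append(ch)
        else:
            out.extend("%{:02X}".format(b) for b in ch.encode("utf-16-be"))
    return "".join(out)

def normalize(uri_part):
    """Hypothetical helper: replace %xx escapes of unreserved characters
    with the characters themselves; leave other escapes in place."""
    def repl(match):
        ch = chr(int(match.group(1), 16))
        return ch if ch in UNRESERVED_CHARS else match.group(0).upper()
    return re.sub(r"%([0-9A-Fa-f]{2})", repl, uri_part)

# U+4F5B comes out as "%4F%5B", which normalizes to "O%5B".
han = normalize(flawed_encode("\u4f5b"))

# Two different inputs that become indistinguishable after normalization:
# "OO" is two U+004F characters, U+4F4F is one character (bytes 0x4F 0x4F).
same = normalize(flawed_encode("OO")) == normalize(flawed_encode("\u4f4f"))
```

Here han ends up as "O%5B" and same is True, so a receiver looking at "OO" 
cannot tell which of the two inputs was meant.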

The first algorithm is the correct one.

Characters -> bytes (UTF-8 preferred) -> unreserved chars and %xx sequences

or

Bytes (not representing characters) -> unreserved chars and %xx sequences
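
Both pipelines can be sketched in a few lines of Python 3. This is only an 
illustration, not a complete URI library: the function names are invented, 
and the unreserved set is the ALPHA / DIGIT / "-" / "." / "_" / "~" set from 
RFC 3986.

```python
# Byte values of the unreserved characters.
UNRESERVED = frozenset(
    b"ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    b"abcdefghijklmnopqrstuvwxyz"
    b"0123456789-._~"
)

def percent_encode_bytes(data):
    """Bytes -> unreserved chars and %xx sequences."""
    out = []
    for byte in data:
        if byte in UNRESERVED:
            out.append(chr(byte))               # e.g. 0x4F -> "O"
        else:
            out.append("%{:02X}".format(byte))  # e.g. 0x5B -> "%5B"
    return "".join(out)

def percent_encode_text(text, encoding="utf-8"):
    """Characters -> bytes (UTF-8 by default) -> percent-encoded string."""
    return percent_encode_bytes(text.encode(encoding))

utf16_form = percent_encode_text("\u4f5b", "utf-16-be")  # "O%5B"
utf8_form = percent_encode_text("\u4f5b")                # "%E4%BD%9B"
```

Every character goes through the byte conversion first, including the ASCII 
ones, so the receiver can reverse the process without guessing which parts 
were encoded.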

> Finally, is encoding "X" what RFC 3986 (in section 2.5) calls 
> "data format encoding?"

No, the spec does explain -- "(e.g., a document charset)". Like, if you put 
your constructed URI into any kind of text document, that document will likely 
have some encoding of its own, so you need to make sure you're aware of that 
fact and aren't going to do something silly like (in Python):

# 2 Unicode strings
#
starttag = u'<location>'
endtag = u'</location>'

# an ASCII byte string
#
uri = 'http://host/path/to/some%20file'

# write out bytes
#
print starttag.encode('utf-16')
print uri  ## argh! mixing encodings!
print endtag.encode('utf-16')
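
For contrast, one way to avoid the mix-up is to treat the URI as text like 
everything else, assemble the whole document as a single character string, 
and apply the document encoding exactly once. A minimal sketch in Python 3 
syntax (the snippet above is Python 2); the variable names document and 
encoded are mine:

```python
# All three pieces are text (character strings), including the URI,
# which contains only ASCII characters.
starttag = "<location>"
endtag = "</location>"
uri = "http://host/path/to/some%20file"

# Assemble the document as characters first...
document = "\n".join([starttag, uri, endtag]) + "\n"

# ...then encode the whole thing once, so every character, the URI
# included, ends up in the same encoding.
encoded = document.encode("utf-16")
```

Encoding once also means Python's "utf-16" codec writes a single byte order 
mark at the start, rather than one for each separately encoded piece.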

Received on Sunday, 27 March 2005 20:09:42 UTC