Re: Encoding URI From/To UTF-16 Questions from James Cerra on 2005-03-27 (uri@w3.org from March 2005)

From: James Cerra <jfcst24_public@yahoo.com>
Date: Sun, 27 Mar 2005 01:45:29 -0800 (PST)
To: Mike Brown <mike@skew.org>
Cc: uri@w3.org
Message-ID: <20050327094529.50995.qmail@web42206.mail.yahoo.com>
Mike,

Thanks for the explanation.  It was incredibly enlightening, but I'm still a 
little confused.

> Then figure out how best to map them URI characters. This really depends on 
> what URI scheme you are producing for, and/or what kind of data you are 
> representing...
> ...so your attempt to make a generic API for this is going to be a general 
> solution that may require the caller to know what they are doing and only 
> call your code when it is really needed.

This is part of the project's specs.  I have to extend an RDF-producing 
framework [1] with string processing functions.  The strings produced are 
concatenated and eventually become a URI.  The author of the URI generally 
will know which sections need to be percent-encoded.  I'm simply writing the 
method to perform that encoding. So when you wrote:

] But let's just say for now that in your API you know you're starting with a 
] set of arbitrary bytes and you're going to prepare a fragment of a URI from 
] them, and this will be done in a manner that will "just work" 80% of the 
] time, regardless of the requirements of specific schemes and contexts. You 
] can do this.

That applies to my situation.

> Stop. The unreserved set is a set of characters, not bytes.

This confused me in the spec.  When it said, "For consistency, percent-encoded 
octets in the ranges of..." I assumed that the bytes specified were the 
unreserved set.  I _think_ that I understand my mistake now.

> 1) Input data: characters? Convert to bytes (UTF-8, preferably). Bytes? take 
>    as-is. Output: Unicode character buffer.
> 2) Input bytes corresponding, in ASCII, to characters in the unreserved set:
>    write as characters to output buffer (0x41 -> \u0041)
>    Other input bytes: write as percent-encoded octets 
>    (0x20 -> \u0025\u0032\u0030)
> 3) Serialize string buffer as an ASCII encoded string, or whatever is useful 
>    to you.

I think I understand.  So say the program got the U+4F5B HAN IDEOGRAPH 
character, and the user wants to use UTF-16 as the character encoding for the 
bytes.  Then the program should:

1) Convert input character to bytes 0x4F 0x5B (step 1).

2a) Write the characters "%" "4" "F" for the first byte (step 2).
or
2b) Write character "O" (letter "Oh") for first input byte (step 2).

3) Write characters "%" "5" "B" for second input byte (step 2).

4) Output string of characters (step 3).

So the function maps bytes to characters.  From there, the user can apply 
whatever character encoding they wish to the character string.  Right?
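
A minimal Python sketch of the byte-to-character mapping above, assuming 
option (2b): any byte whose ASCII value is an unreserved character is written 
as that character, and every other byte becomes a %XX triplet.  The names 
percent_encode_bytes and UNRESERVED are made up for illustration and are not 
part of any existing API.

```python
import string

# Unreserved set from RFC 3986, Section 2.3, held as ASCII byte values.
UNRESERVED = frozenset(
    (string.ascii_letters + string.digits + "-._~").encode("ascii")
)


def percent_encode_bytes(data: bytes) -> str:
    """Map each input byte to one character or to a %XX triplet."""
    out = []
    for b in data:  # iterating over bytes yields integers 0..255
        if b in UNRESERVED:
            out.append(chr(b))               # e.g. 0x41 -> "A"
        else:
            out.append("%{:02X}".format(b))  # e.g. 0x20 -> "%20"
    return "".join(out)


# U+4F5B in UTF-16BE is the byte pair 0x4F 0x5B.
print(percent_encode_bytes("\u4f5b".encode("utf-16-be")))  # prints O%5B
```

Under this byte-oriented reading the first byte comes out as the letter "O", 
which is exactly the behaviour that step (2b) describes.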

The optional (2b) step confuses me.  The octet %4F is in the ALPHA range, 
which the spec says should be converted to its ALPHA character [2].  Unless 
they mean "unreserved characters NOT bytes," in which case it can't be 
converted to "O" and step (2b) is wrong (shouldn't be there).

If (2b) is wrong, I presume the general encoding steps, when beginning with 
characters, are as follows (assuming all reserved characters must be 
percent-encoded in this URI fragment):

1) If character is unreserved, pass through.  Otherwise:
2) Convert character to bytes in encoding "X".
3) Percent encode each byte (%xx %yy ...) and pass those characters through.

So this algorithm converts characters to characters.

Take the U+4F5B HAN IDEOGRAPH Unicode character again, with UTF-16BE as the 
character encoding, for another example.  To get a URI fragment:

1) Character is not unreserved so don't pass through.
2) Convert to bytes in UTF-16BE: 0x4F 0x5B.
3) Percent encode bytes: %4F %5B.
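
The character-oriented steps and the worked example above can be sketched in 
Python as follows.  percent_encode_chars and UNRESERVED_CHARS are hypothetical 
names.  The sketch assumes that every character outside the unreserved set is 
to be percent-encoded, and that the caller passes a BOM-free codec name such 
as "utf-16-be" (Python's plain "utf-16" codec would prepend a byte order mark 
to each character's bytes).

```python
import string

# Unreserved set from RFC 3986, Section 2.3, held as characters.
UNRESERVED_CHARS = frozenset(string.ascii_letters + string.digits + "-._~")


def percent_encode_chars(text: str, encoding: str) -> str:
    """Pass unreserved characters through; percent-encode all others."""
    out = []
    for ch in text:
        if ch in UNRESERVED_CHARS:
            out.append(ch)  # step 1: unreserved, pass through
        else:
            # steps 2 and 3: convert to bytes, then write %XX per byte
            out.extend("%{:02X}".format(b) for b in ch.encode(encoding))
    return "".join(out)


print(percent_encode_chars("\u4f5b", "utf-16-be"))  # prints %4F%5B
print(percent_encode_chars("a b", "utf-8"))         # prints a%20b
```

Because the unreserved test is applied to the character and not to its bytes, 
the 0x4F byte of U+4F5B is never written as the letter "O".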

So the output is the string "%4F%5B".  The octet "%4F" cannot be converted to 
the letter "O" ("Oh") since in UTF-16BE, the letter is "%00%4F".  Also, since 
UTF-16BE encodes 2 bytes at a time, the string "%XX%00%4F%5B" (where %XX is 
anything) can never be normalized with an "O" in the middle since two octets 
must be decoded at a time (that is, when decoding a URI, octets must be read 
in according to that encoding's "finite state table").
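
To illustrate the two-octets-at-a-time point, here is a small Python sketch 
that first percent-decodes a fragment to bytes and only then interprets the 
bytes in the chosen encoding.  decode_fragment is a hypothetical helper; 
urllib.parse.unquote_to_bytes is from the Python standard library.

```python
from urllib.parse import unquote_to_bytes


def decode_fragment(fragment: str, encoding: str) -> str:
    """Percent-decode to raw bytes, then decode those bytes as text."""
    raw = unquote_to_bytes(fragment)
    return raw.decode(encoding)


# In UTF-16BE the letter "O" is the octet pair 00 4F, and U+4F5B is 4F 5B.
decoded = decode_fragment("%00%4F%4F%5B", "utf-16-be")
print([hex(ord(ch)) for ch in decoded])  # prints ['0x4f', '0x4f5b']
```

The four octets are consumed as two pairs, so the lone 0x4F in the second 
pair is half of U+4F5B and not a letter by itself.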

I think the second algorithm is the one to use, but I'm not sure.

Finally, is encoding "X" what RFC 3986 (in Section 2.5) calls 
"data format encoding?"

> Lastly, I can't help but wonder if you're reinventing the wheel. RFC 3986 is 
> new and does change a few aspects of RFC 2396, but RFC 2396 based percent 
> encoding APIs have long been available in Java, and the differences between 
> 3986 and 2396 are not all that significant for the kind of work you're 
> doing. I'm sure every API could use some refinement, but it may not be 
> crucial for your application... people have been winging it for years and 
> years now, with half-baked APIs based on half-baked specifications...

None of the APIs that I investigated did what I needed.  They either work with 
only a limited set of encodings [3, 4] or encode/ignore characters that I 
don't want encoded/ignored [4, 5].  I searched several more references before 
deciding to implement this myself.

I appreciate your help.  Thanks!

--
Jimmy Cerra

P.S.  I rewrote this response several times as I came to understand your post.
Please excuse (or point out) any incongruities.

[1] http://www.wiwiss.fu-berlin.de/suhl/bizer/d2rmap/D2Rmap.htm

[2] RFC 3986, Sections 2.3 and 2.4.

[3] http://skew.org/xml/stylesheets/url-encode/

[4] http://www.w3.org/International/O-URL-code.html

[5] java.net.URLEncoder and java.net.URLDecoder


		
Received on Sunday, 27 March 2005 09:46:00 UTC