Encoding URI From/To UTF-16 Questions

From: James Cerra <jfcst24_public@yahoo.com>
Date: Wed, 23 Mar 2005 20:58:02 -0800 (PST)
Message-ID: <20050324045802.13500.qmail@web42204.mail.yahoo.com>
To: uri@w3.org

I'm writing a converter in Java that percent-encodes bytes outside the
UNRESERVED set according to RFC 3986 [1].  I have a few questions about
encoding to/from character encodings that are not ASCII-like, especially
UTF-16 BE/LE and other encodings that use more than one byte per character.

So far, here's the algorithm that I inferred from the spec:

1) Given an input byte stream and output byte stream.
2) If an input byte is in the UNRESERVED set [2] then write it unchanged
   to the output stream.
3) Otherwise write 0x25 [3] and then the two-digit hexadecimal form of
   the input byte, in ASCII, to the output stream.
4) Continue to the end of the stream.  The output stream is in ASCII.
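
The four steps above could be sketched in Java roughly as follows.  This is
only an illustration of the byte-at-a-time approach described here, not a
complete RFC 3986 implementation; the names BytePercentEncoder, isUnreserved
and encode are made up for the example, and uppercase hex digits are an
assumption (the steps do not say which case to use).

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: percent-encodes a raw byte stream one byte at a time.
final class BytePercentEncoder {
    private static final char[] HEX = "0123456789ABCDEF".toCharArray();

    // Step 2: unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
    static boolean isUnreserved(int b) {
        return (b >= 'a' && b <= 'z') || (b >= 'A' && b <= 'Z')
            || (b >= '0' && b <= '9')
            || b == '-' || b == '.' || b == '_' || b == '~';
    }

    // Steps 1-4: copy unreserved bytes, escape everything else as %XX.
    static void encode(InputStream in, OutputStream out) throws IOException {
        int b;
        while ((b = in.read()) != -1) {      // read() yields 0..255 or -1 at end
            if (isUnreserved(b)) {
                out.write(b);
            } else {
                out.write('%');              // 0x25
                out.write(HEX[b >> 4]);      // high nibble as an ASCII hex digit
                out.write(HEX[b & 0x0F]);    // low nibble as an ASCII hex digit
            }
        }
    }

    // Convenience wrapper for in-memory data; the result is pure ASCII.
    static String encode(byte[] input) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try {
            encode(new ByteArrayInputStream(input), out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new String(out.toByteArray(), StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        // ISO-8859-1 "café" is the bytes 63 61 66 E9
        System.out.println(encode("caf\u00e9".getBytes(StandardCharsets.ISO_8859_1))); // prints caf%E9
    }
}
```

Note that the encoder never looks at the character encoding of its input; it
sees bytes only, which is what the rest of this message is about.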

Now this works well for character encodings like ISO-8859-1, but leads to weird
results with other encodings.  Say I have to encode "google" from UTF-16 (LE). 
Then processing the input one byte at a time leads to:

    g%00o%00o%00g%00l%00e%00

Is this correct?  The specs say that one should encode to UTF-8 for textual
data [4].  But what about non-textual data?  And how should one interpret the
scheme component - i.e. "http://" - in a string starting from UTF-16?  Surely
the output shouldn't be "%00h%00t%00t%00p..."!
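
For comparison, here is a rough Java sketch of the two options for textual
data: percent-encoding the UTF-16 bytes directly, versus decoding the text
first, re-encoding it as UTF-8 as Section 2.5 suggests, and only then
percent-encoding.  The names TextPercentEncoder, percentEncode and encodeText
are hypothetical, and the sketch assumes the source charset is known in
advance.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper comparing raw-byte escaping with UTF-8 re-encoding.
final class TextPercentEncoder {
    private static final char[] HEX = "0123456789ABCDEF".toCharArray();

    static boolean isUnreserved(int b) {
        return (b >= 'a' && b <= 'z') || (b >= 'A' && b <= 'Z')
            || (b >= '0' && b <= '9')
            || b == '-' || b == '.' || b == '_' || b == '~';
    }

    // Escapes every byte that is not unreserved, regardless of its meaning.
    static String percentEncode(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte raw : bytes) {
            int b = raw & 0xFF;                  // treat the byte as unsigned
            if (isUnreserved(b)) {
                sb.append((char) b);
            } else {
                sb.append('%').append(HEX[b >> 4]).append(HEX[b & 0x0F]);
            }
        }
        return sb.toString();
    }

    // Decode with the known source charset, re-encode as UTF-8, then escape.
    static String encodeText(byte[] source, Charset sourceCharset) {
        String text = new String(source, sourceCharset);
        return percentEncode(text.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        byte[] utf16 = "google".getBytes(StandardCharsets.UTF_16LE);
        System.out.println(percentEncode(utf16));                           // prints g%00o%00o%00g%00l%00e%00
        System.out.println(encodeText(utf16, StandardCharsets.UTF_16LE));   // prints google
    }
}
```

This sketch escapes everything outside the unreserved set, so delimiters such
as ":" and "/" come out as %3A and %2F.  It is therefore only suitable for the
data inside a single component, not for a whole URI including its scheme.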

P.S.  I'm sorry if this is the wrong forum for this discussion.  I've made this
mistake before and it is embarrassing.  If there is a more appropriate place
for my questions, please direct me there (and/or answer my question)!  Thanks
in advance for understanding.

[1] http://www.gbiv.com/protocols/uri/rfc/rfc3986.html

[2] Inclusively the bytes: 0x61 - 0x7A, 0x41 - 0x5A, 0x30 - 0x39, 0x2D, 0x2E,
0x5F, 0x7E

[3] The "%" character in ASCII.

[4] Section 2.5

Received on Thursday, 24 March 2005 04:58:33 UTC
