- From: James Cerra <jfcst24_public@yahoo.com>
- Date: Wed, 23 Mar 2005 20:58:02 -0800 (PST)
- To: uri@w3.org
I'm writing a converter in Java that percent-encodes bytes outside the unreserved set, according to RFC 3986 [1]. I have a few questions about encoding to and from character encodings that are not ASCII-compatible, especially UTF-16 BE/LE and other encodings that use more than one byte per character.

So far, here is the algorithm that I inferred from the spec:

1) Take an input byte stream and an output byte stream.
2) If an input byte is in the unreserved set [2], write it to the output stream unchanged.
3) Otherwise, write 0x25 [3] followed by the two-digit hexadecimal value of the input byte, in ASCII, to the output stream.
4) Continue to the end of the stream. The output stream is in ASCII.

This works well for character encodings like ISO-8859-1, but leads to weird results with other encodings. Say I have to encode "google" from UTF-16 (LE). Processing the input one byte at a time leads to:

"g%00o%00o%00g%00l%00e%00"

Is this correct? The spec says that textual data should be encoded as UTF-8 [4]. But what about non-textual data? And how should one interpret the scheme component, i.e. "http://", in a string that starts out as UTF-16? Surely the output shouldn't be "h%00t%00t%00p%00..."!

P.S. I'm sorry if this is the wrong forum for this discussion. I've made this mistake before and it is embarrassing. If there is a more appropriate place for my questions, please direct me there (and/or answer my question)! Thanks in advance for understanding.

[1] http://www.gbiv.com/protocols/uri/rfc/rfc3986.html
[2] Inclusively the bytes: 0x61 - 0x7A, 0x41 - 0x5A, 0x30 - 0x39, 0x2D, 0x2E, 0x5F, 0x7E
[3] The "%" character in ASCII.
[4] Section 2.5
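
For reference, the byte-at-a-time algorithm described above can be sketched in Java roughly as follows. This is only an illustration, not the actual converter: the class name PercentEncoder and the methods isUnreserved, encode and encodeText are hypothetical names chosen for this example. The encodeText method shows one possible reading of Section 2.5, which is to decode the text into characters first and then percent-encode its UTF-8 bytes; it treats its argument as a single component's data, so it is not meant to be applied to a whole URI including the scheme and delimiters.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of the byte-wise algorithm from the message above.
final class PercentEncoder {
    private static final char[] HEX = "0123456789ABCDEF".toCharArray();

    // Unreserved set of RFC 3986, Section 2.3: ALPHA / DIGIT / "-" / "." / "_" / "~"
    static boolean isUnreserved(int b) {
        return (b >= 'A' && b <= 'Z') || (b >= 'a' && b <= 'z')
            || (b >= '0' && b <= '9')
            || b == '-' || b == '.' || b == '_' || b == '~';
    }

    // Steps 1-4: copy unreserved bytes, write every other byte as "%" + two hex digits.
    static String encode(byte[] input) {
        StringBuilder out = new StringBuilder(input.length * 3);
        for (byte raw : input) {
            int b = raw & 0xFF; // treat the byte as unsigned
            if (isUnreserved(b)) {
                out.append((char) b);
            } else {
                out.append('%').append(HEX[b >> 4]).append(HEX[b & 0x0F]);
            }
        }
        return out.toString();
    }

    // Assumption: textual component data is converted to UTF-8 before encoding.
    static String encodeText(String text) {
        return encode(text.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        // Raw UTF-16LE bytes: the low byte comes first, so each letter is followed by %00.
        System.out.println(encode("google".getBytes(StandardCharsets.UTF_16LE)));
        // prints g%00o%00o%00g%00l%00e%00

        // The same text routed through UTF-8 comes out unchanged.
        System.out.println(encodeText("google"));
        // prints google
    }
}
```

With UTF-16BE input the same encode method would instead yield "%00g%00o%00o%00g%00l%00e", since the zero byte precedes each letter in that byte order.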
Received on Thursday, 24 March 2005 04:58:33 UTC