- From: Kenji Kondo <kkondo@pacbell.net>
- Date: Mon, 30 Aug 1999 00:06:21 -0700
- To: <www-international@w3.org>
Hi all,

While studying the URLUTF8Encoder source code on the W3C web site (http://www.w3.org/International/URLUTF8Encoder.java), I ran into a question about its UTF-8 URL encoding.

URLUTF8Encoder always uses the 0xc0 prefix for the first byte of a multi-byte encoded character. Even a character above 0x07ff is encoded as %cX%XX%XX. I think this is wrong. In UTF-8, a character encoded as a 3-byte sequence must start with %eX, because the lead byte of a 3-byte sequence has the bit pattern 1110xxxx.

For example, the Unicode character \u65e5 should be "0xe6, 0x97, 0xa5" in UTF-8, but URLUTF8Encoder encodes it as "%c6%97%a5".

The decode method (unescape) doesn't check whether the UTF-8 is well-formed, so it can still decode "%c6%97%a5" to \u65e5. Other decoders that do check for well-formed UTF-8 may reject or misread the output of URLUTF8Encoder.

I think line 89 in URLUTF8Encoder.java has to be:

    sbuf.append(hex[0xe0 | (ch >> 12)]);

What do you think?
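For reference, here is a minimal sketch of the lead-byte rule described above. It is not the code from URLUTF8Encoder.java: the class name Utf8EscapeSketch and the methods encodeChar and appendEscaped are made up for illustration. It also assumes a single BMP character (no surrogate pairs) and escapes every byte, whereas a real URL encoder would normally leave unreserved ASCII characters as they are.

```java
public class Utf8EscapeSketch {
    private static final char[] HEX = "0123456789abcdef".toCharArray();

    // Append one byte value (0..255) as a lowercase %xx escape.
    private static void appendEscaped(StringBuilder sb, int b) {
        sb.append('%').append(HEX[(b >> 4) & 0x0f]).append(HEX[b & 0x0f]);
    }

    // Percent-encode one BMP character as UTF-8 (hypothetical helper).
    public static String encodeChar(char ch) {
        StringBuilder sb = new StringBuilder();
        if (ch <= 0x007f) {
            // 1 byte: 0xxxxxxx
            appendEscaped(sb, ch);
        } else if (ch <= 0x07ff) {
            // 2 bytes: 110xxxxx 10xxxxxx, lead byte prefix is 0xc0
            appendEscaped(sb, 0xc0 | (ch >> 6));
            appendEscaped(sb, 0x80 | (ch & 0x3f));
        } else {
            // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx, lead byte prefix is 0xe0
            appendEscaped(sb, 0xe0 | (ch >> 12));
            appendEscaped(sb, 0x80 | ((ch >> 6) & 0x3f));
            appendEscaped(sb, 0x80 | (ch & 0x3f));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // 0x65e5 is the character used as the example in this message.
        System.out.println(encodeChar((char) 0x65e5)); // prints %e6%97%a5
    }
}
```

With 0xc0 in place of 0xe0 in the 3-byte branch, the same input comes out as %c6%97%a5, which is the ill-formed sequence described above.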
Received on Monday, 30 August 1999 03:05:24 UTC