URLUTF8Encoder bug ? from Kenji Kondo on 1999-08-30 (www-international@w3.org from July to September 1999)

From: Kenji Kondo <kkondo@pacbell.net>
Date: Mon, 30 Aug 1999 00:06:21 -0700
To: <www-international@w3.org>
Message-ID: <LOBBLJFAGBEPMIPICADGMEPMCFAA.kkondo@pacbell.net>

Hi all,

While studying the URLUTF8Encoder source code on the W3C web site (http://www.w3.org/International/URLUTF8Encoder.java), I ran into a question about its URL UTF-8 encoding.

URLUTF8Encoder always ORs 0xc0 into the first byte of a multi-byte encoded character. Even a character above 0x07ff is encoded as %cX%XX%XX. I think this is wrong. In UTF-8, a character encoded as a 3-byte sequence must start with %eX. For example, the Unicode character \u65e5 should be "0xe6, 0x97, 0xa5" in UTF-8, but URLUTF8Encoder encodes it as "%c6%97%a5".
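To make the arithmetic concrete, here is a minimal sketch that encodes one character both ways. The class and method names are hypothetical and are not taken from URLUTF8Encoder.java. For simplicity it percent-encodes every character, including ASCII, and it does not handle surrogate pairs.

```java
class Utf8LeadByteSketch {
    // Formats one byte value as a lowercase percent escape, e.g. 0xe6 becomes "%e6".
    static String pct(int b) {
        return String.format("%%%02x", b & 0xff);
    }

    // Encodes a single char using the standard UTF-8 lead bytes:
    // 0xxxxxxx for 1 byte, 110xxxxx for 2 bytes, 1110xxxx for 3 bytes.
    static String encode(char ch) {
        if (ch <= 0x007f) {
            return pct(ch);
        } else if (ch <= 0x07ff) {
            return pct(0xc0 | (ch >> 6)) + pct(0x80 | (ch & 0x3f));
        } else {
            return pct(0xe0 | (ch >> 12))
                 + pct(0x80 | ((ch >> 6) & 0x3f))
                 + pct(0x80 | (ch & 0x3f));
        }
    }

    // Reproduces the reported behaviour for a 3-byte character:
    // the lead byte is built with 0xc0 instead of 0xe0.
    static String encodeWithC0Lead(char ch) {
        return pct(0xc0 | (ch >> 12))
             + pct(0x80 | ((ch >> 6) & 0x3f))
             + pct(0x80 | (ch & 0x3f));
    }

    public static void main(String[] args) {
        char ch = (char) 0x65e5;
        System.out.println(encode(ch));           // %e6%97%a5
        System.out.println(encodeWithC0Lead(ch)); // %c6%97%a5
    }
}
```

Only the first byte differs between the two results; the two continuation bytes are identical.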

In fact, the decode method (unescape) doesn't check whether the UTF-8 input is well-formed, so it can decode "%c6%97%a5" back to \u65e5. But URLUTF8Encoder's output may not work with other decoders that do check for well-formed UTF-8.
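As a sketch of what a strict decoder does with these bytes, the following uses the JDK's own UTF-8 decoder configured to report malformed input. It assumes a current JDK (java.nio.charset.StandardCharsets); the class and method names are hypothetical and not part of URLUTF8Encoder.java.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class StrictUtf8Sketch {
    // Returns the decoded string, or null if the bytes are not well-formed UTF-8.
    static String decodeStrict(byte[] bytes) {
        try {
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes))
                    .toString();
        } catch (CharacterCodingException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        byte[] wellFormed  = {(byte) 0xe6, (byte) 0x97, (byte) 0xa5};
        byte[] fromEncoder = {(byte) 0xc6, (byte) 0x97, (byte) 0xa5};
        System.out.println(decodeStrict(wellFormed) != null);  // true
        System.out.println(decodeStrict(fromEncoder) != null); // false
    }
}
```

The second sequence is rejected because 0xc6 announces a 2-byte sequence, which leaves the trailing 0xa5 as a continuation byte with no lead byte.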

Line 89 in URLUTF8Encoder.java should be:
	sbuf.append(hex[0xe0 | (ch >> 12)]);

What do you think?

Received on Monday, 30 August 1999 03:05:24 UTC