- From: Richard Tobin <richard@cogsci.ed.ac.uk>
- Date: Fri, 7 Nov 2003 23:25:14 GMT
- To: "Karl Waclawek" <karl@waclawek.net>, <public-xml-testsuite@w3.org>
- Cc: "Glenn Marcy" <gmarcy@us.ibm.com>, "Sandra Martinez" <sandra.martinez@nist.gov>, "Richard Tobin" <richard@cogsci.ed.ac.uk>
> According to table 3.1B of Unicode 3.2, the sequence e0 9f ac is not
> a valid UTF-8 sequence.

Right. But if you don't check that it's legal, and follow the natural
algorithm for decoding it, you will get 7EC. Some implementations
just apply the algorithm blindly without checking.

There were two mistakes: the code point used was 2028 decimal (= 7EC hex)
instead of 2028 hex. And 2028 decimal was encoded as a 3-byte
sequence instead of a 2-byte sequence.

-- Richard
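For anyone who wants to check the arithmetic, here is a rough Python sketch of the "natural algorithm" applied without any validity check. The function name naive_decode3 is made up for this illustration and is not taken from any parser or from the test suite; it only handles the 3-byte form.

```python
def naive_decode3(b0, b1, b2):
    """Decode a 3-byte UTF-8 sequence blindly (hypothetical helper).

    Takes the low 4 bits of the lead byte and the low 6 bits of each
    trailing byte, with no check for overlong or otherwise illegal forms.
    """
    return ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F)


seq = bytes([0xE0, 0x9F, 0xAC])

# Blind decoding yields 0x7EC, which is 2028 decimal.
blind = naive_decode3(*seq)

# A checking decoder rejects the sequence: 0x7EC is below 0x800, so the
# 3-byte form is overlong. Python's built-in UTF-8 codec is strict here.
try:
    seq.decode("utf-8")
    strict_accepts = False if False else True
except UnicodeDecodeError:
    strict_accepts = False

# The shortest (legal) encodings of the two code points involved.
legal_7ec = chr(0x7EC).encode("utf-8")    # 2-byte sequence
legal_2028 = chr(0x2028).encode("utf-8")  # 3-byte sequence, the intended character

print(hex(blind), blind, strict_accepts, legal_7ec.hex(), legal_2028.hex())
```

The last two values show what the test presumably should have contained: df ac if 7EC hex had really been wanted, or e2 80 a8 for the intended 2028 hex.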
Received on Friday, 7 November 2003 18:26:10 UTC