- From: <bugzilla@wiggum.w3.org>
- Date: Tue, 01 Aug 2006 13:05:53 +0000
- To: public-qt-comments@w3.org
- CC:
http://www.w3.org/Bugs/Public/show_bug.cgi?id=3550

------- Comment #3 from mike@saxonica.com 2006-08-01 13:05 -------

UltraEdit in hex mode shows the character as C3 A4, but when I read the file into a Java InputStream and display the bytes I do indeed get C3 83 C2 A4. That is the UTF-8 encoding of C3 A4, which is itself the UTF-8 encoding of E4; so the character has been doubly encoded into UTF-8. I was misled by UltraEdit: in hex mode it doesn't show the octets actually present in the file, it shows the UTF-16 characters after decoding from UTF-8.

I'm seeing the same byte sequence in the result file produced by Saxon, so I suspect this might be the cause of the problem. Perhaps I supplied a result file at some stage and it was incorporated into the distribution. I suspect the double encoding happens as a result of the way I do canonicalization; since it is applied to both files, it doesn't normally show up.
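A minimal Java sketch of the double encoding described above, assuming the character in question is U+00E4 (ä); it is an illustration only, not Saxon's or the test suite's actual code path:

    import java.nio.charset.StandardCharsets;

    public class DoubleEncodingDemo {
        public static void main(String[] args) {
            // U+00E4 (ä) encodes to the two-byte UTF-8 sequence C3 A4
            byte[] once = "\u00E4".getBytes(StandardCharsets.UTF_8);
            System.out.println(hex(once)); // c3 a4

            // If those bytes are misread as single-byte characters
            // (ISO-8859-1) and then re-encoded as UTF-8, the result is
            // the four-byte sequence C3 83 C2 A4 seen in the file.
            String misread = new String(once, StandardCharsets.ISO_8859_1);
            byte[] twice = misread.getBytes(StandardCharsets.UTF_8);
            System.out.println(hex(twice)); // c3 83 c2 a4
        }

        private static String hex(byte[] bytes) {
            StringBuilder sb = new StringBuilder();
            for (byte b : bytes) {
                sb.append(String.format("%02x ", b));
            }
            return sb.toString().trim();
        }
    }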
Received on Tuesday, 1 August 2006 13:06:07 UTC