http://www.w3.org/Bugs/Public/show_bug.cgi?id=3550

------- Comment #3 from mike@saxonica.com 2006-08-01 13:05 -------

UltraEdit in hex mode shows the character as C3 A4, but when I read the file into a Java InputStream and display the bytes I do indeed get

c3 83 c2 a4

That's actually the UTF-8 encoding of C3 A4, which is the UTF-8 encoding of E4. So it's been doubly-encoded into UTF-8.

I got confused by UltraEdit - in hex mode it doesn't actually show the octets present in the file, it shows the UTF-16 characters after decoding from UTF-8.

I'm seeing the same byte sequence in the result file produced by Saxon, so I suspect this might be the cause of the problem. Perhaps I supplied a result file at some stage and this was incorporated into the distribution.

I suspect this double-encoding is happening as a result of the way I do canonicalization - as it's done to both files it doesn't normally show up.

Received on Tuesday, 1 August 2006 13:06:07 UTC
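The byte sequences quoted in the comment can be reproduced with a short Java sketch. It assumes the double-encoding comes from UTF-8 octets being misread as ISO-8859-1 characters and then written out as UTF-8 again; the comment does not say which step in the canonicalization is responsible, so that mechanism is only an illustration. The class and method names (DoubleEncodingDemo, hex, doubleEncode, repair) are hypothetical and are not part of Saxon or the test suite.

```java
import java.nio.charset.StandardCharsets;

public class DoubleEncodingDemo {

    // Format bytes as lowercase hex pairs separated by spaces.
    public static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    // Assumed fault: the UTF-8 octets are decoded as ISO-8859-1
    // (one character per octet) and then encoded as UTF-8 a second time.
    public static byte[] doubleEncode(String text) {
        byte[] once = text.getBytes(StandardCharsets.UTF_8);
        String misread = new String(once, StandardCharsets.ISO_8859_1);
        return misread.getBytes(StandardCharsets.UTF_8);
    }

    // Undo the assumed fault: decode UTF-8, map each character back to
    // a single octet via ISO-8859-1, then decode those octets as UTF-8.
    public static String repair(byte[] twice) {
        String misread = new String(twice, StandardCharsets.UTF_8);
        byte[] once = misread.getBytes(StandardCharsets.ISO_8859_1);
        return new String(once, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "\u00e4"; // the character with code point E4

        System.out.println("encoded once:  "
                + hex(original.getBytes(StandardCharsets.UTF_8)));
        System.out.println("encoded twice: " + hex(doubleEncode(original)));
        System.out.println("repaired ok:   "
                + repair(doubleEncode(original)).equals(original));
    }
}
```

Printing the octets directly, as here, sidesteps the editor problem described above: a hex view that decodes the file first will show C3 A4 for the doubly-encoded file, which looks exactly like a correctly encoded E4.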