- From: Lee Passey <lee@novonyx.com>
- Date: Fri, 05 Oct 2001 12:01:53 -0600
- To: "html-tidy@w3.org" <html-tidy@w3.org>
Bjoern Hoehrmann wrote:
>
> * Surfbird Fang wrote:
> >[-raw for unknown character encodings]
> >Although it seems to work for everybody, but still something trouble. The
> >&nbsp; entity is parsed with '?' (the HEX code is #A030 ).
>
> Could you please give some example? To me, &nbsp; is converted to a
> single byte value, i.e. 0xA0. Ok, this may cause some trouble, but
> -raw is in general said to cause trouble, especially for entities.

This is not really responsive, but ...

Walking through the code in the debugger, it appears that &nbsp; _is_
converted to 0xA0, but because the internal representation of text is
UTF-8, it is stored internally as a _double_ byte value, 0xC2 0xA0.
This can cause confusion, as the second byte of the pair is identical in
value to the single-byte value! Thus, if you do something like:

    // convert non-breaking space to space
    if ((unsigned char) lexer->lexbuf[i] == 0xA0)
        lexer->lexbuf[i] = ' ';

you will leave a 0xC2 dangling in the buffer, which becomes a UTF-8
0xC2 0x20 sequence (I don't know what this will render as, but I don't
think it's valid).

Someone attempted to work around this in the function NormalizeSpaces()
in clean.c by getting a UTF-8 character, comparing it to 160 (0xA0), and
then replacing it with ' '. Unfortunately, the node->end value is not
adjusted when one of these is found and the two-byte sequence is replaced
with a single byte, potentially leaving garbage characters in the text
node.

Of course, on output, a raw 0xA0, or "&nbsp;" or "&#160;", is exactly
what we would expect to see.
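To make the point concrete, here is a minimal sketch (not taken from the
Tidy source; the buffer handling and the function name are hypothetical)
of a replacement loop that compares against the full two-byte UTF-8
sequence 0xC2 0xA0 and compacts the buffer, so no stray 0xC2 is left
behind and the caller knows by how much to shorten its end offset:

    /* Hypothetical helper, not part of Tidy: replace UTF-8 non-breaking
     * spaces (0xC2 0xA0) with plain spaces, compacting the buffer in place.
     * Returns the new length so the caller can adjust node->end. */
    static unsigned int ReplaceNbspWithSpace(unsigned char *buf, unsigned int len)
    {
        unsigned int i = 0, j = 0;

        while (i < len)
        {
            if (i + 1 < len && buf[i] == 0xC2 && buf[i + 1] == 0xA0)
            {
                buf[j++] = ' ';   /* one byte replaces the two-byte sequence */
                i += 2;
            }
            else
            {
                buf[j++] = buf[i++];
            }
        }

        return j;   /* new length; shorter whenever a replacement was made */
    }

The return value is the key point: every replacement shortens the text by
one byte, so whatever records the end of the node has to be updated, which
is exactly the step NormalizeSpaces() appears to be missing.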
Received on Friday, 5 October 2001 13:58:40 UTC