Re: Converting Chinese character-encoded HTML directly to well-formed XML??

Bjoern Hoehrmann wrote:
> 
> * Surfbird Fang wrote:
> >>[-raw for unknown character encodings]
> >Although it seems to work for everybody, there is still some trouble. The
> >&nbsp; entity is parsed as '?' (the hex code is #A030 ).
> 
> Could you please give some example? To me, &nbsp; is converted to a
> single byte value, i.e. 0xA0. Ok, this may cause some trouble, but
> -raw is in general said to cause trouble, especially for entities.

This is not really responsive, but ...

Walking through the code in the debugger, it appears that &nbsp; _is_
converted to 0xA0, but because the internal representation of text is
UTF-8, it is stored internally as the _two-byte_ sequence 0xC2 0xA0.
This can cause confusion, because the second byte of the pair is
identical in value to the single-byte representation!  Thus if you do
something like:

//  convert non-breaking space to space
if ((unsigned char) lexer->lexbuf[i] == 0xA0)
	lexer->lexbuf[i] = ' ';

you will leave a dangling 0xC2 in the buffer, which together with the
space that follows it forms the byte pair 0xC2 0x20.  That is not valid
UTF-8: a lead byte 0xC2 must be followed by a continuation byte in the
range 0x80-0xBF.
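To make that concrete, here is a tiny validity check I wrote for this
message (an illustration only; the helper name is my own, and this is
not code from Tidy):

```c
/* Hypothetical helper, not part of Tidy: a two-byte UTF-8 sequence
 * must start with a lead byte in 0xC2-0xDF followed by a continuation
 * byte in 0x80-0xBF.  The pair 0xC2 0xA0 (U+00A0, non-breaking space)
 * is valid; the pair 0xC2 0x20 left behind by the naive replacement
 * above is not. */
static int is_valid_2byte_seq(unsigned char lead, unsigned char cont)
{
    return lead >= 0xC2 && lead <= 0xDF &&
           cont >= 0x80 && cont <= 0xBF;
}
```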

Someone attempted to work around this in the function NormalizeSpaces()
in clean.c by decoding a UTF-8 character, comparing it to 160 (0xA0),
and then replacing it with ' '.  Unfortunately, the node->end value is
not adjusted when one of these is found, so a two-byte sequence is
replaced with a single byte, potentially leading to garbage characters
appearing in a text node.
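For what it's worth, the fix has to consume both bytes of the sequence,
compact the buffer, and report the new length so the caller can adjust
node->end accordingly.  A minimal sketch of that approach (again my own
illustration with a made-up name, not Tidy's actual NormalizeSpaces):

```c
#include <stddef.h>

/* Hypothetical sketch, not Tidy code: replace each UTF-8 non-breaking
 * space (the byte pair 0xC2 0xA0) with a plain ' ', compacting the
 * buffer in place.  The returned length shrinks by one byte per
 * replacement, which is exactly the adjustment node->end needs. */
static size_t nbsp_to_space(unsigned char *buf, size_t len)
{
    size_t rd = 0, wr = 0;
    while (rd < len)
    {
        if (rd + 1 < len && buf[rd] == 0xC2 && buf[rd + 1] == 0xA0)
        {
            buf[wr++] = ' ';  /* consume both bytes of the sequence */
            rd += 2;
        }
        else
        {
            buf[wr++] = buf[rd++];
        }
    }
    return wr;
}
```

For example, the 4-byte input { 'a', 0xC2, 0xA0, 'b' } comes back as
the 3 bytes "a b".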

Of course, on output, a raw 0xA0, "&nbsp;", or "&#160;" is exactly
what we would expect to see.

Received on Friday, 5 October 2001 13:58:40 UTC