- From: Jun Kuriyama <kuriyama@sky.rim.or.jp>
- Date: Wed, 13 Oct 1999 23:36:41 +0900
- To: html-tidy@w3.org
- Cc: Jun Kuriyama <kuriyama@sky.rim.or.jp>
When I use the -raw option with EUC-JP encoding, some entity references
(such as &copy;) are converted to ISO-8859-1 (?) character codes.  But
those codes are not converted back to entity references on output with
the -raw option.

# EUC-JP encoding uses the 8th bit.  Japanese characters in this encoding
# may include character codes 0xA0-0xFF, so EUC-JP cannot co-exist
# with other 8-bit encodings.

So I would like the -raw option not to modify any entity references in
the input, and to print them out as-is.  Can tidy accept this approach?

Index: lexer.c
===================================================================
RCS file: /tmp/tidycvs/tidy/lexer.c,v
retrieving revision 1.1.1.3
diff -u -r1.1.1.3 lexer.c
--- lexer.c  1999/10/13 14:06:29  1.1.1.3
+++ lexer.c  1999/10/13 14:06:39
@@ -358,15 +358,21 @@
             ReportEntityError(lexer, MISSING_SEMICOLON, lexer->lexbuf+start, c);
         }
 
-        lexer->lexsize = start;
-        AddCharToLexer(lexer, ch);
+        if (lexer->in->encoding == RAW)
+            if (semicolon)
+                AddCharToLexer(lexer, ';');
+        else
+        {
+            lexer->lexsize = start;
+            AddCharToLexer(lexer, ch);
 
-        if (ch == '&' && !QuoteAmpersand)
-        {
-            AddCharToLexer(lexer, 'a');
-            AddCharToLexer(lexer, 'm');
-            AddCharToLexer(lexer, 'p');
-            AddCharToLexer(lexer, ';');
+            if (ch == '&' && !QuoteAmpersand)
+            {
+                AddCharToLexer(lexer, 'a');
+                AddCharToLexer(lexer, 'm');
+                AddCharToLexer(lexer, 'p');
+                AddCharToLexer(lexer, ';');
+            }
         }
     }
 }
Index: pprint.c
===================================================================
RCS file: /tmp/tidycvs/tidy/pprint.c,v
retrieving revision 1.1.1.3
diff -u -r1.1.1.3 pprint.c
--- pprint.c  1999/10/13 14:06:30  1.1.1.3
+++ pprint.c  1999/10/13 14:06:39
@@ -291,7 +291,7 @@
     }
 
     /* except in CDATA map < to &lt; etc. */
-    if (! (mode & CDATA) )
+    if (!(mode & CDATA) && CharEncoding != RAW)
     {
         if (c == '<')
         {

Jun Kuriyama // kuriyama@sky.rim.or.jp
             // kuriyama@FreeBSD.org
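
Note on the lexer.c hunk: in C an else binds to the nearest unmatched if, so as written the else appears to attach to "if (semicolon)" rather than to the RAW test, despite the indentation. Below is a minimal sketch of the control flow the indentation suggests, with explicit braces. LexBuf, add_char and finish_entity are hypothetical stand-ins for illustration, not tidy's actual types or functions, and the QuoteAmpersand handling is left out for brevity.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for tidy's lexer buffer; not the real Lexer struct. */
typedef struct {
    char buf[64];
    size_t size;
} LexBuf;

/* Append one character and keep the buffer NUL-terminated. */
static void add_char(LexBuf *lb, char c)
{
    assert(lb->size + 1 < sizeof lb->buf);  /* leave room for the NUL */
    lb->buf[lb->size++] = c;
    lb->buf[lb->size] = '\0';
}

/*
 * Assumed to be called after the entity text (e.g. "&copy") has been copied
 * into lb beginning at offset `start`.  `ch` is the decoded character,
 * `semicolon` is nonzero if the input had a terminating ';', and `raw`
 * models the test `encoding == RAW`.
 */
static void finish_entity(LexBuf *lb, size_t start, char ch,
                          int semicolon, int raw)
{
    if (raw) {
        /* Raw mode: keep the reference text, only restore the ';'. */
        if (semicolon) {
            add_char(lb, ';');
        }
    } else {
        /* Other encodings: replace the reference with the decoded char. */
        lb->size = start;
        add_char(lb, ch);
    }
}
```

With the braces in place, the non-raw path always decodes the entity, and the raw path never does, whether or not the reference ended with a semicolon.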
Received on Wednesday, 13 October 1999 10:37:03 UTC