- From: Rick Cameron <Rick.Cameron@crystaldecisions.com>
- Date: Fri, 14 Sep 2001 11:59:17 -0700
- To: html-tidy@w3.org
- Cc: Terry Teague <terry_teague@users.sourceforge.net>
Hi, all I believe the code in tidy.c (revision 1.35) that reads Shift-JIS characters has a problem. At line 1040 there appears to be the assumption that any byte value greater than 127 is a lead byte (i.e. the first byte of a two-byte character). I believe this is true of Big5 - but it is certainly not true of Shift-JIS. In the diagram at http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dnintl/html /S24CF.asp?frame=true you can see that the values from 0xa1 through 0xdf represent singe-byte characters. One way to fix this would be to introduce a rough-and-ready version of the Win32 IsDBCSLeadByte function. For example: static Bool isDBCSLeadByte (uint c, int encoding) { switch (encoding) { case BIG5: return c >= 0x80; case SHIFTJIS: return c >= 0x80 && !(c >= 0xa1 && c <= 0xdf); default: return no; } } Thanks! - rick
Received on Friday, 14 September 2001 14:59:54 UTC