
RE: Problem processing Shift-JIS

From: Rick Cameron <Rick.Cameron@crystaldecisions.com>
Date: Fri, 14 Sep 2001 12:42:42 -0700
Message-ID: <A399661DAF14DA46AAB21B088F3887DC0200D666@vanent14.crystaldecisions.net>
To: html-tidy@w3.org
Cc: Terry Teague <terry_teague@users.sourceforge.net>
I've just submitted this problem at the SourceForge site.

Thanks

- rick

-----Original Message-----
From: Rick Cameron [mailto:Rick.Cameron@crystaldecisions.com]
Sent: Fri, 14 September 2001 11:59
To: html-tidy@w3.org
Cc: Terry Teague
Subject: Problem processing Shift-JIS


Hi, all

I believe the code in tidy.c (revision 1.35) that reads Shift-JIS characters
has a problem. At line 1040 there appears to be the assumption that any byte
value greater than 127 is a lead byte (i.e. the first byte of a two-byte
character). I believe this is true of Big5 - but it is certainly not true of
Shift-JIS. In the diagram at
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dnintl/html/S24CF.asp?frame=true
you can see that the values from 0xa1 through 0xdf represent single-byte
characters.

One way to fix this would be to introduce a rough-and-ready version of the
Win32 IsDBCSLeadByte function. For example:

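/* Returns yes if c is the first byte of a two-byte character in the
   given encoding. Rough approximation, not an exact range check. */
static Bool isDBCSLeadByte (uint c, int encoding)
{
	switch (encoding)
	{
		case BIG5:
			/* any byte with the high bit set starts a two-byte character */
			return c >= 0x80 ? yes : no;

		case SHIFTJIS:
			/* 0xa1 through 0xdf are single-byte characters, not lead bytes */
			return (c >= 0x80 && !(c >= 0xa1 && c <= 0xdf)) ? yes : no;

		default:
			/* other encodings are not treated as double-byte here */
			return no;
	}
}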
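
As a hedged illustration of the same idea, here is a stricter sketch that uses
the commonly documented lead-byte ranges (0x81-0x9f and 0xe0-0xfc for
Shift-JIS, 0x81-0xfe for Big5) and shows how a reader loop might consume two
bytes after a lead byte. The typedefs, the encoding constants, and the names
isLeadByteStrict and countChars are assumptions made so the sketch compiles on
its own; they are not taken from tidy.c.

```c
#include <stddef.h>

/* Hypothetical stand-ins for Tidy's own definitions (assumed, not copied). */
typedef unsigned int uint;
typedef enum { no, yes } Bool;
enum { BIG5 = 1, SHIFTJIS = 2 };   /* assumed values for illustration only */

/* Stricter lead-byte test based on the commonly documented ranges. */
static Bool isLeadByteStrict (uint c, int encoding)
{
	switch (encoding)
	{
		case BIG5:
			/* lead bytes 0x81..0xfe */
			return (c >= 0x81 && c <= 0xfe) ? yes : no;

		case SHIFTJIS:
			/* lead bytes 0x81..0x9f and 0xe0..0xfc;
			   0xa1..0xdf are single-byte characters */
			return ((c >= 0x81 && c <= 0x9f) ||
			        (c >= 0xe0 && c <= 0xfc)) ? yes : no;

		default:
			return no;
	}
}

/* Count characters in a buffer, consuming two bytes per lead byte.
   A lead byte at the very end of the buffer is counted as one character. */
static size_t countChars (const unsigned char *buf, size_t len, int encoding)
{
	size_t i = 0, n = 0;

	while (i < len)
	{
		if (isLeadByteStrict (buf[i], encoding) && i + 1 < len)
			i += 2;   /* lead byte plus trail byte */
		else
			i += 1;   /* single-byte character */
		n++;
	}
	return n;
}
```

With this version, a Shift-JIS buffer containing a two-byte character, a
single byte in the 0xa1-0xdf range, and an ASCII letter would be counted as
three characters rather than two.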

Thanks!

- rick
Received on Friday, 14 September 2001 15:43:15 GMT
