Re: Japanese encoding? from Martin J. Duerst on 2000-04-19 (www-international@w3.org from April to June 2000)

From: Martin J. Duerst <duerst@w3.org>
Date: Wed, 19 Apr 2000 18:13:31 +0900
To: Santosh Rau <santosh@NetMind.com>, www-international@w3.org
Message-Id: <4.2.0.58.J.20000419175953.0350d240@sh.w3.mag.keio.ac.jp>
Below is some code that I wrote a few years ago, based
on some code by Jim Breen.

At 00/04/19 12:58 +0900, Santosh Rau wrote:
>Hello,
>
>Can someone point me to RFCs/documentation/Mozilla code which describes
>how 'auto-detection' in browsers (particularly Japanese) is really done.
>I have to write some code which can do some auto-detection if 'charset'
>is not specified by the web server.
>
>Thanks in advance
>
>Santosh

The code below is written for a particular C++ library,
but changing it to plain C or whatever should be very easy.

The advantage of the code below is that it also checks for
UTF-8.

Guessing is not really a good thing to do, but in this case
it's so widespread that it's difficult to do without.

Somewhere, the constants for eEOS, eUTF, eJIS,... have to be
defined.
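
One possible set of definitions is sketched here. The real
values are not part of the code below, so every value in this
block is an assumption. The only constraints taken from the
code are that eJIS has to cover the bits of eNEC, eNEW and
eOLD (the JIS branch is entered while status is eNEW|eOLD),
that eNUL is empty, and that eEOS presumably means
"EUC or SJIS", since the initial value is commented as
"everything" and the EUC and SJIS branches test eEUC and eSJS.

```cpp
// Hypothetical bit flags; the original values are not shown.
enum {
        eNUL = 0,                       // no candidate left
        eNEC = 1 << 0,                  // JIS variant chosen by ESC K
        eNEW = 1 << 1,                  // JIS variant chosen by ESC $ B
        eOLD = 1 << 2,                  // JIS variant chosen by ESC $ @
        eJIS = eNEC | eNEW | eOLD,      // any 7-bit JIS variant
        eEUC = 1 << 3,
        eSJS = 1 << 4,
        eEOS = eEUC | eSJS,             // assumed: "EUC or SJIS"
        eUTF = 1 << 5
};
```

With flags like these, a caller can test the result with
(result & eEUC), (result & eSJS) and so on, and treat a result
with more than one bit set as "still undecided".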

The idea of the code is that it starts out assuming that
the input could be anything, and keeps reading as long as
there is still more than one candidate encoding. The state
machines for the various encodings work in parallel.
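
The elimination idea can be shown in isolation with a toy
sketch. This is not the code below: the names guessSketch,
kAscii, kUtf8 and kEuc are made up for illustration, and the
EUC check is reduced to plain two-byte characters in the
range 0xA1 to 0xFE.

```cpp
#include <string>

// Hypothetical flags for this sketch only.
enum { kAscii = 1, kUtf8 = 2, kEuc = 4 };

// Number of candidate bits still set in the mask.
static int countBits(int m)
{
        int n = 0;
        while (m) { n += m & 1; m >>= 1; }
        return n;
}

int guessSketch(const std::string &bytes)
{
        int status = kAscii | kUtf8 | kEuc;     // start with everything
        int utfPending = 0;     // UTF-8 continuation bytes still expected
        bool eucSecond = false; // expecting second byte of an EUC pair

        for (unsigned char c : bytes) {
                // 7-bit machine: any high byte rules it out
                if ((status & kAscii) && c > 127)  status &= ~kAscii;

                // UTF-8 machine (simplified, up to 4-byte sequences)
                if (status & kUtf8) {
                        if (utfPending > 0) {
                                if ((c & 0xC0) == 0x80)  --utfPending;
                                else  status &= ~kUtf8;
                        }
                        else if (c < 0x80)  ;
                        else if ((c & 0xE0) == 0xC0)  utfPending = 1;
                        else if ((c & 0xF0) == 0xE0)  utfPending = 2;
                        else if ((c & 0xF8) == 0xF0)  utfPending = 3;
                        else  status &= ~kUtf8;
                }

                // EUC machine (simplified, two-byte characters only)
                if (status & kEuc) {
                        if (eucSecond) {
                                if (c >= 0xA1 && c <= 0xFE)  eucSecond = false;
                                else  status &= ~kEuc;
                        }
                        else if (c >= 0xA1 && c <= 0xFE)  eucSecond = true;
                        else if (c >= 0x80)  status &= ~kEuc;
                }

                // stop as soon as at most one candidate is left
                if (countBits(status) <= 1)  break;
        }
        return status;
}
```

Pure ASCII input never narrows the set, so all bits come back
set. As soon as only one candidate survives, the loop stops
and the rest of the input is not checked against it.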

No guarantees that this works correctly.

Regards,   Martin.


//---- GuessJapanese --------------------------------------------
int GuessJapanese (IStream &from)
{
         int status = eEOS | eUTF | eJIS;        // everything!
         u_char c;
         int sjisStatus = 0;
         int eucStatus = 0;
         Tab *utfT = utfTab;
         int utfC;
         while (from.get(c), !from.eof())  {
                 int variants = 0;
                 // JIS state machine
                 if (status & eJIS)  {
                         variants++;
                         if (status == eJIS)  {  // after escape, decide details
                                 if (c == 'K')  status = eNEC;
                                 else if (c == '$')  status = eNEW | eOLD, variants++;
                                 else if (c == '&')  status = eNEW;      // JIS 1990, eNEWER?
                         }
                         else if (status == (eNEW|eOLD))  {      // after escape and $
                                 if (c == 'B')  status = eNEW;
                                 else if (c == '@')  status = eOLD;
                                 else if (c == '(')  status = eNEW;      // JIS X 0212
                                 else status = eNUL;
                         }
                         else if (c == 0x1b)  status = eJIS, variants++;         // escape
                         else if (c>127)  status &= ~eJIS;
                 }
                 // EUC state machine
                 if (status & eEUC)  {
                         variants++;
                         if (eucStatus == 0)
                                 if (c < 128)  ;
                                 else if (c == 142)  eucStatus = 2;      // code set 2, half-width kana
                                 else if (c == 143)  eucStatus = 3;      // code set 3, JIS X 0212
                                 else if (c>=128 && c<=160)  status &= ~eEUC;
                                 else eucStatus = 1;
                         else if (eucStatus == 1)
                                 if (c <= 160)  status &= ~eEUC;
                                 else  eucStatus = 0;
                         else if (eucStatus == 2)
                                 if (c<=160 || c>=224)  status &= ~eEUC;
                                 else  eucStatus = 0;
                         else // eucStatus == 3, first byte after 143 must be > 160
                                 if (c <= 160)  status &= ~eEUC;
                                 else eucStatus = 1;
                 }
                 // SJIS state machine
                 if (status & eSJS)  {
                         variants++;
                         if (!sjisStatus)
                                 if (c<128 || (c>=161 && c<224))  ;
                                 else if (c==160 || c==128 || c>239)  status &= ~eSJS;
                                 else  sjisStatus = 1;
                         else
                                 if (c<64 || c>252 || c==127)  status &= ~eSJS;
                                 else  sjisStatus = 0;
                 }
                 // UTF state machine
                 if (status & eUTF)  {
                         variants++;
                         if (!utfT->shift)  utfC = c;
                         if (utfT->shift && ((c&0xC0) != 0x80))  status &= ~eUTF;
                         else if ((utfC & utfT->cmask) == utfT->cval)  utfT = utfTab;
                         else if (!((++utfT)->cmask))  status &= ~eUTF;
                 }
                 if (variants <= 1)  break;
         }
         return  status;
}
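
The UTF state machine above relies on Tab and utfTab, which
are not shown. The sketch below is an assumption about what
they could look like: one row per sequence length with a mask
and expected value for the lead byte, a nonzero shift for all
rows after the first, and a terminating row with cmask 0. The
function utf8StillPossible is a made-up name that runs only
the UTF part of the loop on a std::string.

```cpp
#include <string>

// Assumed layout; only cmask, cval and shift are used above.
struct Tab { int cmask; int cval; int shift; };

static Tab utfTab[] = {
        { 0x80, 0x00, 0*6 },    // 1 byte:  0xxxxxxx
        { 0xE0, 0xC0, 1*6 },    // 2 bytes: 110xxxxx
        { 0xF0, 0xE0, 2*6 },    // 3 bytes: 1110xxxx
        { 0xF8, 0xF0, 3*6 },    // 4 bytes: 11110xxx
        { 0xFC, 0xF8, 4*6 },    // 5 bytes: 111110xx
        { 0xFE, 0xFC, 5*6 },    // 6 bytes: 1111110x
        { 0,    0,    0   }     // end marker
};

// Same table walk as the UTF branch of GuessJapanese.
bool utf8StillPossible(const std::string &bytes)
{
        Tab *utfT = utfTab;
        int utfC = 0;
        for (unsigned char c : bytes) {
                if (!utfT->shift)  utfC = c;    // remember the lead byte
                if (utfT->shift && ((c & 0xC0) != 0x80))  return false;
                else if ((utfC & utfT->cmask) == utfT->cval)  utfT = utfTab;
                else if (!((++utfT)->cmask))  return false;
        }
        return true;
}
```

With this table the walk accepts "abc" and the three-byte
sequence E3 81 82, and rejects E3 followed by 41. A sequence
that is cut off at the end of the input is not rejected.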
Received on Wednesday, 19 April 2000 05:11:09 UTC