- From: Martin J. Duerst <duerst@w3.org>
- Date: Wed, 19 Apr 2000 18:13:31 +0900
- To: Santosh Rau <santosh@NetMind.com>, www-international@w3.org
Below is some code that I wrote a few years ago, based
on some code by Jim Breen.
At 00/04/19 12:58 +0900, Santosh Rau wrote:
>Hello,
>
>Can someone point me to RFCs/documentation/Mozilla code which describes
>how 'auto-detection' in browsers (particularly Japanese) is really done.
>I have to write some code which can do some auto-detection if 'charset'
>is not specified by the web server.
>
>Thanks in advance
>
>Santosh
The code below is written for a particular C++ library,
but changing it to plain C or whatever should be very easy.
The advantage of the code below is that it also checks for
UTF-8.
Guessing is not really a good thing to do, but in this case
it's so widespread that it's difficult to do without.
Somewhere, the constants for eEOS, eUTF, eJIS,... have to be
defined.
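For illustration, here is one possible set of definitions. The values are hypothetical, since the post does not give them. The sketch assumes that each concrete encoding has its own bit, that eJIS is the union of the three JIS variants (the code tests both status & eJIS and status == eJIS), that eEOS stands for "EUC or SJIS", and that eNUL is zero.

```cpp
// Hypothetical bit-flag values; the original post does not define them.
// One bit per concrete encoding, so that candidates can be combined
// with | and eliminated with &= ~flag.
enum {
    eNUL = 0,                   // no candidate left
    eNEW = 1 << 0,              // JIS X 0208-1983, selected by ESC $ B
    eOLD = 1 << 1,              // JIS C 6226-1978, selected by ESC $ @
    eNEC = 1 << 2,              // NEC variant, selected by ESC K
    eJIS = eNEW | eOLD | eNEC,  // any 7-bit JIS variant
    eEUC = 1 << 3,              // EUC-JP
    eSJS = 1 << 4,              // Shift_JIS
    eEOS = eEUC | eSJS,         // assumption: "EUC or SJIS"
    eUTF = 1 << 5               // UTF-8
};
```

With these values, eEOS | eUTF | eJIS sets every bit, which matches the "everything!" comment on the initial status.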
The idea of the code is that it starts out assuming that
the input could be anything, and keeps reading as long as
more than one candidate encoding remains. The state machines
for the various encodings work in parallel.
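The parallel elimination idea can be shown in a much smaller form. The sketch below is not the code from this post: it uses its own hypothetical flags (kAscii, kUtf8, kSjis), reads from a std::string instead of the library stream, and checks only three candidates. Like the full function, it does not check for a sequence that is cut off at the end of the input.

```cpp
#include <string>

// Hypothetical flags used only in this sketch.
enum { kAscii = 1, kUtf8 = 2, kSjis = 4 };

// Returns the set of candidates that have not been ruled out.
int GuessSketch(const std::string &bytes)
{
    int status = kAscii | kUtf8 | kSjis;  // start with every candidate
    int utfPending = 0;      // UTF-8 continuation bytes still expected
    bool sjisTrail = false;  // true while a Shift_JIS trail byte is expected

    for (unsigned char c : bytes) {
        int variants = 0;    // candidates alive when this byte is examined

        if (status & kAscii) {
            variants++;
            if (c > 127) status &= ~kAscii;
        }
        if (status & kUtf8) {
            variants++;
            if (utfPending) {
                if ((c & 0xC0) == 0x80) utfPending--;
                else status &= ~kUtf8;
            }
            else if (c >= 0xC2 && c <= 0xDF) utfPending = 1;
            else if (c >= 0xE0 && c <= 0xEF) utfPending = 2;
            else if (c >= 0xF0 && c <= 0xF4) utfPending = 3;
            else if (c > 127) status &= ~kUtf8;
        }
        if (status & kSjis) {
            variants++;
            if (sjisTrail) {
                if (c < 0x40 || c > 0xFC || c == 0x7F) status &= ~kSjis;
                else sjisTrail = false;
            }
            else if ((c >= 0x81 && c <= 0x9F) || (c >= 0xE0 && c <= 0xEF))
                sjisTrail = true;
            else if (c == 0x80 || c == 0xA0 || c > 0xEF) status &= ~kSjis;
        }
        if (variants <= 1) break;  // nothing left to decide
    }
    return status;
}
```

Short inputs often stay ambiguous. For example, the three bytes E3 81 82 are valid both as UTF-8 and as Shift_JIS, so both flags remain set, which is one reason why guessing is only a fallback.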
No guarantees that this works correctly.
Regards, Martin.
//---- GuessJapanese --------------------------------------------
int GuessJapanese (IStream &from)
{
    int status = eEOS | eUTF | eJIS;  // everything!
    u_char c;
    int sjisStatus = 0;
    int eucStatus = 0;
    Tab *utfT = utfTab;
    int utfC;

    while (from.get(c), !from.eof()) {
        int variants = 0;

        // JIS state machine
        if (status & eJIS) {
            variants++;
            if (status == eJIS) {  // after escape, decide details
                if (c == 'K') status = eNEC;
                else if (c == '$') status = eNEW | eOLD, variants++;
                else if (c == '&') status = eNEW;  // JIS 1990, eNEWER?
            }
            else if (status == (eNEW|eOLD)) {  // after escape and $
                if (c == 'B') status = eNEW;
                else if (c == '@') status = eOLD;
                else if (c == '(') status = eNEW;  // JIS X 212
                else status = eNUL;
            }
            else if (c == 0x1b) status = eJIS, variants++;  // escape
            else if (c>127) status &= ~eJIS;
        }

        // EUC state machine
        if (status & eEUC) {
            variants++;
            if (eucStatus == 0) {
                if (c < 128) ;
                else if (c == 142) eucStatus = 2;  // set #2, half kana
                else if (c == 143) eucStatus = 3;  // set #3, JIS-212
                else if (c>=128 && c<=160) status &= ~eEUC;
                else eucStatus = 1;
            }
            else if (eucStatus == 1) {
                if (c <= 160) status &= ~eEUC;
                else eucStatus = 0;
            }
            else if (eucStatus == 2) {
                if (c<=160 || c>=224) status &= ~eEUC;
                else eucStatus = 0;
            }
            else {  // eucStatus == 3
                if (c>=128 && c<=160) status &= ~eEUC;
                else eucStatus = 1;
            }
        }

        // SJIS state machine
        if (status & eSJS) {
            variants++;
            if (!sjisStatus) {
                if (c<128 || (c>=161 && c<224)) ;
                else if (c==160 || c==128 || c>239) status &= ~eSJS;
                else sjisStatus = 1;
            }
            else {
                if (c<64 || c>252 || c==127) status &= ~eSJS;
                else sjisStatus = 0;
            }
        }

        // UTF state machine
        if (status & eUTF) {
            variants++;
            if (!utfT->shift) utfC = c;
            if (utfT->shift && ((c&0xC0) != 0x80)) status &= ~eUTF;
            else if ((utfC & utfT->cmask) == utfT->cval) utfT = utfTab;
            else if (!((++utfT)->cmask)) status &= ~eUTF;
        }

        if (variants <= 1) break;
    }
    return status;
}
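The function also relies on a Tab type and a utfTab table that are not shown in the post. A plausible layout, assumed here and modeled on the well-known FSS-UTF sample code, has one row per sequence length with a zero cmask marking the end. The helper LooksLikeUtf8 is hypothetical; it runs only the UTF state machine from above over a buffer, so that the table walk can be tried in isolation.

```cpp
#include <cstddef>

// Assumed layout; the original post does not show Tab or utfTab.
struct Tab {
    int cmask;  // bits of the lead byte to test
    int cval;   // expected value of those bits
    int shift;  // 6 * (number of continuation bytes)
};

static Tab utfTab[] = {
    { 0x80, 0x00, 0 * 6 },  // 0xxxxxxx: 1-byte sequence
    { 0xE0, 0xC0, 1 * 6 },  // 110xxxxx: 2-byte sequence
    { 0xF0, 0xE0, 2 * 6 },  // 1110xxxx: 3-byte sequence
    { 0xF8, 0xF0, 3 * 6 },  // 11110xxx: 4-byte sequence
    { 0xFC, 0xF8, 4 * 6 },  // 111110xx: 5-byte sequence (old UTF-8)
    { 0xFE, 0xFC, 5 * 6 },  // 1111110x: 6-byte sequence (old UTF-8)
    { 0, 0, 0 }             // end of table (cmask == 0)
};

// Hypothetical helper: the UTF state machine on its own.
// Returns false as soon as a byte cannot continue a UTF-8 sequence.
bool LooksLikeUtf8(const unsigned char *p, std::size_t n)
{
    Tab *utfT = utfTab;
    int utfC = 0;  // lead byte of the sequence in progress

    for (std::size_t i = 0; i < n; i++) {
        unsigned char c = p[i];
        if (!utfT->shift) utfC = c;  // at table start: remember lead byte
        if (utfT->shift && ((c & 0xC0) != 0x80)) return false;
        else if ((utfC & utfT->cmask) == utfT->cval) utfT = utfTab;
        else if (!((++utfT)->cmask)) return false;
    }
    return true;
}
```

Each byte moves utfT one row down the table until the lead byte matches the row's pattern, at which point the sequence is complete and the walk restarts. As written, this accepts overlong forms and does not notice a sequence that is truncated at the end of the buffer, so it is a plausibility check rather than a validator.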
Received on Wednesday, 19 April 2000 05:11:09 UTC