- From: Martin J. Duerst <duerst@w3.org>
- Date: Wed, 19 Apr 2000 18:13:31 +0900
- To: Santosh Rau <santosh@NetMind.com>, www-international@w3.org
Below is some code that I wrote a few years ago, based on some code
by Jim Breen.

At 00/04/19 12:58 +0900, Santosh Rau wrote:
>Hello,
>
>Can someone point me to RFCs/documentation/Mozilla code which describes
>how 'auto-detection' in browsers (particularly Japanese) is really done.
>I have to write some code which can do some auto-detection if 'charset'
>is not specified by the web server.
>
>Thanks in advance
>
>Santosh

The code below is written for a particular C++ library, but changing it
to plain C or whatever should be very easy. The advantage of the code
below is that it also checks for UTF-8. Guessing is not really a good
thing to do, but in this case it's so widespread that it's difficult
to do without.

Somewhere, the constants for eEOS, eUTF, eJIS,... have to be defined.
The idea of the code is that it starts out assuming that the input
could be anything, and keeps reading as long as there is still more
than one possible solution. The state machines for the various
encodings work in parallel, and each one removes its encoding from
the status as soon as it sees a byte that is impossible in that
encoding.

No guarantees that this works correctly.

Regards, Martin.

    //---- GuessJapanese --------------------------------------------
    int GuessJapanese (IStream &from)
    {
        int status = eEOS | eUTF | eJIS;  // everything!
        u_char c;
        int sjisStatus = 0;
        int eucStatus = 0;
        Tab *utfT = utfTab;
        int utfC;

        while (from.get(c), !from.eof()) {
            int variants = 0;

            // JIS state machine
            if (status & eJIS) {
                variants++;
                if (status == eJIS) {  // after escape, decide details
                    if (c == 'K') status = eNEC;
                    else if (c == '$') status = eNEW | eOLD, variants++;
                    else if (c == '&') status = eNEW;  // JIS 1990, eNEWER?
                }
                else if (status == (eNEW|eOLD)) {  // after escape and $
                    if (c == 'B') status = eNEW;
                    else if (c == '@') status = eOLD;
                    else if (c == '(') status = eNEW;  // JIS X 212
                    else status = eNUL;
                }
                else if (c == 0x1b) status = eJIS, variants++;  // escape
                else if (c>127) status &= ~eJIS;
            }

            // EUC state machine
            if (status & eEUC) {
                variants++;
                if (eucStatus == 0) {
                    if (c < 128) ;
                    else if (c == 142) eucStatus = 2;  // set #2, half kana
                    else if (c == 143) eucStatus = 3;  // set #3, JIS-212
                    else if (c>=128 && c<=160) status &= ~eEUC;
                    else eucStatus = 1;
                }
                else if (eucStatus == 1) {
                    if (c <= 160) status &= ~eEUC;
                    else eucStatus = 0;
                }
                else if (eucStatus == 2) {
                    if (c<=160 || c>=224) status &= ~eEUC;
                    else eucStatus = 0;
                }
                else {  // eucStatus == 3
                    if (c>=128 && c<=160) status &= ~eEUC;
                    else eucStatus = 1;
                }
            }

            // SJIS state machine
            if (status & eSJS) {
                variants++;
                if (!sjisStatus) {
                    if (c<128 || (c>=161 && c<224)) ;
                    else if (c==160 || c==128 || c>239) status &= ~eSJS;
                    else sjisStatus = 1;
                }
                else if (c<64 || c>252 || c==127) status &= ~eSJS;
                else sjisStatus = 0;
            }

            // UTF state machine
            if (status & eUTF) {
                variants++;
                if (!utfT->shift) utfC = c;  // lead byte of a sequence
                if (utfT->shift && ((c&0xC0) != 0x80)) status &= ~eUTF;
                else if ((utfC & utfT->cmask) == utfT->cval) utfT = utfTab;
                else if (!((++utfT)->cmask)) status &= ~eUTF;
            }

            if (variants <= 1) break;  // at most one candidate left
        }
        return status;
    }
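The message does not show the constants or the utfTab table that the function relies on. Here is a minimal sketch of what they might look like. The bit values, the reading of eEOS as eEUC | eSJS, the three-field Tab layout and the helper LooksLikeUtf8 are all assumptions made for illustration, not the original definitions. The helper runs the UTF state machine from the function above on its own. Unlike the original loop, it also rejects input that ends in the middle of a sequence. The table stops at 4-byte sequences, the longest form in current UTF-8.

```cpp
#include <cassert>
#include <string>

// Hypothetical bit values; the original message does not define them.
enum {
    eNUL = 0,
    eNEC = 1 << 0,               // JIS escape sequence variants
    eOLD = 1 << 1,
    eNEW = 1 << 2,
    eJIS = eNEC | eOLD | eNEW,   // any JIS variant (must cover eNEW|eOLD)
    eEUC = 1 << 3,
    eSJS = 1 << 4,
    eEOS = eEUC | eSJS,          // assumed to mean "EUC or SJIS"
    eUTF = 1 << 5
};

// Assumed table layout: a lead byte c starts a sequence of this length
// when (c & cmask) == cval. shift is zero only for the first row, which
// is how the state machine knows it is looking at a lead byte.
struct Tab {
    int cmask;
    int cval;
    int shift;
};

static const Tab utfTab[] = {
    {0x80, 0x00, 0},    // 1 byte:  0xxxxxxx
    {0xE0, 0xC0, 6},    // 2 bytes: 110xxxxx
    {0xF0, 0xE0, 12},   // 3 bytes: 1110xxxx
    {0xF8, 0xF0, 18},   // 4 bytes: 11110xxx
    {0, 0, 0}           // end marker (cmask == 0)
};

// Hypothetical helper: the UTF state machine in isolation.
bool LooksLikeUtf8(const std::string &bytes)
{
    const Tab *utfT = utfTab;
    int utfC = 0;
    for (unsigned char c : bytes) {
        if (!utfT->shift) utfC = c;                      // remember lead byte
        if (utfT->shift && (c & 0xC0) != 0x80)
            return false;                                // not a continuation byte
        else if ((utfC & utfT->cmask) == utfT->cval)
            utfT = utfTab;                               // sequence complete
        else if (!(++utfT)->cmask)
            return false;                                // ran off the table
    }
    return utfT == utfTab;   // added check: reject a truncated sequence
}
```

With these definitions, LooksLikeUtf8("abc") holds, while the Shift_JIS bytes 0x93 0xFA 0x96 0x7B are rejected at the second byte, because 0xFA is not a UTF-8 continuation byte.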
Received on Wednesday, 19 April 2000 05:11:09 UTC