- From: Yung-Fong Tang <ftang@netscape.com>
- Date: Wed, 19 Apr 2000 13:01:04 -0700
- To: "Yu Herbert" <Herbert.Yu@schwab.com>
- CC: "'Santosh Rau'" <santosh@NetMind.com>, www-international@w3.org
- Message-ID: <38FE107F.D7C691C0@netscape.com>
I wrote the current Mozilla "auto-detector". The code is under
http://lxr.mozilla.org/seamonkey/source/intl/chardet/src/

Currently, there are two kinds of implementation: one for CJK and one for Russian. Both of them implement the same interface. Therefore, if you have a new way to do it, it is possible to provide an additional implementation without changing the current code.

The CJK one uses a parallel state machine approach. I analyzed the possible state transitions in each charset and combined them. The CJK detector code is in

    nsPSMDetectors.h
    nsPSMDetectors.cpp
    nsVerifier.h

Individual state machine tables are in

    nsBIG5Verifier.h
    nsCP1252Verifier.h
    nsEUCJPVerifier.h
    nsEUCKRVerifier.h
    nsEUCTWVerifier.h
    nsGB2312Verifier.h
    nsHZVerifier.h
    nsISO2022CNVerifier.h
    nsISO2022JPVerifier.h
    nsISO2022KRVerifier.h
    nsSJISVerifier.h
    nsUCS2BEVerifier.h
    nsUCS2LEVerifier.h
    nsUTF8Verifier.h

These .h files are generated by the Perl scripts under mozilla/intl/chardet/tools/. Run mozilla/intl/chardet/tools/gen.cmd on Windows to generate them.

The Russian detector implements a statistical model which I found in the Perl distribution and converted to C++. You can find the implementation in

    nsCyrillicDetector.cpp
    nsCyrillicClass.h
    nsCyrillicDetector.h
    nsCyrillicProb.h

nsCyrillicClass.h is generated by GenCyrillicClass.cpp and gencyrillic.pl.

To implement a new charset detector, you basically need to do the following:

1. Implement a class which implements nsICharsetDetector ( mozilla/intl/chardet/public/nsICharsetDetector.h ) and nsIStringCharsetDetector ( mozilla/intl/chardet/public/nsIStringCharsetDetector.h ).
2. Register your class under the same base PROGID.
3. Add the additional registration; see mozilla/intl/chardet/src/nsCharDetModule.cpp:

    rv = registry->AddSubtree(nsIRegistry::Common,
             NS_CHARSET_DETECTOR_REG_BASE "ja_parallel_state_machine", &key);
    if (NS_SUCCEEDED(rv)) {
      rv = registry->SetStringUTF8(key, "type", "ja_parallel_state_machine");
      rv = registry->SetStringUTF8(key, "defaultEnglishText", "Japanese");
    }

4. Provide an English translation of the name of the detector in the property file. See the last section in http://lxr.mozilla.org/seamonkey/source/intl/uconv/src/charsetTitles.properties
5. It can be implemented in a different DLL.

If you just want to use the auto-detection functionality, you can

1. download Mozilla and use the "DetectCharset" command-line tool, or
2. write a class which calls the interfaces in nsICharsetDetector.h or nsIStringCharsetDetector.h.

Also, Ken Lunde's book "CJKV Information Processing" has a good description of Japanese auto-detection. He even has Perl code which does that for you.

> -----Original Message-----
> From: Santosh Rau [mailto:santosh@NetMind.com]
> Sent: Tuesday, April 18, 2000 8:58 PM
> To: www-international@w3.org
> Subject: Re: Japanese encoding?
>
> Hello,
>
> Can someone point me to RFCs/documentation/Mozilla code which describes
> how 'auto-detection' in browsers (particularly Japanese) is really done.
> I have to write some code which can do some auto-detection if 'charset'
> is not specified by the web server.
>
> Thanks in advance
>
> Santosh
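
To make the parallel state machine idea from the reply above more concrete, here is a minimal sketch in C++. It is not the code in nsPSMDetectors.cpp and does not use the generated verifier tables; every name in it (Verifier, Utf8Next, EucJpNext, SjisNext, DetectCharset) is hypothetical, and the byte ranges are simplified assumptions. Each charset gets its own small state machine, all machines are fed the same bytes, and a machine that reaches an error state is dropped. If exactly one machine survives, its charset is reported.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical sketch; names and byte ranges are illustrative assumptions,
// not Mozilla's actual verifier tables.
enum { kStart = 0, kError = -1 };

struct Verifier {
    const char* charset;
    int (*next)(int state, uint8_t byte);  // next state, or kError
    int state;
    bool alive;
};

// UTF-8 (simplified): state = continuation bytes still expected.
// Overlong forms and surrogates are not rejected here.
static int Utf8Next(int state, uint8_t b) {
    if (state > 0) return (b & 0xC0) == 0x80 ? state - 1 : kError;
    if (b < 0x80) return kStart;
    if (b >= 0xC2 && b <= 0xDF) return 1;
    if (b >= 0xE0 && b <= 0xEF) return 2;
    if (b >= 0xF0 && b <= 0xF4) return 3;
    return kError;
}

// EUC-JP (simplified): 1 = one trail byte expected, 2 = two bytes
// expected after 0x8F, 3 = half-width katakana expected after 0x8E.
static int EucJpNext(int state, uint8_t b) {
    if (state == 1) return (b >= 0xA1 && b <= 0xFE) ? kStart : kError;
    if (state == 2) return (b >= 0xA1 && b <= 0xFE) ? 1 : kError;
    if (state == 3) return (b >= 0xA1 && b <= 0xDF) ? kStart : kError;
    if (b < 0x80) return kStart;
    if (b == 0x8E) return 3;
    if (b == 0x8F) return 2;
    if (b >= 0xA1 && b <= 0xFE) return 1;
    return kError;
}

// Shift_JIS (simplified, lead bytes extended up to 0xFC):
// 1 = trail byte expected.
static int SjisNext(int state, uint8_t b) {
    if (state == 1) {
        bool ok = (b >= 0x40 && b <= 0x7E) || (b >= 0x80 && b <= 0xFC);
        return ok ? kStart : kError;
    }
    if (b < 0x80) return kStart;
    if (b >= 0xA1 && b <= 0xDF) return kStart;  // half-width katakana
    if ((b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC)) return 1;
    return kError;
}

// Feeds every byte to all surviving machines in parallel. Returns the
// charset name if exactly one machine survives, otherwise "".
std::string DetectCharset(const std::string& bytes) {
    std::vector<Verifier> vs = {
        {"UTF-8", Utf8Next, kStart, true},
        {"EUC-JP", EucJpNext, kStart, true},
        {"Shift_JIS", SjisNext, kStart, true},
    };
    size_t alive = vs.size();
    for (unsigned char c : bytes) {
        for (auto& v : vs) {
            if (!v.alive) continue;
            v.state = v.next(v.state, c);
            if (v.state == kError) { v.alive = false; --alive; }
        }
        if (alive <= 1) break;  // decided by elimination, or nothing left
    }
    if (alive == 1) {
        for (const auto& v : vs) if (v.alive) return v.charset;
    }
    return "";  // ambiguous (e.g. pure ASCII) or no candidate matched
}
```

Calling DetectCharset on a byte string returns an empty string whenever more than one machine is still alive at the end of the input. That happens for pure ASCII and can happen for short inputs, because the Shift_JIS machine in this sketch accepts many byte sequences; elimination alone does not always settle on one charset.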
Received on Wednesday, 19 April 2000 16:04:01 UTC