- From: Yung-Fong Tang <ftang@netscape.com>
- Date: Wed, 19 Apr 2000 13:01:04 -0700
- To: "Yu Herbert" <Herbert.Yu@schwab.com>
- CC: "'Santosh Rau'" <santosh@NetMind.com>, www-international@w3.org
- Message-ID: <38FE107F.D7C691C0@netscape.com>
I wrote the current Mozilla "auto-detector".
The code is under http://lxr.mozilla.org/seamonkey/source/intl/chardet/src/
Currently there are two kinds of implementation: one for CJK and one for
Russian. Both implement the same interface, so if you have a new way to do
detection, you can provide an additional implementation without changing
the current code.
The CJK one uses a parallel state machine approach. I analyzed the
possible state transitions in each charset and combined them.
The CJK detector code is under
nsPSMDetectors.h nsPSMDetectors.cpp nsVerifier.h
Individual state machine tables are in
nsBIG5Verifier.h nsCP1252Verifier.h nsEUCJPVerifier.h nsEUCKRVerifier.h
nsEUCTWVerifier.h nsGB2312Verifier.h nsHZVerifier.h nsISO2022CNVerifier.h
nsISO2022JPVerifier.h nsISO2022KRVerifier.h nsSJISVerifier.h
nsUCS2BEVerifier.h nsUCS2LEVerifier.h nsUTF8Verifier.h
These .h files are generated by the Perl scripts under
mozilla/intl/chardet/tools/
Run mozilla/intl/chardet/tools/gen.cmd on Windows to generate them.
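As a rough illustration of the parallel state machine idea (not the actual Mozilla code or its generated tables), the sketch below runs one small state machine per charset over the same bytes, drops any machine that reaches an error state, and reports a charset once a single candidate is left. The namespace, the names Verifier and detect, and the byte ranges are assumptions made for this example; the ranges are deliberately simplified (for instance, the UTF-8 machine does not reject every overlong form).

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

namespace psm_sketch {  // hypothetical names throughout; not Mozilla's API

constexpr int kStart = 0;   // between characters
constexpr int kError = -1;  // byte sequence is illegal in this charset

// One verifier = a charset name plus a state transition function.
struct Verifier {
    const char* charset;
    int (*next)(int state, std::uint8_t b);
};

inline bool in(std::uint8_t b, int lo, int hi) { return b >= lo && b <= hi; }

// Simplified UTF-8: states 1..3 count the continuation bytes still expected.
int utf8Next(int s, std::uint8_t b) {
    if (s == kStart) {
        if (b < 0x80) return kStart;
        if (in(b, 0xC2, 0xDF)) return 1;
        if (in(b, 0xE0, 0xEF)) return 2;
        if (in(b, 0xF0, 0xF4)) return 3;
        return kError;
    }
    assert(s >= 1 && s <= 3);
    return in(b, 0x80, 0xBF) ? s - 1 : kError;
}

// Simplified Shift_JIS: state 1 means "expecting a trail byte".
int sjisNext(int s, std::uint8_t b) {
    if (s == kStart) {
        if (b < 0x80 || in(b, 0xA1, 0xDF)) return kStart;  // ASCII, half-width kana
        if (in(b, 0x81, 0x9F) || in(b, 0xE0, 0xFC)) return 1;
        return kError;
    }
    return (in(b, 0x40, 0x7E) || in(b, 0x80, 0xFC)) ? kStart : kError;
}

// Simplified EUC-JP: 1 = one trail byte left, 2 = kana after 0x8E,
// 3 = two trail bytes left after 0x8F.
int eucjpNext(int s, std::uint8_t b) {
    switch (s) {
        case kStart:
            if (b < 0x80) return kStart;
            if (in(b, 0xA1, 0xFE)) return 1;
            if (b == 0x8E) return 2;
            if (b == 0x8F) return 3;
            return kError;
        case 1: return in(b, 0xA1, 0xFE) ? kStart : kError;
        case 2: return in(b, 0xA1, 0xDF) ? kStart : kError;
        default: return in(b, 0xA1, 0xFE) ? 1 : kError;
    }
}

// Feeds every byte to all surviving verifiers. Returns the charset name, or
// an empty string when the input does not single out one candidate.
std::string detect(const std::string& bytes) {
    static const Verifier kVerifiers[] = {
        {"UTF-8", utf8Next}, {"Shift_JIS", sjisNext}, {"EUC-JP", eucjpNext}};
    struct Running { const Verifier* v; int state; };
    std::vector<Running> alive;
    for (const Verifier& v : kVerifiers) alive.push_back({&v, kStart});

    for (unsigned char c : bytes) {
        std::vector<Running> survivors;
        for (const Running& r : alive) {
            int s = r.v->next(r.state, c);
            if (s != kError) survivors.push_back({r.v, s});
        }
        alive.swap(survivors);
        if (alive.empty()) return "";
        if (alive.size() == 1) return alive[0].v->charset;
    }
    // End of input: keep only machines that stopped on a character boundary.
    const char* winner = "";
    int complete = 0;
    for (const Running& r : alive) {
        if (r.state == kStart) { winner = r.v->charset; ++complete; }
    }
    return complete == 1 ? winner : "";
}

}  // namespace psm_sketch
```

Pure ASCII input is legal in all three charsets, so detect returns an empty string for it; a caller would then fall back to some default.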
The Russian detectors implement a statistical model which I found in the
Perl distribution and converted to C++.
You can find the implementation in nsCyrillicDetector.cpp,
nsCyrillicClass.h, nsCyrillicDetector.h and nsCyrillicProb.h. The
nsCyrillicClass.h file is generated by GenCyrillicClass.cpp and gencyrillic.pl
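The tables in nsCyrillicProb.h are not reproduced here. To show only the general shape of a statistical detector (score the same bytes under each candidate charset and pick the best score), here is a much cruder stand-in: it assumes running Russian text is mostly lower-case and counts how many bytes fall in each charset's lower-case Cyrillic range. The heuristic, the namespace and the function name guessCyrillic are assumptions for this example and are not the model used in Mozilla.

```cpp
#include <cstddef>
#include <string>

namespace cyr_sketch {  // hypothetical names; a toy heuristic only

// Byte range that holds the lower-case letters in each charset
// (the letter "yo" sits outside these ranges and is ignored here).
struct Candidate {
    const char* charset;
    unsigned char lowerFirst;
    unsigned char lowerLast;
};

// Returns the charset whose lower-case range covers the most bytes,
// or an empty string when nothing matches or the best score is tied.
std::string guessCyrillic(const std::string& bytes) {
    static const Candidate kCandidates[] = {
        {"KOI8-R", 0xC0, 0xDF},
        {"ISO-8859-5", 0xD0, 0xEF},
        {"windows-1251", 0xE0, 0xFF},
    };
    const char* best = "";
    std::size_t bestScore = 0;
    bool tie = false;
    for (const Candidate& c : kCandidates) {
        std::size_t score = 0;
        for (unsigned char b : bytes) {
            if (b >= c.lowerFirst && b <= c.lowerLast) ++score;
        }
        if (score > bestScore) {
            best = c.charset;
            bestScore = score;
            tie = false;
        } else if (score == bestScore) {
            tie = true;
        }
    }
    return (bestScore == 0 || tie) ? "" : best;
}

}  // namespace cyr_sketch
```

A real model would weigh how likely each letter or letter pair is rather than just counting range hits, which is why short or mixed-case input can easily fool this version.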
To implement a new charset detector, you basically need to do the following:
1. Implement a class which implements nsICharsetDetector (
mozilla/intl/chardet/public/nsICharsetDetector.h ) and
nsIStringCharsetDetector (
mozilla/intl/chardet/public/nsIStringCharsetDetector.h )
2. Register your class under the same BASE PROGID
3. Add the additional registration - see
mozilla/intl/chardet/src/nsCharDetModule.cpp
    rv = registry->AddSubtree(nsIRegistry::Common,
                              NS_CHARSET_DETECTOR_REG_BASE "ja_parallel_state_machine",
                              &key);
    if (NS_SUCCEEDED(rv)) {
      rv = registry->SetStringUTF8(key, "type", "ja_parallel_state_machine");
      rv = registry->SetStringUTF8(key, "defaultEnglishText", "Japanese");
    }
4. Provide an English translation of the detector's name in the property file.
See the last section in
http://lxr.mozilla.org/seamonkey/source/intl/uconv/src/charsetTitles.properties
5. The detector can be implemented in a different DLL
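The steps above depend on XPCOM and the Mozilla registry, so they cannot be shown standalone. The sketch below only mirrors the design in plain C++: every detector implements one shared interface, and each is registered under a type string together with an English display name, so a new detector can be added without touching the existing ones. Detector, DetectorRegistry and BomDetector are hypothetical names invented for this example; they are not the signatures of nsICharsetDetector or nsIRegistry.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <memory>
#include <string>
#include <utility>

namespace registry_sketch {  // hypothetical stand-ins, not XPCOM interfaces

// The one interface every detector implements.
class Detector {
public:
    virtual ~Detector() = default;
    // Returns a charset name, or an empty string when undecided.
    virtual std::string detect(const std::string& bytes) = 0;
};

using Factory = std::function<std::unique_ptr<Detector>()>;

// Maps a detector type string to its English name and a factory.
class DetectorRegistry {
public:
    void add(const std::string& type, const std::string& englishName,
             Factory factory) {
        assert(factory != nullptr);
        entries_[type] = Entry{englishName, std::move(factory)};
    }

    // Creates the detector registered under `type`, or nullptr if unknown.
    std::unique_ptr<Detector> create(const std::string& type) const {
        auto it = entries_.find(type);
        if (it == entries_.end()) return nullptr;
        return it->second.factory();
    }

    // English display name for `type`, or an empty string if unknown.
    std::string englishName(const std::string& type) const {
        auto it = entries_.find(type);
        return it == entries_.end() ? std::string() : it->second.englishName;
    }

private:
    struct Entry {
        std::string englishName;
        Factory factory;
    };
    std::map<std::string, Entry> entries_;
};

// A deliberately trivial detector: it recognizes only a UTF-8 byte order mark.
class BomDetector : public Detector {
public:
    std::string detect(const std::string& bytes) override {
        return bytes.compare(0, 3, "\xEF\xBB\xBF") == 0 ? "UTF-8" : "";
    }
};

}  // namespace registry_sketch
```

A caller would register the detector once, for example with reg.add("bom_sniffer", "BOM sniffer", factory), and later ask the registry for it by type string, never naming the concrete class.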
If you just want to use the auto-detection functionality, you can
1. download Mozilla and use the "DetectCharset" command line, or
2. write a class which calls the interfaces declared in nsICharsetDetector.h
or nsIStringCharsetDetector.h .
Also, Ken Lunde's book "CJKV Information Processing" has a good description
of Japanese auto-detection. He even has Perl code which does that for
you.
>
> -----Original Message-----
> From: Santosh Rau [mailto:santosh@NetMind.com]
> Sent: Tuesday, April 18, 2000 8:58 PM
> To: www-international@w3.org
> Subject: Re: Japanese encoding?
>
> Hello,
>
> Can someone point me to RFCs/documentation/Mozilla code which describes
> how 'auto-detection' in browsers (particularly Japanese) is really done.
> I have to write some code which can do some auto-detection if 'charset'
> is not specified by the web server.
>
> Thanks in advance
>
> Santosh
Received on Wednesday, 19 April 2000 16:04:01 UTC