Re: Japanese encoding? from Yung-Fong Tang on 2000-04-19 (www-international@w3.org from April to June 2000)

From: Yung-Fong Tang <ftang@netscape.com>
Date: Wed, 19 Apr 2000 13:01:04 -0700
To: "Yu Herbert" <Herbert.Yu@schwab.com>
CC: "'Santosh Rau'" <santosh@NetMind.com>, www-international@w3.org
Message-ID: <38FE107F.D7C691C0@netscape.com>

I wrote the current Mozilla "auto-detector".
The code is under http://lxr.mozilla.org/seamonkey/source/intl/chardet/src/
Currently there are two kinds of implementation: one for CJK and one for
Russian. Both implement the same interface, so if you have a new way to do
it, you can provide an additional implementation without changing the
current code.

The CJK one uses a parallel state machine approach. I analyzed the possible
state transitions in each charset and combined them.
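
To make the parallel state machine idea concrete, here is a rough sketch
(not the Mozilla code). Every candidate charset gets a small verifier; each
input byte is fed to all of them, verifiers that hit an error are dropped,
and a charset wins when its verifier reaches an "it's me" state or is the
only one left. The names (Verifier, DetectCharset, the state constants) are
made up for this example, and the byte ranges are simplified assumptions,
far smaller than the generated tables.

```cpp
#include <string>
#include <vector>

// Shared state values; numbers >= 3 are private to each verifier.
constexpr int kStart = 0, kError = 1, kItsMe = 2;

// Simplified UTF-8: checks lead/continuation structure only.
int Utf8Next(int s, unsigned char b) {
  if (s == kStart) {
    if (b < 0x80) return kStart;
    if (b >= 0xC2 && b <= 0xDF) return 3;  // 1 continuation byte expected
    if (b >= 0xE0 && b <= 0xEF) return 4;  // 2 expected
    if (b >= 0xF0 && b <= 0xF4) return 5;  // 3 expected
    return kError;
  }
  if (b < 0x80 || b > 0xBF) return kError;  // not a continuation byte
  return s == 3 ? kStart : s - 1;
}

// Simplified Shift_JIS: ASCII, half-width katakana, two-byte characters.
int SjisNext(int s, unsigned char b) {
  if (s == kStart) {
    if (b < 0x80 || (b >= 0xA1 && b <= 0xDF)) return kStart;
    if ((b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC)) return 3;
    return kError;
  }
  bool trail = (b >= 0x40 && b <= 0x7E) || (b >= 0x80 && b <= 0xFC);
  return trail ? kStart : kError;
}

// Simplified EUC-JP: ASCII, 0x8E + 1 byte, 0x8F + 2 bytes, two-byte pairs.
int EucJpNext(int s, unsigned char b) {
  switch (s) {
    case kStart:
      if (b < 0x80) return kStart;
      if (b == 0x8E) return 3;
      if (b == 0x8F) return 4;
      return (b >= 0xA1 && b <= 0xFE) ? 5 : kError;
    case 3:  return (b >= 0xA1 && b <= 0xDF) ? kStart : kError;
    case 4:  return (b >= 0xA1 && b <= 0xFE) ? 5 : kError;
    default: return (b >= 0xA1 && b <= 0xFE) ? kStart : kError;  // state 5
  }
}

// Simplified ISO-2022-JP: 7-bit only; ESC $ B or ESC $ @ identifies it.
int Iso2022JpNext(int s, unsigned char b) {
  if (b >= 0x80) return kError;
  switch (s) {
    case kStart: return b == 0x1B ? 3 : kStart;
    case 3:      return b == '$' ? 4 : kError;  // other escapes not handled
    default:     return (b == 'B' || b == '@') ? kItsMe : kError;  // state 4
  }
}

struct Verifier {
  const char* name;
  int (*step)(int state, unsigned char byte);
  int state;
};

// Returns the detected charset name, or "" if the input stays ambiguous.
std::string DetectCharset(const std::string& bytes) {
  std::vector<Verifier> alive = {
      {"UTF-8", Utf8Next, kStart},
      {"Shift_JIS", SjisNext, kStart},
      {"EUC-JP", EucJpNext, kStart},
      {"ISO-2022-JP", Iso2022JpNext, kStart},
  };
  for (unsigned char b : bytes) {
    std::vector<Verifier> next;
    for (Verifier v : alive) {
      v.state = v.step(v.state, b);
      if (v.state == kItsMe) return v.name;        // positive identification
      if (v.state != kError) next.push_back(v);    // still a candidate
    }
    alive = next;
    if (alive.empty()) return "";
    if (alive.size() == 1) return alive[0].name;   // last one standing
  }
  return "";
}
```

With these simplified rules, pure ASCII input and many short inputs stay
ambiguous and return "", since several verifiers accept the same bytes.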

The CJK detector code is in
  nsPSMDetectors.h  nsPSMDetectors.cpp  nsVerifier.h
The individual state machine tables are in
  nsBIG5Verifier.h  nsCP1252Verifier.h  nsEUCJPVerifier.h  nsEUCKRVerifier.h
  nsEUCTWVerifier.h  nsGB2312Verifier.h  nsHZVerifier.h  nsISO2022CNVerifier.h
  nsISO2022JPVerifier.h  nsISO2022KRVerifier.h  nsSJISVerifier.h
  nsUCS2BEVerifier.h  nsUCS2LEVerifier.h  nsUTF8Verifier.h

These .h files are generated by the Perl scripts under
  mozilla/intl/chardet/tools/
Run mozilla/intl/chardet/tools/gen.cmd on Windows to generate them.

The Russian detectors implement a statistical model which I found in the
Perl distribution and converted to C++.
You can find the implementation in  nsCyrillicDetector.cpp
nsCyrillicClass.h  nsCyrillicDetector.h  nsCyrillicProb.h
nsCyrillicClass.h is generated by  GenCyrillicClass.cpp and  gencyrillic.pl
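
As a rough illustration of frequency-based detection, here is a much
simpler stand-in, not the model in nsCyrillicDetector.cpp. It relies on one
fact about the two encodings: in KOI8-R the lowercase Cyrillic letters sit
in 0xC0-0xDF, while in windows-1251 they sit in 0xE0-0xFF. Since running
text is mostly lowercase, counting bytes in each range gives a guess. The
function name and the tie handling are my own assumptions.

```cpp
#include <cstddef>
#include <string>

// Guesses between KOI8-R and windows-1251 by counting where the high bytes
// fall. Returns "" when the counts give no preference.
std::string GuessCyrillicCharset(const std::string& bytes) {
  std::size_t low_range = 0;   // 0xC0-0xDF: lowercase in KOI8-R
  std::size_t high_range = 0;  // 0xE0-0xFF: lowercase in windows-1251
  for (unsigned char b : bytes) {
    if (b >= 0xC0 && b <= 0xDF) {
      ++low_range;
    } else if (b >= 0xE0) {
      ++high_range;
    }
  }
  if (low_range == high_range) return "";
  return low_range > high_range ? "KOI8-R" : "windows-1251";
}
```

A text in all capitals would fool this sketch, which is one reason a real
detector scores letter statistics rather than just case ranges.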

To implement a new charset detector, you basically need to do the following:
1. Implement a class which implements  nsICharsetDetector (
mozilla/intl/chardet/public/nsICharsetDetector.h ) and
nsIStringCharsetDetector (
mozilla/intl/chardet/public/nsIStringCharsetDetector.h )

2. Register your class under the same base PROGID.
3. Add the additional registration; see
mozilla/intl/chardet/src/nsCharDetModule.cpp

  rv = registry->AddSubtree(nsIRegistry::Common,
                            NS_CHARSET_DETECTOR_REG_BASE "ja_parallel_state_machine",
                            &key);
  if (NS_SUCCEEDED(rv)) {
    rv = registry->SetStringUTF8(key, "type", "ja_parallel_state_machine");
    rv = registry->SetStringUTF8(key, "defaultEnglishText", "Japanese");
  }

4. Provide the English name of the detector in the property file;
see the last section of
http://lxr.mozilla.org/seamonkey/source/intl/uconv/src/charsetTitles.properties

5. The detector can be implemented in a separate DLL.
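
The general shape of these steps (one shared interface, each detector
registered under a common base ID plus its own type name) can be sketched
without XPCOM as follows. This is only an analogy: CharsetDetector,
RegisterDetector, CreateDetector, kDetectorBase and AsciiOnlyDetector are
all hypothetical names, not the real nsICharsetDetector API or registry.

```cpp
#include <functional>
#include <map>
#include <memory>
#include <string>
#include <utility>

// Hypothetical stand-in for the shared detector interface.
class CharsetDetector {
 public:
  virtual ~CharsetDetector() = default;
  // Returns a charset name, or "" when the detector cannot decide.
  virtual std::string Detect(const std::string& bytes) = 0;
};

using DetectorFactory = std::function<std::unique_ptr<CharsetDetector>()>;

// Hypothetical base ID shared by every detector (an assumption).
const std::string kDetectorBase = "charset-detector/";

// Single registry mapping "<base><type>" to a factory.
std::map<std::string, DetectorFactory>& Registry() {
  static std::map<std::string, DetectorFactory> registry;
  return registry;
}

// Registers a detector under the common base ID plus its type name.
void RegisterDetector(const std::string& type, DetectorFactory factory) {
  Registry()[kDetectorBase + type] = std::move(factory);
}

// Looks a detector up by type; returns nullptr if none is registered.
std::unique_ptr<CharsetDetector> CreateDetector(const std::string& type) {
  auto it = Registry().find(kDetectorBase + type);
  if (it == Registry().end()) return nullptr;
  return it->second();
}

// Toy detector: reports US-ASCII when no byte has the high bit set.
class AsciiOnlyDetector : public CharsetDetector {
 public:
  std::string Detect(const std::string& bytes) override {
    for (unsigned char b : bytes) {
      if (b >= 0x80) return "";
    }
    return "US-ASCII";
  }
};
```

A caller only needs the type name: register a factory for
AsciiOnlyDetector under "ascii_only", then CreateDetector("ascii_only")
returns it through the shared interface, and existing callers do not change
when another detector is added.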

If you just want to use the auto-detection functionality, you can
1. download Mozilla and use the "DetectCharset" command-line tool, or
2. write a class which calls the interfaces in  nsICharsetDetector.h or
nsIStringCharsetDetector.h .

Also, Ken Lunde's book "CJKV Information Processing" has a good description
of Japanese auto-detection. He even has Perl code which does that for
you.


>
> -----Original Message-----
> From: Santosh Rau [mailto:santosh@NetMind.com]
> Sent: Tuesday, April 18, 2000 8:58 PM
> To: www-international@w3.org
> Subject: Re: Japanese encoding?
>
> Hello,
>
> Can someone point me to RFCs/documentation/Mozilla code which describes
> how 'auto-detection' in browsers (particularly Japanese) is really done.
> I have to write some code which can do some auto-detection if 'charset'
> is not specified by the web server.
>
> Thanks in advance
>
> Santosh

Received on Wednesday, 19 April 2000 16:04:01 UTC