- From: Jonathan Rosenne <rosenne@qsm.co.il>
- Date: Mon, 24 Apr 2000 00:46:24 +0200
- To: jturley@xai.com, "'Martin J. Duerst'" <duerst@w3.org>, www-international@w3.org
I had invented an algorithm that guesses both language and encoding. I
coded it in C for Hebrew and English years ago and it worked quite well
for plain text, even when languages were slightly mixed. Maybe I should
dig it up. I would have to modify it to ignore HTML tags etc.

Jony

> -----Original Message-----
> From: www-international-request@w3.org
> [mailto:www-international-request@w3.org]On Behalf Of James Turley
> Sent: Sunday, April 23, 2000 9:22 PM
> To: 'Martin J. Duerst'; www-international@w3.org
> Subject: RE: Language detection or Character Encoding detection (was:
> RE:Automatic Language Detect)
>
> Duh...
>
> Yes, Martin, of course you are right. Character encoding does not
> always imply language. Guess I am a CJK guy at heart where, in most
> cases, encoding usually does imply language, but maybe not dialect.
> After all Shift JIS is usually Japanese (unless you used the Greek
> characters to write Greek, or double width (zenkaku) Roman to write
> English :-).
>
> This brings up an interesting larger issue: What about
> "language/dialect" detection? For example, deduce Cantonese from Big5
> (or Unicode). Or deduce Bahasa Indonesian from ANSI. Ebonics from
> ASCII? Hmmm...an enumeration API to list potential "language/dialect"
> pairs.
>
> Jim Turley, XAI
>
> --
> XA International                      14510 Big Basin Way, #240
> Contract Programming Agency           Saratoga, CA 95070
> International Software Engineering    mailto:info@xai.com
> http://www.xai.com
> +1 408 741 5577 Voice    +1 408 741 0512 FAX
>
> -----Original Message-----
> From: www-international-request@w3.org
> [mailto:www-international-request@w3.org]On Behalf Of Martin J. Duerst
> Sent: Saturday, April 22, 2000 8:27 PM
> To: 'jturley@xai.com'; www-international@w3.org
> Cc: Chris Pratley
> Subject: Language detection or Character Encoding detection (was: RE:
> Automatic Language Detect)
>
> James - Are you speaking about language detection or detection
> of character encoding (codepage/'charset'/whatever you call it)?
>
> Though the two are somewhat related, they are actually different.
> The method name that Chris is giving below seems to indicate
> that this is about detection of character encoding.
>
> Regards, Martin.
>
> At 00/04/21 18:43 -0700, Chris Pratley wrote:
> >James,
> >
> >Just FYI, MLANG.DLL is not a secret. It is available to third parties
> >to use via the IEAK on http://www.microsoft.com/windows/ieak.
> >
> >More detailed info on this specific aspect:
> >IMultiLanguage2::DetectInputCodePage method.
> >http://msdn.microsoft.com/workshop/misc/mlang/reference/IFaces/IMultiLanguage2/detectinputcodepage.asp
> >
> >This is a little easier than going through Jet. More info can be had
> >from http://msdn.microsoft.com . Search on "MLANG".
> >
> >Regards,
> >Chris Pratley
> >Group Program Manager
> >Microsoft Word
> >
> >Sent using Office10 build 1617ship wordmail on
> >
> >-----Original Message-----
> >From: James Turley [mailto:jturley@xai.com]
> >Sent: April 19, 2000 4:29 PM
> >To: www-international@w3.org
> >Subject: Automatic Language Detect
> >
> >Hello...
> >
> >I will jump in here, re language Auto Detect for Win9x/NT/2K
> >platforms. I always wondered how IE5 "Autodetected" languages
> >through the [Right Mouse Click]-->Encoding-->AutoSelect
> >process. And..I was hoping never to have to write
> >one to support the 127 locales supported by Windows 2000.
> >Well...seems like you don't have to write any code, due
> >to some undocumented but useful features of Jet 4.0.
> >
> >While I was giving a seminar in Redmond, a reliable but
> >unnamed PM from MS Office 2K let me in on the secret.
> >Microsoft Jet OLEDB 4.0 Text and installable
> >indexed-sequential access method (IISAM) uses
> >MLANG.dll which provides Language autodetect functionality
> >for all MS products through "Extended Properties" setting. You get
> >this for free with all MS OS's and maybe MacOS.
> >
> >So...if you are using ADO (DAO works too, I am told)...
> >in VB for example, just set up an ADO connection,
> >set Provider as "Microsoft.Jet.OLEDB.4.0" and
> >set Properties("Extended Properties") = "TEXT;CharacterSet=Detect;" &
> >    "Locale=ALL;"
> >
> >This will return an enumeration recordset of "guesses" about the
> >locales represented by your text, sorted in best guess order.
> >
> >Pretty good..and you don't even have to write any code.
> >I think it may work on mac too.
> >
> >email me offlist for some coding fragments.
> >
> >Jim Turley, XAI mailto:jturley@xai.com
> >
> >--
> >XA International                      14510 Big Basin Way, #240
> >Contract Programming Agency           Saratoga, CA 95070
> >International Software Engineering    mailto:info@xai.com
> >http://www.xai.com
> >+1 408 741 5577 Voice    +1 408 741 0512 FAX
> >
> >-----Original Message-----
> >From: www-international-request@w3.org
> >[mailto:www-international-request@w3.org]On Behalf Of Santosh Rau (by
> >way of "Martin J. Duerst" <duerst@w3.org>)
> >Sent: Tuesday, April 18, 2000 8:58 PM
> >To: www-international@w3.org
> >Subject: Re: Japanese encoding?
> >
> >Hello,
> >
> >Can someone point me to RFCs/documentation on how these browsers
> >'auto-detect' the encoding used on Japanese pages? This is for the
> >case where the 'charset' is not specified. I have two URLs for which
> >the browsers work correctly:
> >
> >www.yahoo.co.jp
> >www.kantei.go.jp
> >
> >Thanks
> >Santosh Rau
> >Netmind
Received on Sunday, 23 April 2000 17:48:54 UTC