
RE: Language detection or Character Encoding detection (was: RE: Automatic Language Detect)

From: James Turley <jturley@xai.com>
Date: Sun, 23 Apr 2000 12:22:12 -0700
To: "'Martin J. Duerst'" <duerst@w3.org>, <www-international@w3.org>
Message-ID: <000001bfad59$3af39600$0300a8c0@eureka>
Duh...

Yes, Martin, of course you are right. Character encoding does not always
imply language. I guess I am a CJK guy at heart: there, encoding usually
does imply language, though maybe not dialect. After all, Shift JIS is
usually Japanese (unless you use its Greek characters to write Greek, or
double-width (zenkaku) Roman to write English :-).
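
To make the Shift JIS point concrete, here is a minimal Python sketch. It
assumes Python's bundled `shift_jis` codec, and the sample strings are
illustrative rather than taken from this thread. It shows that Greek and
full-width Roman text both encode as valid Shift JIS, even though neither
is Japanese.

```python
# Sample strings are illustrative; each one is valid Shift JIS text.
SAMPLES = {
    "Japanese": "日本語",
    "Greek": "αβγ",                 # Greek letters from the JIS X 0208 repertoire
    "English (zenkaku)": "ＡＢＣ",  # full-width Roman letters
}


def encode_samples(samples):
    """Return {label: Shift JIS bytes} for every sample string."""
    return {label: text.encode("shift_jis") for label, text in samples.items()}


encoded = encode_samples(SAMPLES)
for label, raw in encoded.items():
    # Every sample round-trips, so the encoding alone cannot name the language.
    print(label, raw.hex(), raw.decode("shift_jis"))
```

All three samples are two bytes per character in Shift JIS, so byte-level
structure alone does not separate them.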

This brings up an interesting larger issue: what about "language/dialect"
detection? For example, deducing Cantonese from Big5 (or Unicode), or
Bahasa Indonesia from ANSI. Ebonics from ASCII? Hmmm... an enumeration API
to list potential "language/dialect" pairs.
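
One possible shape for such an enumeration API, sketched in Python. All
names here (`Candidate`, `CANDIDATES`, `enumerate_candidates`) are
hypothetical, and the table is hand-written for illustration only. It is
not real detection data, and no existing library is implied.

```python
from typing import List, NamedTuple, Optional


class Candidate(NamedTuple):
    language: str
    dialect: Optional[str]  # None when no dialect can be suggested


# Hypothetical, hand-written table for illustration only.
CANDIDATES = {
    "big5": [Candidate("Chinese", "Cantonese"), Candidate("Chinese", "Mandarin")],
    "shift_jis": [Candidate("Japanese", None)],
    "ascii": [Candidate("English", None), Candidate("Indonesian", None)],
}


def enumerate_candidates(encoding: str) -> List[Candidate]:
    """List potential language/dialect pairs for an encoding name."""
    key = encoding.lower().replace("-", "_")
    return list(CANDIDATES.get(key, []))


for cand in enumerate_candidates("Big5"):
    print(cand.language, cand.dialect)
```

A real implementation would need to examine the text itself; the encoding
name can only narrow the list of candidates.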

Jim Turley, XAI

--
XA International                        14510 Big Basin Way, #240
Contract Programming Agency             Saratoga, CA 95070
International Software Engineering      mailto:info@xai.com
http://www.xai.com
+1 408 741 5577 Voice                    +1 408 741 0512 FAX


-----Original Message-----
From: www-international-request@w3.org
[mailto:www-international-request@w3.org]On Behalf Of Martin J. Duerst
Sent: Saturday, April 22, 2000 8:27 PM
To: 'jturley@xai.com'; www-international@w3.org
Cc: Chris Pratley
Subject: Language detection or Character Encoding detection (was: RE:
Automatic Language Detect)


James - Are you speaking about language detection or detection
of character encoding (codepage/'charset'/whatever you call it)?

Though the two are somewhat related, they are actually different.
The method name that Chris is giving below seems to indicate
that this is about detection of character encoding.
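
As a rough illustration of what character encoding detection (as opposed
to language detection) involves, here is a naive Python sketch based on
trial decoding. The function name and candidate list are assumptions made
for this example. It is not how MLANG or any browser works, and it says
nothing about the language of the text.

```python
def plausible_encodings(data, candidates=("utf-8", "euc_jp", "shift_jis", "iso2022_jp")):
    """Return the candidate encodings under which `data` decodes without error."""
    matches = []
    for name in candidates:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue  # byte sequence is illegal in this encoding
        matches.append(name)
    return matches


sample = "日本語のテキスト".encode("shift_jis")
print(plausible_encodings(sample))
```

Pure ASCII input decodes under every candidate, which is one reason
practical detectors also rely on statistics rather than validity alone.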

Regards,   Martin.

At 00/04/21 18:43 -0700, Chris Pratley wrote:
>James,
>
>Just FYI, MLANG.DLL is not a secret. It is available to third parties to
>use via the IEAK on http://www.microsoft.com/windows/ieak.
>
>More detailed info on this specific aspect:
>IMultiLanguage2::DetectInputCodePage method.
>http://msdn.microsoft.com/workshop/misc/mlang/reference/IFaces/IMultiLanguage2/detectinputcodepage.asp
>
>This is a little easier than going through Jet. More info can be had from
>http://msdn.microsoft.com . Search on "MLANG".
>
>Regards,
>Chris Pratley
>Group Program Manager
>Microsoft Word
>
>Sent using Office10 build 1617ship wordmail on
>
>
>-----Original Message-----
>From: James Turley [mailto:jturley@xai.com]
>Sent: April 19, 2000 4:29 PM
>To: www-international@w3.org
>Subject: Automatic Language Detect
>
>Hello...
>
>I will jump in here, re language autodetect for Win9x/NT/2K platforms.
>I always wondered how IE5 "autodetected" languages through the
>[Right Mouse Click]-->Encoding-->AutoSelect process. And... I was hoping
>never to have to write one to support the 127 locales supported by
>Windows 2000. Well... it seems you don't have to write any code, due to
>some undocumented but useful features of Jet 4.0.
>
>While I was giving a seminar in Redmond, a reliable but unnamed PM from
>MS Office 2K let me in on the secret. Microsoft Jet OLEDB 4.0 Text and
>installable indexed-sequential access method (IISAM) uses MLANG.dll,
>which provides language autodetect functionality for all MS products
>through the "Extended Properties" setting. You get this for free with
>all MS OS's and maybe MacOS.
>
>So... if you are using ADO (DAO works too, I am told) in VB, for example,
>just set up an ADO connection, set Provider to "Microsoft.Jet.OLEDB.4.0",
>and set
>
>    Properties("Extended Properties") = "TEXT;CharacterSet=Detect;" &
>        "Locale=ALL;"
>
>This will return an enumeration recordset of "guesses" about the locales
>represented by your text, sorted in best-guess order.
>
>Pretty good... and you don't even have to write any code.
>I think it may work on Mac too.
>
>Email me offlist for some coding fragments.
>
>Jim Turley, XAI mailto:jturley@xai.com
>
>
>-----Original Message-----
>From: www-international-request@w3.org
>[mailto:www-international-request@w3.org]On Behalf Of Santosh Rau (by
>way of "Martin J. Duerst" <duerst@w3.org>)
>Sent: Tuesday, April 18, 2000 8:58 PM
>To: www-international@w3.org
>Subject: Re: Japanese encoding?
>
>Hello,
>
>Can someone point me to RFCs/documentation on how these browsers
>'auto-detect' the encoding used on Japanese pages? This is for the
>case where the 'charset' is not specified. I have two URLs for which the
>browsers work correctly:
>
>www.yahoo.co.jp
>www.kantei.go.jp
>
>Thanks
>Santosh Rau
>Netmind
>
Received on Sunday, 23 April 2000 15:23:33 GMT
