RE: Language detection or Character Encoding detection (was: RE:Automatic Language Detect)

I once invented an algorithm that guesses both language and encoding. I coded it in C for
Hebrew and English years ago, and it worked quite well for plain text, even when the
languages were slightly mixed. Maybe I should dig it up. I would have to modify it to
ignore HTML tags and the like.
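
To illustrate the general idea (this is not the original C code, only a minimal
sketch in Python), one can strip tags, try a few candidate encodings, and count
Hebrew versus Latin letters in the decoded text. The function name guess, the
candidate list, and the "first strict decode wins" rule are all assumptions made
for this example; a real detector would need proper statistics and more encodings.

```python
import re

TAG = re.compile(rb"<[^>]*>")             # crude HTML tag stripper
HEBREW = re.compile(r"[\u05d0-\u05ea]")   # alef through tav
LATIN = re.compile(r"[A-Za-z]")

# Assumed candidate encodings, tried in order; the first strict decode wins.
CANDIDATES = ("ascii", "utf-8", "cp1255")


def guess(raw: bytes):
    """Return an (encoding, language) guess; either part may be None."""
    stripped = TAG.sub(b" ", raw)
    for enc in CANDIDATES:
        try:
            text = stripped.decode(enc)
        except UnicodeDecodeError:
            continue
        hebrew = len(HEBREW.findall(text))
        latin = len(LATIN.findall(text))
        if hebrew == 0 and latin == 0:
            return enc, None          # decodable, but no letters to judge by
        return enc, "he" if hebrew > latin else "en"
    return None, None


print(guess("שלום עולם".encode("cp1255")))
print(guess(b"<p>Hello world</p>"))
```

Counting letters by majority tolerates slightly mixed text, but it cannot tell
Windows-1255 from ISO-8859-8, which share the same byte positions for the
Hebrew letters.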

Jony

> -----Original Message-----
> From: www-international-request@w3.org
> [mailto:www-international-request@w3.org]On Behalf Of James Turley
> Sent: Sunday, April 23, 2000 9:22 PM
> To: 'Martin J. Duerst'; www-international@w3.org
> Subject: RE: Language detection or Character Encoding detection (was:
> RE:Automatic Language Detect)
>
>
> Duh...
>
> Yes, Martin, of course you are right. Character encoding does not always
> imply language. Guess I am a CJK guy at heart where, in most cases,
> encoding does imply language, but maybe not dialect. After all, Shift JIS
> is usually Japanese (unless you used the Greek characters to write Greek,
> or double-width (zenkaku) Roman to write English :-).
>
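
The point that Shift JIS does not strictly imply Japanese can be checked with a
short Python sketch. It relies only on Python's standard shift_jis codec; the
helper name roundtrips and the sample strings are made up for this example.

```python
def roundtrips(text: str, encoding: str = "shift_jis") -> bool:
    """True if text survives an encode/decode cycle in the given encoding."""
    try:
        return text.encode(encoding).decode(encoding) == text
    except UnicodeError:
        return False


samples = {
    "Japanese": "日本語",
    "Greek": "αβγ",                     # basic Greek letters are in JIS X 0208
    "English (zenkaku)": "ＡＢＣ",      # fullwidth Roman letters
}
for label, text in samples.items():
    print(label, roundtrips(text), text.encode("shift_jis").hex())
```

All three samples are valid Shift JIS, so the encoding alone does not settle
which language the bytes are written in.
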
> This brings up an interesting larger issue: What about "language/dialect"
> detection? For example, deduce Cantonese from Big5 (or Unicode). Or deduce
> Bahasa Indonesia from ANSI. Ebonics from ASCII? Hmmm...an enumeration API
> to list potential "language/dialect" pairs.
>
> Jim Turley, XAI
>
> --
> XA International                        14510 Big Basin Way, #240
> Contract Programming Agency             Saratoga, CA 95070
> International Software Engineering      mailto:info@xai.com
> http://www.xai.com
> +1 408 741 5577 Voice                    +1 408 741 0512 FAX
>
>
> -----Original Message-----
> From: www-international-request@w3.org
> [mailto:www-international-request@w3.org]On Behalf Of Martin J. Duerst
> Sent: Saturday, April 22, 2000 8:27 PM
> To: 'jturley@xai.com'; www-international@w3.org
> Cc: Chris Pratley
> Subject: Language detection or Character Encoding detection (was: RE:
> Automatic Language Detect)
>
>
> James - Are you speaking about language detection or detection
> of character encoding (codepage/'charset'/whatever you call it)?
>
> Though the two are somewhat related, they are actually different.
> The method name that Chris is giving below seems to indicate
> that this is about detection of character encoding.
>
> Regards,   Martin.
>
> At 00/04/21 18:43 -0700, Chris Pratley wrote:
> >James,
> >
> >Just FYI, MLANG.DLL is not a secret. It is available to third parties to use
> >via the IEAK on http://www.microsoft.com/windows/ieak.
> >
> >More detailed info on this specific aspect:
> >IMultiLanguage2::DetectInputCodePage method.
> >http://msdn.microsoft.com/workshop/misc/mlang/reference/IFaces/IMultiLanguage2/detectinputcodepage.asp
> >
> >This is a little easier than going through Jet. More info can be had from
> >http://msdn.microsoft.com . Search on "MLANG".
> >
> >Regards,
> >Chris Pratley
> >Group Program Manager
> >Microsoft Word
> >
> >Sent using Office10 build 1617ship wordmail on
> >
> >
> >-----Original Message-----
> >From: James Turley [mailto:jturley@xai.com]
> >Sent: April 19, 2000 4:29 PM
> >To: www-international@w3.org
> >Subject: Automatic Language Detect
> >
> >Hello...
> >
> >I will jump in here, re language auto-detect for Win9x/NT/2K platforms.
> >I always wondered how IE5 "autodetected" languages through the
> >[Right Mouse Click]-->Encoding-->AutoSelect process. And I was hoping
> >never to have to write one to support the 127 locales supported by
> >Windows 2000. Well...seems like you don't have to write any code, due
> >to some undocumented but useful features of Jet 4.0.
> >
> >While I was giving a seminar in Redmond, a reliable but unnamed PM from
> >MS Office 2K let me in on the secret. Microsoft Jet OLEDB 4.0 Text and
> >installable indexed-sequential access method (IISAM) uses MLANG.dll,
> >which provides language autodetect functionality for all MS products
> >through the "Extended Properties" setting. You get this for free with
> >all MS OS's and maybe MacOS.
> >
> >So...if you are using ADO (DAO works too, I am told)...
> >in VB for example, just set up an ADO connection,
> >set Provider as "Microsoft.Jet.OLEDB.4.0" and
> >set Properties("Extended Properties") = "TEXT;CharacterSet=Detect;" &
> >    "Locale=ALL;"
> >
> >This will return an enumeration recordset of "guesses" about the locales
> >represented by your text, sorted in best-guess order.
> >
> >Pretty good, and you don't even have to write any code.
> >I think it may work on Mac too.
> >
> >Email me offlist for some coding fragments.
> >
> >Jim Turley, XAI mailto:jturley@xai.com
> >
> >
> >-----Original Message-----
> >From: www-international-request@w3.org
> >[mailto:www-international-request@w3.org]On Behalf Of Santosh Rau (by
> >way of "Martin J. Duerst" <duerst@w3.org>)
> >Sent: Tuesday, April 18, 2000 8:58 PM
> >To: www-international@w3.org
> >Subject: Re: Japanese encoding?
> >
> >Hello,
> >
> >Can someone point me to RFCs/documentation on how these browsers
> >'auto-detect' the encoding used on Japanese pages? This is for the
> >case where the 'charset' is not specified. I have two URLs for which the
> >browsers work correctly:
> >
> >www.yahoo.co.jp
> >www.kantei.go.jp
> >
> >Thanks
> >Santosh Rau
> >Netmind
> >
>
>
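
On the original question of guessing the encoding of a Japanese page with no
'charset': a very rough sketch in Python is to try strict decoding with each
common candidate and keep the first that succeeds. This is not how IE or any
other browser is known to do it; the function name guess_japanese_encoding and
the candidate order are assumptions for illustration only.

```python
def guess_japanese_encoding(raw: bytes):
    """Return the first candidate codec that decodes raw strictly, else None."""
    candidates = ["utf-8", "euc_jp", "shift_jis"]
    # ISO-2022-JP is 7-bit and switches character sets with ESC sequences,
    # so it is only worth trying first when an ESC byte is present.
    if b"\x1b" in raw:
        candidates.insert(0, "iso2022_jp")
    for enc in candidates:
        try:
            raw.decode(enc)
        except UnicodeDecodeError:
            continue
        return enc
    return None


text = "日本語"
for enc in ("iso2022_jp", "utf-8", "euc_jp", "shift_jis"):
    print(enc, "->", guess_japanese_encoding(text.encode(enc)))
```

Pure ASCII input comes back as "utf-8" here, and some byte sequences are valid
in both EUC-JP and Shift JIS, so a practical detector would also need byte
frequency statistics rather than trial decoding alone.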

Received on Sunday, 23 April 2000 17:48:54 UTC