Re: charset and language of C strings?

From: Benjamin Franz <snowhare@netimages.com>
Date: Fri, 6 Mar 1998 05:17:54 -0800 (PST)
To: www-international@w3.org
Message-ID: <Pine.LNX.3.96.980306045211.28099A-100000@ns.viet.net>

On Thu, 5 Mar 1998, Bill Janssen wrote:

> I'd like to find an algorithm to determine the charset and language
> (in the sense of those terms defined by IETF RFC 2277,
> http://info.internet.isi.edu:80/in-notes/rfc/files/rfc2277.txt) of a C
> string, probably using the information returned by a call to setlocale:
> 	current_locale = setlocale (LC_ALL, NULL);
> Is this in any way standardized?  Are there good heuristics that
> can be used?

Do you mean 'meta' information about a C string that is already known to
the system somehow (a kind of preset 'string type'), or do you mean the
language and charset of an unknown string (not specifically in 'C', that
just being the implementation language)?

In the first case, I am not aware of anything in 'C' that 'pre-tags' a
string as being a container for a specific charset and language.
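
To illustrate the first case, here is a minimal sketch, assuming a POSIX
system that provides <langinfo.h>. The helper name locale_codeset is made
up for this example. What it returns describes the process-wide locale,
not any particular string: a char * carries no charset or language tag of
its own, so this is at best a guess about strings the program did not
create itself.

```c
#include <langinfo.h>
#include <locale.h>
#include <stddef.h>

/* Hypothetical helper: report the codeset of the current locale.
 * This says nothing about where a given string actually came from. */
const char *locale_codeset(void)
{
    /* Adopt the character-type locale from the environment (LANG, LC_*).
     * If that fails, the default "C" locale simply stays in effect. */
    setlocale(LC_CTYPE, "");

    /* POSIX: name of the character encoding used by the current locale.
     * The exact spelling of the name varies between systems. */
    return nl_langinfo(CODESET);
}
```

The returned name could be printed with printf("%s\n", locale_codeset());
the string returned by setlocale (LC_ALL, NULL) itself has an
implementation-defined format and is not portable to parse.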

In the second case, statistical analysis is the only general approach I
know of: lots of words, frequencies and encodings in a big database.
That's how I detect the various Vietnamese charset encodings (10 or so of
them) for my search engine: brute-force statistical analysis of word
frequencies in each encoding the text might be in. Best match wins.
It works more accurately on long texts than on short ones. I need the
detection because my search engine converts everything to UTF-16
internally, so that a search will find all matches regardless of the
original encoding.
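
The "best match wins" idea can be sketched roughly as follows. This is a
toy illustration, not the search engine's actual code: the two models
(ISO-8859-1 and UTF-8 forms of three French words), the weights, and the
names score_text and guess_charset are all assumptions made up for the
example. A real detector would use a large word-frequency table for each
candidate encoding.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One frequent word, byte for byte, as it appears in a given encoding. */
struct word_freq {
    const char *bytes;  /* encoded form of the word */
    double weight;      /* made-up relative frequency, not real data */
};

struct charset_model {
    const char *name;
    const struct word_freq *words;
    size_t n_words;
};

/* Toy models: "été", "où", "à" in ISO-8859-1 and in UTF-8. */
static const struct word_freq latin1_words[] = {
    { "\xE9t\xE9", 1.0 }, { "o\xF9", 2.0 }, { "\xE0", 3.0 },
};
static const struct word_freq utf8_words[] = {
    { "\xC3\xA9t\xC3\xA9", 1.0 }, { "o\xC3\xB9", 2.0 }, { "\xC3\xA0", 3.0 },
};
static const struct charset_model models[] = {
    { "ISO-8859-1", latin1_words, 3 },
    { "UTF-8", utf8_words, 3 },
};

/* Sum the weights of all whitespace-delimited tokens that exactly
 * match one of the model's words. */
static double score_text(const char *text, const struct charset_model *m)
{
    double score = 0.0;
    const char *p = text;
    while (*p != '\0') {
        size_t len = strcspn(p, " \t\r\n");  /* length of current token */
        for (size_t i = 0; len > 0 && i < m->n_words; i++) {
            if (strlen(m->words[i].bytes) == len &&
                memcmp(p, m->words[i].bytes, len) == 0)
                score += m->words[i].weight;
        }
        p += len;
        p += strspn(p, " \t\r\n");           /* skip whitespace */
    }
    return score;
}

/* Return the name of the highest-scoring model, or NULL if no model
 * matched any token at all. */
const char *guess_charset(const char *text)
{
    assert(text != NULL);
    const char *best = NULL;
    double best_score = 0.0;
    for (size_t i = 0; i < sizeof models / sizeof models[0]; i++) {
        double s = score_text(text, &models[i]);
        if (s > best_score) {
            best_score = s;
            best = models[i].name;
        }
    }
    return best;
}
```

With so few words per model, a short input often matches nothing and the
function returns NULL, which is the toy version of why long texts are
detected more reliably than short ones.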

Benjamin Franz
