- From: Benjamin Franz <snowhare@netimages.com>
- Date: Fri, 6 Mar 1998 05:17:54 -0800 (PST)
- To: www-international@w3.org
On Thu, 5 Mar 1998, Bill Janssen wrote:

> I'd like to find an algorithm to determine the charset and language
> (in the sense of those terms defined by IETF RFC 2277,
> http://info.internet.isi.edu:80/in-notes/rfc/files/rfc2277.txt) of a C
> string, probably using the information returned by a call to setlocale:
>
>     current_locale = setlocale (LC_ALL, NULL);
>
> Is this in any way standardized? Are there good heuristics that
> can be used?

Do you mean 'meta' information about a C string that is already known to the system somehow (a kind of preset 'string type'), or do you mean the language and charset of an unknown string (not specifically in C - that just being the implementation language)?

In the first case, I am not aware of anything in C that 'pre-tags' a string as a container for a specific charset and language.

In the second case, statistical analysis is the only general approach I know of: lots of words, frequencies, and encodings in a big database. That's how I detect the various Vietnamese charset encodings (ten or so of them) for my search engine - brute-force statistical analysis of word frequencies across the encodings the text might be in. Best match wins. It works more accurately on long texts than on short ones. I need the detection because my search engine converts everything to UTF-16 internally, so a search will find all matches regardless of the original encoding.

--
Benjamin Franz
Received on Friday, 6 March 1998 08:18:25 UTC