Re: query - internationalization

On Feb 12, 2004, at 9:54 AM, Jon Hanna wrote:

I'd be way cautious about saying it's possible to determine the encoding
by examining the data.

I think it would be safe to say that, given an encoding X and the data,
it is often possible to make statements like:
	"This data is definitely not in encoding X"
	"This data could be in encoding X"

For instance, given a stream of bytes, you might be able to state "This
data does not conform to UTF-8" or "This data is not valid as
Shift-JIS".

Programmatically determining the encoding without external markers (such 
as HTTP headers) or internal markers (such as a BOM) is in general quite 
difficult, particularly with small data samples.
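
To show why it's hard, here's another small sketch (again just an
illustration, with byte values I picked for the example): the same bytes
decode cleanly under several single-byte encodings, so examining the
data alone cannot tell them apart.

# A pure-ASCII sample is byte-for-byte identical in UTF-8 and every
# ISO 8859 part, so nothing in the data distinguishes them.
ascii_sample = b"plain ASCII text"

# One non-ASCII byte value (0xE9) decodes without error in all of these
# single-byte encodings; the decodes just disagree about what it means.
ambiguous = b"r\xe9sum\xe9"
for enc in ("iso-8859-1", "iso-8859-7", "koi8-r", "cp1251"):
    print(enc, repr(ambiguous.decode(enc)))
# Every call succeeds: 'é' in ISO 8859-1, a Greek letter in ISO 8859-7,
# Cyrillic letters in KOI8-R and CP1251 - no error tells you which one
# is right.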

A human with some knowledge of encodings and languages might be able to 
determine the encoding of, for instance, a given file with 90+% 
accuracy, but I'm not aware of any program or library that could claim 
anywhere close to this accuracy.

Lllloyd


>
> If it isn't possible to do that then, while it is possible to determine
> the encoding by examining the data, this can be expensive, and also
> ambiguous in some cases (in particular in a stream of text which is in
> one of the encodings where the first 128 code points coincide with those
> of US-ASCII - examples include UTF-8 and all of the ISO 8859 family - if
> only one or two characters from outside that range occur in the text).
> I'd try really hard to avoid going down that route unless absolutely
> necessary.

Received on Thursday, 12 February 2004 11:43:54 UTC