
Re: query - internationalization

From: Lloyd Honomichl <lloyd@honomichl.com>
Date: Thu, 12 Feb 2004 11:43:31 -0500
Message-Id: <976264DE-5D7A-11D8-B4C0-000A95B963B8@honomichl.com>
Cc: Varun <mvarun@cisco.com>, "www-international@w3.org" <www-international@w3.org>
To: Jon Hanna <jon@hackcraft.net>


On Feb 12, 2004, at 9:54 AM, Jon Hanna wrote:

I'd be way cautious about saying it's possible to determine the encoding
by examining the data.

I think it would be safe to say that, given an encoding X and some data,
it is often possible to make statements like:
	"This data is definitely not in encoding X"
	"This data could be in encoding X"

For instance, given a stream of bytes, you might be able to state "This
data does not conform to UTF-8" or "This data is not valid as
Shift-JIS".

Programmatically determining the encoding without external markers (such 
as HTTP headers) or internal markers (such as a BOM) is in general quite 
difficult, particularly with small data samples.
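
As a rough illustration of what an internal marker buys you (again my own
sketch, not from this thread), a byte order mark can be checked before any
guessing is attempted:

    import codecs

    # Check the UTF-32 BOMs before UTF-16, since the UTF-32-LE BOM begins
    # with the same two bytes as the UTF-16-LE BOM.
    BOMS = [
        (codecs.BOM_UTF32_LE, "utf-32-le"),
        (codecs.BOM_UTF32_BE, "utf-32-be"),
        (codecs.BOM_UTF16_LE, "utf-16-le"),
        (codecs.BOM_UTF16_BE, "utf-16-be"),
        (codecs.BOM_UTF8, "utf-8-sig"),
    ]

    def encoding_from_bom(data):
        for bom, name in BOMS:
            if data.startswith(bom):
                return name
        return None   # no marker; detection falls back to heuristics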

A human with some knowledge of encodings and languages might be able to 
determine the encoding of, for instance, a given file with 90+% 
accuracy, but I'm not aware of any program or library that could claim 
anywhere close to this accuracy.

Lloyd


>
> If it isn't possible to do that, then while it is possible to determine
> the encoding by examining the data, this can be expensive, and also
> ambiguous in some cases (in particular in a stream of text which is in
> one of the encodings where the first 128 code points coincide with
> those of US-ASCII - examples include UTF-8 and all of the ISO 8859
> family - if only one or two characters from outside that range occur
> in the text). I'd try really hard to avoid going down that route
> unless absolutely necessary.
Received on Thursday, 12 February 2004 11:43:54 UTC
