
Re: query - internationalization

From: Lloyd Honomichl <lloyd@honomichl.com>
Date: Thu, 12 Feb 2004 11:43:31 -0500
Message-Id: <976264DE-5D7A-11D8-B4C0-000A95B963B8@honomichl.com>
Cc: Varun <mvarun@cisco.com>, "www-international@w3.org" <www-international@w3.org>
To: Jon Hanna <jon@hackcraft.net>

On Feb 12, 2004, at 9:54 AM, Jon Hanna wrote:

I'd be very cautious about saying it's possible to determine the encoding 
by examining the data.

I think it would be safe to say that, given an encoding X and the data, 
it is often possible to make statements like:
	"This data is definitely not in encoding X"
	"This data could be in encoding X"

For instance, given a stream of bytes, you might be able to state "This 
data does not conform to UTF-8" or "This data is not valid as 
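
To make the "definitely not" versus "could be" distinction concrete, here 
is a minimal Python sketch (my own illustration, not part of the original 
mail; the helper name could_be is hypothetical) that rules an encoding 
out by attempting to decode the bytes:

    # A decode error proves the data is *not* valid in that encoding;
    # a successful decode only proves it *could* be in that encoding.
    def could_be(data: bytes, encoding: str) -> bool:
        try:
            data.decode(encoding)
            return True
        except (UnicodeDecodeError, LookupError):
            return False

    sample = b"\xc3\xa9toile"           # "étoile" encoded as UTF-8
    print(could_be(sample, "utf-8"))    # True  -> could be UTF-8
    print(could_be(b"\xc3(", "utf-8"))  # False -> definitely not UTF-8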

Programmatically determining the encoding without external markers (such 
as HTTP headers) or internal markers (such as a BOM) is in general quite 
difficult, particularly with small data samples.
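
For what it's worth, heuristic detectors do exist; the third-party 
chardet library for Python is one example. A quick sketch (assuming 
chardet is installed; the exact guess and confidence shown are only 
illustrative):

    # pip install chardet
    import chardet

    data = "résumé naïve".encode("iso-8859-1")
    print(chardet.detect(data))
    # e.g. {'encoding': 'ISO-8859-1', 'confidence': 0.73, 'language': ''}
    # On short samples like this the guess is often low-confidence or
    # simply wrong, which illustrates the difficulty described above.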

A human with some knowledge of encodings and languages might be able to 
determine the encoding of, for instance, a given file with 90+% 
accuracy, but I'm not aware of any program or library that could claim 
anywhere close to this accuracy.


> If this isn't possible, then while it is possible to determine the
> encoding by examining the data, this can be expensive, and also
> ambiguous in some cases (in particular in a stream of text which is in
> one of the encodings where the first 128 code points coincide with
> those of US-ASCII - examples include UTF-8 and all of the ISO 8859
> family - if only one or two characters from outside of that range
> occur in the text). I'd try really hard to avoid going down that route
> unless absolutely necessary.
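
The ambiguity Jon describes is easy to demonstrate: any byte sequence 
that stays within the US-ASCII range decodes to exactly the same text 
under every ASCII-compatible encoding. A small Python illustration 
(mine, not from the thread):

    # The same bytes are equally valid - and identical - in several encodings.
    data = b"plain ASCII text"
    for enc in ("us-ascii", "utf-8", "iso-8859-1", "iso-8859-15"):
        print(enc, repr(data.decode(enc)))
    # All four lines show the same string; nothing in the data itself
    # distinguishes these encodings.
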
Received on Thursday, 12 February 2004 11:43:54 UTC
