- From: Lloyd Honomichl <lloyd@honomichl.com>
- Date: Thu, 12 Feb 2004 11:43:31 -0500
- To: Jon Hanna <jon@hackcraft.net>
- Cc: Varun <mvarun@cisco.com>, "www-international@w3.org" <www-international@w3.org>
On Feb 12, 2004, at 9:54 AM, Jon Hanna wrote:

> If this isn't possible to do, then while it is possible to determine the
> encoding by examining the data, this can be expensive, and also ambiguous
> in some cases (in particular in a stream of text which is in one of the
> encodings where the first 128 code points coincide with those of US-ASCII -
> examples including UTF-8 and all of the ISO 8859 family - if only one or
> two characters from outside that range occur in the text). I'd try really
> hard to avoid going down that route unless absolutely necessary.

I'd be very cautious about saying it's possible to determine the encoding by examining the data. I think it would be safe to say that, given an encoding X and the data, it is often possible to make statements like:

  "This data is definitely not in encoding X"
  "This data could be in encoding X"

For instance, given a stream of bytes, you might be able to state "This data does not conform to UTF-8" or "This data is not valid as Shift-JIS".

Programmatically determining the encoding without external markers (such as HTTP headers) or internal markers (such as a BOM) is in general quite difficult, particularly with small data samples. A human with some knowledge of encodings and languages might be able to determine the encoding of, for instance, a given file with 90+% accuracy, but I'm not aware of any program or library that could claim anywhere close to this accuracy.

Lloyd
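[Editor's note: the "definitely not encoding X" versus "could be encoding X" distinction, and the US-ASCII ambiguity Jon describes, can be sketched with strict decoding. The candidate list and function name below are illustrative, not a real detection library.]

```python
# Sketch: rule encodings OUT by attempting a strict decode. A failed
# decode proves "definitely not encoding X"; a successful decode only
# means "could be encoding X".

CANDIDATES = ["utf-8", "shift_jis", "iso-8859-1"]  # illustrative set

def possible_encodings(data: bytes) -> list[str]:
    """Return the candidate encodings that do not reject the data."""
    survivors = []
    for enc in CANDIDATES:
        try:
            data.decode(enc, errors="strict")
        except UnicodeDecodeError:
            continue  # definitely not this encoding
        survivors.append(enc)
    return survivors

# 0xFF 0xFE is invalid in both UTF-8 and Shift-JIS, so those can be
# ruled out; pure ASCII survives every candidate, which is exactly the
# ambiguity described above.
print(possible_encodings(b"\xff\xfe"))
print(possible_encodings(b"plain ascii text"))
```

Note that the ASCII case shows why this can never be a positive identification: every ASCII-compatible encoding remains a candidate until a byte outside that range appears.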
Received on Thursday, 12 February 2004 11:43:54 UTC