- From: Jon Hanna <jon@hackcraft.net>
- Date: Thu, 12 Feb 2004 14:54:32 +0000
- To: Varun <mvarun@cisco.com>
- Cc: "www-international@w3.org" <www-international@w3.org>
Quoting Varun <mvarun@cisco.com>, "by way of Martin Duerst <duerst@w3.org>":

[snip]

>
> - a technique to detect the encoding format of an input stream, and

Ideally the client should signal the encoding out-of-band. For instance, HTTP uses headers to indicate the encoding of the content. Otherwise the format could insist on beginning with an indicator of the encoding (cf. the XML declaration in XML). If you are not using HTTP for communication and not using XML for the content, you may still be able to borrow one of those techniques for your application.

If that isn't possible, then it is possible to determine the encoding by examining the data, but this can be expensive, and it is also ambiguous in some cases. In particular, many encodings share their first 128 code points with US-ASCII (examples include UTF-8 and all of the ISO 8859 family), so a stream of text in one of them is hard to tell apart from the others if only one or two characters from outside that range occur in the text. I'd try really hard to avoid going down that route unless absolutely necessary.

> - a technique to automatically convert various formats to a standard
> encoding - say utf8.

Lots of libraries exist to deal with this. Windows has some functions built into the OS, and ICU <http://oss.software.ibm.com/icu/> has functions to deal with this. Other libraries and open-source code can be found on the web.

--
Jon Hanna
<http://www.hackcraft.net/>
*Thought provoking quote goes here*
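A rough sketch of the two points above, in Python rather than Windows or ICU calls, and only under stated assumptions: the function names guess_encoding and to_utf8 are hypothetical, the ISO 8859-1 fallback is an arbitrary choice, and the sniffing step is exactly the unreliable route the message advises against. It prefers an out-of-band label when one is available, then a byte-order mark, then a strict UTF-8 decode, and only then guesses.

```python
import codecs

# Byte-order marks, UTF-32 first: the UTF-32 LE mark begins with the
# same two bytes as the UTF-16 LE mark.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def guess_encoding(data, declared=None, fallback="iso-8859-1"):
    """Return a codec name for the bytes in `data`.

    `declared` stands for an out-of-band label (for example a charset
    taken from an HTTP header) and always wins when present.
    `fallback` is an assumption, not something the data can confirm.
    """
    if declared:
        return declared
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    try:
        # Pure US-ASCII input also passes this check, so the answer
        # "utf-8" is a guess rather than proof.
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return fallback


def to_utf8(data, declared=None):
    """Decode with the declared or guessed encoding, re-encode as UTF-8."""
    text = data.decode(guess_encoding(data, declared))
    return text.encode("utf-8")
```

The ambiguity is easy to reproduce: the two bytes C3 A9 are a valid UTF-8 "é" and equally valid ISO 8859-1 text, so guess_encoding reports "utf-8" whichever the sender meant. Passing declared= sidesteps the guess entirely.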
Received on Thursday, 12 February 2004 09:54:36 UTC