Re: query - internationalization

From: Jon Hanna <jon@hackcraft.net>
Date: Thu, 12 Feb 2004 14:54:32 +0000
Message-ID: <1076597672.402b93a8a159f@82.195.128.192>
To: Varun <mvarun@cisco.com>
Cc: "www-international@w3.org" <www-international@w3.org>

Quoting Varun <mvarun@cisco.com> (by way of Martin Duerst <duerst@w3.org>):
[snip]
> 
> - a technique to detect the encoding format of an input stream, and

Ideally the client should signal the encoding out-of-band. For instance, HTTP
uses headers to indicate the encoding of the content. Otherwise the format
could insist on beginning with an indicator of the encoding (cf. the XML
declaration in XML).

If you are not using HTTP for communication and not using XML for the content
then you may be able to borrow one of those techniques for your application.
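
As a rough illustration of reading such a declared encoding, here is a minimal
Python sketch. The function names are made up for this example; the header
parsing leans on the standard library's email.message.Message, and the XML
pattern only handles declarations written in an ASCII-compatible encoding
(it will not find one in UTF-16 data).

```python
import re
from email.message import Message

def charset_from_content_type(header_value):
    """Return the charset parameter of a Content-Type header value, or None.

    Hypothetical helper: it reuses the stdlib MIME header parser rather
    than hand-parsing the parameter list.
    """
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset()  # lower-cased name, or None if absent

# Matches an XML declaration at the very start of the data and captures
# the encoding name. Assumes the declaration itself is ASCII-compatible.
XML_DECL = re.compile(
    rb"^<\?xml[^>]*?encoding\s*=\s*[\"']([A-Za-z][A-Za-z0-9._-]*)[\"']"
)

def charset_from_xml_declaration(data):
    """Return the encoding named in a leading XML declaration, or None."""
    match = XML_DECL.match(data)
    return match.group(1).decode("ascii").lower() if match else None
```

A format of your own could borrow either idea: a header field carried
alongside the data, or a short ASCII prefix that names the encoding.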

If that isn't possible, then while the encoding can be determined by examining
the data, doing so can be expensive and in some cases ambiguous. In particular,
a stream of text in one of the encodings whose first 128 code points coincide
with those of US-ASCII (examples include UTF-8 and all of the ISO 8859 family)
is hard to identify if only one or two characters from outside that range occur
in the text. I'd try really hard to avoid going down that route unless
absolutely necessary.
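
To show what examining the data looks like, and where it breaks down, here is
a deliberately naive Python sketch. The name guess_encoding and the default
fallback are assumptions for illustration; real detectors use statistical
models and still get it wrong at times.

```python
import codecs

def guess_encoding(data, fallback="iso-8859-1"):
    """Naive guess: byte order mark first, then a strict UTF-8 trial decode.

    The fallback is an assumption supplied by the caller, not something
    the data proves. UTF-32 byte order marks are not handled here.
    """
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    try:
        data.decode("utf-8")  # scans the whole input, hence the cost
    except UnicodeDecodeError:
        return fallback
    return "utf-8"

# The ambiguity: the same byte is valid in several ISO 8859 parts but
# means different characters, and nothing in the byte says which.
as_latin1 = b"\xa4".decode("iso-8859-1")   # currency sign
as_latin9 = b"\xa4".decode("iso-8859-15")  # euro sign
```

Pure ASCII input passes the UTF-8 test too, so it is reported as UTF-8 even
if the sender was thinking in ISO 8859-1; for such input the difference does
not matter, but it shows how little the bytes alone tell you.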

> - a technique to automatically convert various formats to a standard
> encoding - say utf8.

Lots of libraries exist to deal with this. Windows has some functions built into
the OS, and ICU <http://oss.software.ibm.com/icu/> provides conversion functions
as well. Other libraries and open-source code can be found on the web.
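
As one example of what such a conversion looks like once the source encoding
is known, here is a small sketch using Python's built-in codecs rather than
ICU or the Windows functions. The function names are made up for this
example, and it assumes the source encoding has already been established by
one of the means above.

```python
import codecs

def to_utf8(data, source_encoding):
    """Convert a complete byte string from source_encoding to UTF-8."""
    return data.decode(source_encoding).encode("utf-8")

def transcode_stream(chunks, source_encoding):
    """Yield UTF-8 bytes for an iterable of byte chunks.

    An incremental decoder is used so that a multi-byte sequence split
    across two chunks is still decoded correctly.
    """
    decoder = codecs.getincrementaldecoder(source_encoding)(errors="strict")
    for chunk in chunks:
        text = decoder.decode(chunk)
        if text:
            yield text.encode("utf-8")
    # Flush; raises UnicodeDecodeError if the input ended mid-character.
    tail = decoder.decode(b"", final=True)
    if tail:
        yield tail.encode("utf-8")
```

With errors="strict" a wrongly declared encoding tends to fail loudly
instead of silently producing garbage, which is usually what you want when
normalising input.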

-- 
Jon Hanna
<http://www.hackcraft.net/>
*Thought provoking quote goes here*
Received on Thursday, 12 February 2004 09:54:36 UTC
