RE: query - internationalization from Addison Phillips [wM] on 2004-02-12 (www-international@w3.org from January to March 2004)

From: Addison Phillips [wM] <aphillips@webmethods.com>
Date: Thu, 12 Feb 2004 08:55:46 -0800
To: "Varun (by way of Martin Duerst <duerst@w3.org>)" <mvarun@cisco.com>, <www-international@w3.org>
Message-ID: <PNEHIBAMBMLHDMJDDFLHMECMHKAA.aphillips@webmethods.com>

Hi Varun,

Your question is a bit vague. A lot of the specifics depend on what "varied
sources" means, how you are receiving data, and how you will present it.

Let's assume that you are receiving the data via HTTP. For the data to be of
any use, the sender must tell you what the content is. The HTTP
"Content-Type" header field tells you what the content is supposed to be.
That field will either contain an explicit "charset" parameter, or the
charset will be implied by the MIME type you find there. See RFC 2277,
RFC 2045, etc.
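
As a rough illustration of reading the "charset" parameter, here is a minimal
Python sketch. It assumes you already have the raw Content-Type value as a
string; the function name charset_from_content_type is my own, and it leans
on the standard library's MIME header parser rather than on any HTTP client.

```python
from email.message import Message


def charset_from_content_type(header_value):
    """Return the declared charset (lowercased), or None if none is given."""
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() reads the "charset" parameter if present.
    return msg.get_content_charset()


print(charset_from_content_type("text/html; charset=ISO-8859-1"))
print(charset_from_content_type("application/octet-stream"))
```

A None result only means nothing was declared; whether a default applies
depends on the MIME type, which this sketch does not try to decide.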

If the Content-Type does not contain the charset, or you are not receiving
the data via HTTP, sometimes the data itself will indicate the charset. This
is especially true of XML files. In some cases you cannot rely on a
Content-Type being declared at all, so you may need the source to tell you
the encoding of the file. For example, for files uploaded through an HTML
FORM, you should include an additional field for the user to indicate the
character encoding of the uploaded content.
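
To make "the data itself will indicate the charset" concrete, here is a
hedged Python sketch that looks for a byte order mark or an XML encoding
declaration at the start of a byte string. The function name
declared_encoding is hypothetical. The sketch is deliberately incomplete: it
ignores UTF-32, and it only finds an XML declaration written in an
ASCII-compatible encoding.

```python
import codecs
import re

# Byte order marks this sketch recognizes (UTF-32 is deliberately left out).
_BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

# Matches e.g. <?xml version="1.0" encoding="ISO-8859-1"?>
_XML_DECL = re.compile(
    rb'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def declared_encoding(data):
    """Return an encoding the data declares about itself, or None."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    match = _XML_DECL.match(data)
    if match:
        return match.group(1).decode("ascii").lower()
    return None
```

A None result means the data made no declaration that this sketch can see,
not that the data is undecodable.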

If you don't have a charset from the source, guessing is bound to lead to
errors. You can reliably test for about one encoding: US-ASCII. (You may
also be able to detect UTF-8 with fairly high assurance, because it is very
highly patterned in ways that other encodings are not.) Detecting any other
encoding is, at best, an educated guess.
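
The two tests mentioned above can be sketched in Python with strict decoding.
The function name classify_bytes and its return labels are my own. Note that
a successful UTF-8 decode is strong evidence, not proof.

```python
def classify_bytes(data):
    """Return 'us-ascii', 'utf-8 (probable)', or None if neither fits."""
    try:
        data.decode("ascii")  # succeeds only if every byte is below 0x80
        return "us-ascii"
    except UnicodeDecodeError:
        pass
    try:
        data.decode("utf-8")  # strict: rejects malformed byte sequences
        return "utf-8 (probable)"
    except UnicodeDecodeError:
        return None  # some other encoding; identifying it would be a guess
```

For example, classify_bytes(b"caf\xc3\xa9") reports probable UTF-8, while the
ISO-8859-1 form b"caf\xe9" yields None.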

I recommend against guessing. If you cannot get the encoding from the
source, store the bytes as received. Of course, this poses a problem for
later display...

You can transcode any input stream to a Unicode encoding form, such as UTF-8
or UTF-16, provided you know the encoding. Then you can transcode that to
the target encoding your end users want (although serving Unicode is a
better choice, in my opinion). The character encoding of the source will
determine what additional precautions are necessary.
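
A minimal Python sketch of that two-step transcoding, assuming the source
encoding is already known. The function name transcode is hypothetical, and
strict error handling is my choice so that bad input fails loudly instead of
being silently altered.

```python
def transcode(data, source_encoding, target_encoding="utf-8"):
    """Decode bytes from a known encoding and re-encode them."""
    # Step 1: bytes in the source encoding -> Unicode text.
    text = data.decode(source_encoding)
    # Step 2: Unicode text -> bytes in the target encoding. This raises
    # UnicodeEncodeError if the target cannot represent a character.
    return text.encode(target_encoding)
```

For example, transcode(b"caf\xe9", "iso-8859-1") returns b"caf\xc3\xa9".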

Hope that helps.

Best Regards,

Addison

Addison P. Phillips
Director, Globalization Architecture
webMethods | Delivering Global Business Visibility
http://www.webMethods.com
Chair, W3C Internationalization (I18N) Working Group
Chair, W3C-I18N-WG, Web Services Task Force
http://www.w3.org/International

Internationalization is an architecture.
It is not a feature.

> -----Original Message-----
> From: www-international-request@w3.org
> [mailto:www-international-request@w3.org]On Behalf Of Varun (by way of
> Martin Duerst <duerst@w3.org>)
> Sent: Thursday, 12 February 2004 06:25
> To: www-international@w3.org
> Subject: query - internationalization
>
> Hello,
>
> I have an application which stores data from varied sources which send
> data in differing encodings.
> However, the application's users want a consistent encoding format.
> Since it is hard to convince diverse clients to change and send data in a
> uniform format, I would appreciate pointers to the following:
>
> - a technique to detect the encoding format of an input stream, and
> - a technique to automatically convert various formats to a standard
> encoding - say utf8.
>
> Thanks in advance for the help,
> Varun Mathur

Received on Thursday, 12 February 2004 14:43:17 UTC