- From: Addison Phillips [wM] <aphillips@webmethods.com>
- Date: Thu, 12 Feb 2004 08:55:46 -0800
- To: "Varun (by way of Martin Duerst <duerst@w3.org>)" <mvarun@cisco.com>, <www-international@w3.org>
Hi Varun,

Your question is a bit vague. A lot of the specifics depend on what "varied sources" means, how you are receiving the data, and how you will present it.

Let's assume that you are receiving the data via HTTP. For the data to have any utility, the sender must tell you what the content is. HTTP has a "Content-Type" header field that tells you what the content is supposed to be. That field will either contain an explicit "charset" parameter, or the charset will be implied by the MIME type you find there. See RFC 2277, RFC 2045, etc.

If the Content-Type does not contain the charset, or you are not receiving the data via HTTP, sometimes the data itself will indicate the charset. This is especially true of XML files, which can declare their encoding in the XML declaration.

In some cases you cannot rely on a Content-Type being declared at all, so you may need the source to tell you the encoding of the file. For example, for files uploaded through an HTML FORM, you should include an additional field in which the user indicates the character encoding of the uploaded content.

If you don't have a charset from the source, guessing is bound to lead to errors. You can reliably test for exactly one encoding: US-ASCII. (You may also be able to detect UTF-8 with fairly high assurance, because its byte sequences are very highly patterned in ways that other encodings are not.) Detecting any other encoding is, at best, an educated guess.

I recommend against guessing. If you cannot get the encoding from the source, store the bytes unchanged. Of course, this poses a problem for later display...

Provided you know the encoding, you can transcode any input stream to a Unicode encoding form, such as UTF-8 or UTF-16. Then you can transcode that to the target encoding your end users want (although serving Unicode is a better choice, in my opinion). The character encoding of the source will determine what additional precautions are necessary.

Hope that helps.

Best Regards,

Addison

Addison P. Phillips
Director, Globalization Architecture
webMethods | Delivering Global Business Visibility
http://www.webMethods.com
Chair, W3C Internationalization (I18N) Working Group
Chair, W3C-I18N-WG, Web Services Task Force
http://www.w3.org/International

Internationalization is an architecture.
It is not a feature.

> -----Original Message-----
> From: www-international-request@w3.org
> [mailto:www-international-request@w3.org]On Behalf Of Varun (by way of
> Martin Duerst <duerst@w3.org>)
> Sent: jeudi 12 fevrier 2004 06:25
> To: www-international@w3.org
> Subject: query - internationalization
>
> Hello,
>
> I have an application which stores data from varied sources which
> send data in differing encodings.
> However, coming from the application, its users want a consistent
> encoding format.
> since it is hard to convince diverse clients to change and send data
> in a uniform format, i would appreciate to receive pointers to the
> following:
>
> - a technique to detect the encoding format of an input stream, and
> - a technique to automatically convert various formats to a standard
>   encoding - say utf8.
>
> Thanks in advance for the help,
> Varun Mathur
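
As a rough illustration of the advice in the reply above, here is a minimal Python sketch, not a definitive implementation. It reads an explicit charset from a Content-Type value, falls back to the only two checks the reply considers trustworthy (pure US-ASCII, then strict UTF-8 validation), and otherwise signals that the raw bytes should be stored unchanged. The function names (charset_from_content_type, sniff_encoding, to_utf8) are hypothetical and chosen for this example, and the sketch assumes the whole payload is already in memory. It does not handle charsets implied by the MIME type or declared inside the data (for example, in an XML declaration).

```python
from email.message import Message
from typing import Optional


def charset_from_content_type(content_type: str) -> Optional[str]:
    """Return the explicit charset parameter of a Content-Type value, if any."""
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() lowercases the value and returns None when absent.
    return msg.get_content_charset()


def sniff_encoding(data: bytes) -> Optional[str]:
    """Return 'us-ascii' or 'utf-8' if the bytes validate as such, else None.

    Anything else would be an educated guess, so no guess is made.
    """
    try:
        data.decode("ascii")
        return "us-ascii"
    except UnicodeDecodeError:
        pass
    try:
        data.decode("utf-8")  # strict mode rejects malformed sequences
        return "utf-8"
    except UnicodeDecodeError:
        return None


def to_utf8(data: bytes, declared: Optional[str] = None) -> Optional[bytes]:
    """Transcode to UTF-8, or return None to mean 'store the raw bytes'."""
    encoding = declared or sniff_encoding(data)
    if encoding is None:
        return None
    try:
        return data.decode(encoding).encode("utf-8")
    except (LookupError, UnicodeDecodeError):
        # Unknown encoding name, or the bytes do not match the declared charset.
        return None


if __name__ == "__main__":
    declared = charset_from_content_type("text/html; charset=ISO-8859-1")
    print(declared)
    print(to_utf8(b"caf\xe9", declared))
    print(to_utf8(b"caf\xe9"))
```

In this sketch a None result from to_utf8 is the cue to keep the original bytes and ask the source for the encoding, rather than to guess.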
Received on Thursday, 12 February 2004 14:43:17 UTC