- From: <Yoshito_Umaoka@lotus.co.jp>
- Date: Thu, 11 Apr 2002 11:56:26 -0400
- To: www-international@w3.org
- Message-ID: <OFC293AEF0.7C6751C9-ON85256B98.004C8025-85256B98.0057962E@lotus.com>
Hi there,

>Hello. I need help in reading Korean (and Japanese) characters
>arriving at the server via HTTP. The data is in response to text input
>fields on an HTML form. I am receiving some characters that in the
>HTTP input stream show as things like %2354466;

I guess the form data was generated by a certain version of MS IE, and that it was actually "%26%2354466;".

When an HTML form has an explicit charset declaration ("charset=xxx", either in the HTTP Content-Type header or in a META http-equiv declaration for Content-Type), a Web user agent should return the form input data encoded in the charset of the HTML form. When the form is encoded in a Western charset such as "ISO-8859-1", Korean (or Japanese) characters cannot be represented in that charset, and there is no standard way defined for sending them back to the server. In my experience, certain versions of MS Internet Explorer apply "NCR" (numeric character reference) encoding to Korean (or Japanese) characters in this case. "%26%2354466;" actually means "&#54466;", which represents code point 54466 (U+D4C2). However, this is Microsoft's proprietary implementation, and other Web browsers do not work like this.

I think what you should do to get Korean (or Japanese) form input is to serve the HTML form in a Korean (or Japanese) charset. If you need to design your Web application to accept text input in any language, you may want to use UTF-8 as the HTML form encoding, so that you get the form input data in UTF-8.

The HTML 4.01 recommendation explains how to process HTML form data in section 17.13.3.

http://www.w3.org/TR/html4/interact/forms.html#h-17.13.3

Note: In the HTML 4.01 recommendation, you can set the "accept-charset" attribute on the <form> tag. However, I don't know of any Web browser software which supports the attribute properly. Personally, I think the Content-Type header used for HTML form data submission should declare the charset encoding of the form data set when the data is POSTed as "application/x-www-form-urlencoded". (Or should it use the encoding defined for IRIs?)
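As a rough illustration of the two cases above, here is a minimal Python sketch (not a definitive implementation). The function name decode_ncr_form_value and the field name "name" are hypothetical, chosen only for this example. The first part percent-decodes a value such as "%26%2354466;" and expands decimal NCRs into characters; the second part shows that with a UTF-8 form no NCR step is needed. Note that this kind of expansion cannot tell a browser-generated NCR from a user who literally typed "&#54466;" into the field, which is one more reason to prefer a UTF-8 form.

```python
import re
from urllib.parse import parse_qs, unquote, urlencode

# Matches a decimal numeric character reference such as "&#54466;".
_NCR = re.compile(r"&#([0-9]+);")


def decode_ncr_form_value(raw: str, charset: str = "iso-8859-1") -> str:
    """Percent-decode a form value, then expand decimal NCRs.

    Hypothetical helper: assumes the form was served in `charset` and that
    the browser replaced unrepresentable characters with NCRs.
    """
    text = unquote(raw, encoding=charset)

    def expand(match: re.Match) -> str:
        code_point = int(match.group(1))
        # Leave out-of-range references untouched instead of raising.
        return chr(code_point) if code_point <= 0x10FFFF else match.group(0)

    return _NCR.sub(expand, text)


# Case 1: ISO-8859-1 form, browser sent an NCR for U+D4C2.
decoded = decode_ncr_form_value("%26%2354466;")
print(hex(ord(decoded)))  # 0xd4c2

# Case 2: UTF-8 form, the character arrives as percent-encoded UTF-8 bytes.
body = urlencode({"name": "\ud4c2"}, encoding="utf-8")
print(body)  # name=%ED%93%82
fields = parse_qs(body, encoding="utf-8")
print(fields["name"][0] == decoded)  # True
```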
>I have found that the %23 is a # sign and that subtracting 65536 from
>the remaining 5 character number and then taking the ChrW of the
>result gives me the right ideograph for the many that I have tested.
>Is this all there is to it? Are there limits to the algorithm such that the
>subtract 65536 algorithm only works for a certain range of these
>characters and some other calculation is needed for others?

I do not know what "ChrW" is, but it seems that the function just converts a decimal value to a Unicode character. I guess this approach does not work well with Web browsers other than MS IE. (When "multipart/form-data" is used for a form data set, it does not work well even with the latest MS IE.) So I think you should select a proper HTML form charset matching the input language (or use a universal charset like UTF-8) to avoid the situation I described above.

-Yoshito Umaoka
Received on Thursday, 11 April 2002 11:57:03 UTC