Re: question from Yoshito_Umaoka@lotus.co.jp on 2002-04-11 (www-international@w3.org from April to June 2002)

From: <Yoshito_Umaoka@lotus.co.jp>
Date: Thu, 11 Apr 2002 11:56:26 -0400
To: www-international@w3.org
Message-ID: <OFC293AEF0.7C6751C9-ON85256B98.004C8025-85256B98.0057962E@lotus.com>

Hi there,

>Hello. I need help in reading Korean (and Japanese) characters
>arriving at the server via HTTP. The data is in response to text input
>fields on an HTML form. I am receiving some characters that in the
>HTTP input stream show as things like %2354466;

I guess the HTTP input form data was generated by a certain
version of MS IE and it was actually "%26%2354466;".

When an HTML form has an explicit charset declaration
("charset=xxx" either in the HTTP Content-Type header or in
a META http-equiv for Content-Type), a Web user agent should
return the form input data encoded in the HTML form charset.
When the form is encoded in a Western charset such as
"ISO-8859-1", Korean (or Japanese) characters cannot be
represented in that charset, so the data cannot be sent back
to the server as-is.  There is no standard way defined for
this case.

I have seen that a certain version of MS Internet Explorer
applies "NCR" (numeric character reference) encoding to the
Korean (or Japanese) characters in this case.
"%26%2354466;" actually means "&#54466;", which represents
the code point 54466 (U+D4C2).
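
A minimal sketch of decoding such a value on the server,
assuming Python and assuming the form was served as
ISO-8859-1.  The helper name decode_ncr_form_value is
hypothetical and only illustrates the two steps (undo the
percent-encoding, then resolve decimal NCRs):

```python
import re
from urllib.parse import unquote_plus

# Matches decimal numeric character references such as "&#54466;".
_DECIMAL_NCR = re.compile(r"&#(\d+);")


def decode_ncr_form_value(raw_value: str) -> str:
    """Hypothetical helper: decode one urlencoded form value that may
    contain NCRs inserted by the browser (assumption: the form charset
    was ISO-8859-1)."""
    # Step 1: undo the percent-encoding ("%26%23" becomes "&#").
    text = unquote_plus(raw_value, encoding="iso-8859-1")

    # Step 2: replace each decimal NCR with the character it names.
    def to_char(match: re.Match) -> str:
        code_point = int(match.group(1))
        if code_point > 0x10FFFF:
            # Not a valid Unicode code point; leave the text untouched.
            return match.group(0)
        return chr(code_point)

    return _DECIMAL_NCR.sub(to_char, text)


decoded = decode_ncr_form_value("%26%2354466;")
print(hex(ord(decoded)))  # 0xd4c2
```

Note that a user who literally typed "&#54466;" into the
field would produce the same bytes, so this decoding cannot
tell the two cases apart.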

However, this is Microsoft's proprietary implementation and
other Web browsers do not work like this.

I think what you should do to get Korean (or Japanese) form
input is to serve the HTML form in a Korean (or Japanese)
charset.  If you need to design your Web application to
accept text input in any language, you may want to use UTF-8
as the HTML form encoding, so that you get the form input
data in UTF-8.
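
For comparison, a short sketch of reading a form body when
the form page was served as UTF-8, again assuming Python.
The field name "name" and the helper read_utf8_form are made
up for illustration:

```python
from urllib.parse import parse_qs


def read_utf8_form(body: str) -> dict:
    """Hypothetical helper: parse an application/x-www-form-urlencoded
    body, assuming the browser encoded the values as UTF-8."""
    # errors="strict" makes invalid UTF-8 raise instead of being
    # silently replaced, which helps detect a charset mismatch.
    return parse_qs(body, encoding="utf-8", errors="strict")


# U+D4C2 is sent as the three UTF-8 bytes ED 93 82.
fields = read_utf8_form("name=%ED%93%82")
print(hex(ord(fields["name"][0])))  # 0xd4c2
```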

The HTML4.01 recommendation explains how to process HTML
input form data in section 17.13.3.

http://www.w3.org/TR/html4/interact/forms.html#h-17.13.3

Note: According to the HTML4.01 recommendation, you can set
the "accept-charset" attribute on the <form> tag.  However,
I don't know of any Web browser that supports the attribute
properly.

Personally, I think the Content-Type header used for HTML
form data submission should describe the charset encoding
used for the form data set when the data is POSTed as
"application/x-www-form-urlencoded".  (Or should it use the
encoding defined in IRI?)

>I have found that the %23 is a # sign and that subtracting 65536 from
>the remaining 5 character number and then taking the ChrW of the
>result gives me the right ideograph for the many that I have tested.
>Is this all there is to it? Are there limits to the algorithm such
>that the
>subtract 65536 algorithm only works for a certain range of these
>characters and some other calculation is needed for others?

I do not know what "ChrW" is.  But it seems the function just
converts the decimal value to a Unicode character.

But I guess this does not work well on Web browsers other
than MS IE.  (When "multipart/form-data" is used for a form
data set, it does not work well even on the latest MS IE).

So I think you should select a proper HTML form charset
matching the input language (or use a universal charset like
UTF-8) to avoid the situation I described above.

-Yoshito Umaoka

Received on Thursday, 11 April 2002 11:57:03 UTC