Re: specification for form submission from Martin Duerst on 2002-02-19 (www-international@w3.org from January to March 2002)

From: Martin Duerst <duerst@w3.org>
Date: Wed, 20 Feb 2002 08:47:14 +0900
To: ftang@netscape.com (Yung-Fong Tang), www-international <www-international@w3.org>, Katsuhiko Momoi <momoi@netscape.com>, Bob Jung <bobj@netscape.com>
Message-Id: <4.2.0.58.J.20020220083128.00a87eb8@localhost>

Hello Frank,

At 08:50 02/02/19 -0800, Yung-Fong Tang wrote:
>I wonder whether there is a W3C specification that addresses the following issue:

In summary, no, but XForms should provide it. Please review
the XForms WD, at http://www.w3.org/TR/xforms/, in last call
until the end of this week.

>Background:
>All HTML can be encoded with a charset, labeled either by the HTTP header 
>or by an HTML meta tag. When the browser submits the form data to the 
>server, for backward compatibility reasons, we should send the data in the 
>URL-escaped form of the form charset.

Yes, this is what the spec says, and what (reasonably recent) browsers do.
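
To make the baseline behavior concrete, here is a minimal sketch in Python
(my choice of language for illustration; the function name encode_form is
hypothetical, and this is not taken from any spec or browser source). It
encodes each field value in the page charset and then %-escapes the bytes:

```python
from urllib.parse import urlencode

def encode_form(fields, page_charset):
    """Serialize form fields the way a browser would for a page in page_charset.

    Hypothetical helper for illustration: each value is encoded to bytes in
    page_charset, then the non-ASCII and reserved bytes are %-escaped.
    Raises UnicodeEncodeError if a value does not fit the charset.
    """
    return urlencode(fields, encoding=page_charset)

# U+00FC (u with diaeresis) is one byte in iso-8859-1, two bytes in UTF-8.
print(encode_form({"name": "D\u00fcrst"}, "iso-8859-1"))  # name=D%FCrst
print(encode_form({"name": "D\u00fcrst"}, "utf-8"))       # name=D%C3%BCrst
```

The same text yields different bytes on the wire depending on the charset,
which is why the server has to know which charset was used.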

>However, since it is possible to put any Unicode data into the text field, 
>what should the browser do when the data it needs to submit cannot be 
>converted to the charset of the form HTML?
>
>I observed/heard about the following behavior:
>1. prohibit the input, copy and paste of any characters which cannot be 
>converted to the charset. Netscape 4.x did that. So there is no way to put 
>Korean characters into an ISO-8859-1 form. In this case, what you see is 
>what you submit.

This is the most straightforward. Presumably, the CGI (or whatever) is
working in iso-8859-1, and sending it anything else will confuse it.
This (plus 3 maybe) is probably what I would do.

>2. replace characters that cannot be submitted with '?' (N6.2 does that)

Not such a good idea; the user thinks she submitted actual
characters, but they didn't get across. Imagine ordering
something, and typing in your address, and being billed, but
never getting anything because the post office cannot route
a package to ?????.
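
The loss is easy to reproduce. A small sketch in Python (the helper name
lossy_encode and the sample address are made up for illustration; I am
not claiming this is how N6.2 is implemented):

```python
def lossy_encode(text, page_charset):
    """Encode text, silently replacing unencodable characters with '?'.

    Hypothetical helper that mimics the replace-with-'?' behavior.
    """
    return text.encode(page_charset, errors="replace")

# Two Hangul syllables followed by a house number.
address = "\uc11c\uc6b8 123"
print(lossy_encode(address, "iso-8859-1"))  # b'?? 123'
```

Once the bytes are on the wire, the server has no way to recover which
characters the question marks stood for.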

>3. if there is an ACCEPT_CHARSET specified in the HTML form, try to convert 
>to a different charset. (HTML 4.x says something about this.) However, it 
>will be very bad if one value is in one charset and the other is in a 
>different one.

The original assumption for ACCEPT_CHARSET was that all browsers would
use it, so the server would always only see a single encoding. However,
uptake on ACCEPT_CHARSET was very slow; I actually don't know for certain
of any browser that implements it. The only way I would suggest
it might be used now is:
- Only use it with a value of UTF-8.
- Only use it if you have server logic that can distinguish between
   UTF-8 and the encoding of the page.
On the browser side, implementing ACCEPT_CHARSET is not a bad idea,
because if the form is using it, you can safely assume that the
server can deal with what it's asking for.
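
As a rough sketch of what the browser side could look like (Python, with
a hypothetical function choose_submit_charset; this is one possible
reading, not what HTML 4 mandates): try the page charset first, then the
charsets listed by the form, and always pick a single charset for all
fields so that the server never sees mixed encodings.

```python
def choose_submit_charset(fields, page_charset, accept_charset=None):
    """Return one charset that can encode every field value, or None.

    Hypothetical helper. accept_charset is the raw attribute value, a
    space- and/or comma-separated list of charset names.
    """
    candidates = [page_charset]
    if accept_charset:
        candidates += accept_charset.replace(",", " ").split()
    for charset in candidates:
        try:
            for value in fields.values():
                value.encode(charset)
        except (UnicodeEncodeError, LookupError):
            # Either a value does not fit, or the charset name is unknown.
            continue
        return charset
    return None

fields = {"name": "abc", "city": "\uc11c\uc6b8"}
print(choose_submit_charset(fields, "iso-8859-1", "UTF-8"))  # UTF-8
print(choose_submit_charset(fields, "iso-8859-1"))           # None
```

A result of None means no single charset works, in which case the
browser is back to choosing among the other options in this thread.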

>4. try to convert to UTF-8 if that happens. Same issue as above: we don't 
>want to see one value in one charset and the other one in a different one.

Well, UTF-8 can be distinguished from other encodings quite easily,
but you have no guarantee that the server will be able to deal with
UTF-8, so it's better not to send UTF-8.
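
For a server that does want to cope, the distinction can be sketched
like this (Python; decode_field is a hypothetical name, and the approach
is a heuristic, not a guarantee): attempt a strict UTF-8 decode and fall
back to the page charset if the bytes are not well-formed UTF-8.

```python
from urllib.parse import unquote_to_bytes

def decode_field(raw_value, page_charset):
    """Decode one %-escaped form value, guessing UTF-8 versus page charset.

    Hypothetical helper. Returns (text, charset_used). Pure ASCII decodes
    identically either way; a few legacy byte sequences happen to be valid
    UTF-8 too, so this remains a heuristic.
    """
    data = unquote_to_bytes(raw_value.replace("+", " "))
    try:
        return data.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        return data.decode(page_charset), page_charset

print(decode_field("D%C3%BCrst", "iso-8859-1"))  # ('Dürst', 'utf-8')
print(decode_field("D%FCrst", "iso-8859-1"))     # ('Dürst', 'iso-8859-1')
```

This works because a byte such as 0xFC followed by an ASCII letter is
never well-formed UTF-8, while typical non-ASCII text in a legacy
encoding rarely forms valid UTF-8 sequences by accident.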

>5. convert it to the form charset, and for those characters that cannot be 
>converted, convert them to an NCR (&#12345;) and then %-escape it (IE6 on 
>my WinXP does that)

So that would be %26%2312345%3B, or something else? How do you know that
the server is able to deal with this?
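
Assuming the escaping is done the obvious way, a sketch in Python (the
function name ncr_then_escape is hypothetical, and I have not checked
that IE6 produces exactly these bytes) shows both the result and the
ambiguity it creates:

```python
from urllib.parse import quote

def ncr_then_escape(text, page_charset):
    """Replace unencodable characters with &#NNNN; and %-escape the result.

    Hypothetical helper for illustration only.
    """
    data = text.encode(page_charset, errors="xmlcharrefreplace")
    return quote(data, safe="")

# The character U+3039 (decimal 12345), which iso-8859-1 cannot encode:
print(ncr_then_escape(chr(12345), "iso-8859-1"))  # %26%2312345%3B
# The eight literal characters "&#12345;" typed by the user:
print(ncr_then_escape("&#12345;", "iso-8859-1"))  # %26%2312345%3B
```

Both inputs produce the same submission, so the server cannot tell a
converted character from text that the user really typed that way.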

Regards,    Martin.

Received on Tuesday, 19 February 2002 19:02:22 UTC