- From: Chris Haynes <chris@harvington.org.uk>
- Date: Sat, 26 Jul 2003 09:14:13 +0100
- To: <www-international@w3.org>
At Saturday, July 26, 2003 12:29 AM "Shigemichi Yazawa" wrote:
>
> At Fri, 25 Jul 2003 16:11:17 -0400 (EDT),
> Jungshik Shin wrote:
> >
> > For a while (before 1.0), Mozilla added 'charset' parameter to
> > Content-Type header with application/x-www-form-urlencoded, but
> > it broke a lot of CGI programs and was removed later.
> > (see http://bugzilla.mozilla.org/show_bug.cgi?id=18643)
>
> This is a very good example that shows that any standard must be
> created with i18n in mind. It's very hard to change it after it's out
> to the public.
>
> > If you specify ENCTYPE="multipart/form-data" in FORM, you'll
> > get charset parameter specified in each part of 'multipart/form-data'
> > if necessary.
> > See http://www.w3.org/TR/html401/interact/forms.html#h-17.13.4.2
>
> I tried this, but neither Mozilla 1.0 nor IE 6.0 add content-type
> header in any part. I see something like this.
>
> -----------------------------75689853717345751981973594324
> Content-Disposition: form-data; name="i18n"
>
> test string
> -----------------------------75689853717345751981973594324--
>
> If browsers support charset parameter in multipart/form-data, it would
> disambiguate the character encoding, although getting input values
> become a little cumbersome (you can't use the convenient
> getParameter() method).
>
> -------------------
> Shigemichi Yazawa

I was the inadvertent initiator of this thread, and I was mortified to
find that the advice I gave was wrong - I had taken it on trust from
someone else that the Content-Encoding field is set by user agents when
sending a POST which includes encoded content (why isn't it?).

My own tests in the last couple of days have confirmed that I was wrong.

The discussion contained in the mozilla bug report referenced above was
helpful.
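As an illustration of the multipart observation quoted above, here is a minimal Python sketch that checks whether a single part declares a charset. It assumes the part has already been split out of the request body at the boundary; `part_charset`, `observed` and `labelled` are names invented for this example, and the second part is a made-up counter-example, not something either browser was seen to send.

```python
from email import message_from_bytes
from typing import Optional


def part_charset(raw_part: bytes) -> Optional[str]:
    """Return the charset declared in one multipart part's headers, or None."""
    # A part is a block of MIME-style headers, a blank line, then the value,
    # so the standard library's email parser can read the headers.
    message = message_from_bytes(raw_part)
    # get_content_charset() returns None when there is no Content-Type
    # header, or when that header carries no charset parameter.
    return message.get_content_charset()


# The part as reported in the quoted message: no Content-Type header at all.
observed = b'Content-Disposition: form-data; name="i18n"\r\n\r\ntest string'

# Hypothetical part that does label its encoding (not observed in the tests).
labelled = (
    b'Content-Disposition: form-data; name="i18n"\r\n'
    b'Content-Type: text/plain; charset=UTF-8\r\n'
    b'\r\n'
    b'test string'
)

print(part_charset(observed))
print(part_charset(labelled))
```

With no charset on the part, the receiver is in the same position as with the urlencoded format: it has to know the encoding by some other means.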
Microsoft's (and then Mozilla's) use of the '_charset_' parameter to
carry the missing character encoding information from user agent to
server looked as if it could be the 'missing link' in what I had
thought was happening.

In brief, Microsoft's approach to this problem appears to be the
following:

If a hidden field <input type='hidden' name='_charset_'> is included in
the form, and the form is to be submitted using POST, the user agent
will insert the indicator of the character encoding being used as the
value of this parameter.

However, this morning I have tested '_charset_' support on three
browsers running under W2K:

  - MSIE 6.0 does support this feature (i.e. the query string includes
    the name-value pair "_charset_=UTF-8")
  - Netscape Navigator 6.2.2 does not
  - Opera 7.11 does not

The _charset_ mechanism cannot be used by Servlets running in a
standard Servlet container, as, by the time parameter values are made
available to the Servlet, the decision on the character decoding to
apply has already been committed by the container.

I conclude that the '_charset_' mechanism, although ingenious, is a
non-standard, proprietary distraction.

The only standards-based way of being _sure_ what character encoding
has been applied to form data appears to be to use

<form action=... method='post' accept-charset='UTF-8'>

(or whatever the page author's chosen character encoding is).

Although the server is still not given any positive confirmation of the
encoding used, my initial tests on the above three browsers suggest
that:

1) They do indeed apply the UTF-8 encoding,
2) Attempts by the Browser's user to change the encoding using the menu
   control are ignored.

Interestingly, changing from 'post' to 'get' in MSIE6 re-enables the
user's control over the encoding used (i.e. over the values now
transmitted in the URI query string). I have not tested this with the
other two browsers.
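To make the '_charset_' idea concrete, here is a rough Python sketch of how a receiver with access to the raw, still percent-encoded body could use the field. This is only an illustration under stated assumptions: the function name `decode_with_charset_field` and the ISO-8859-1 fallback are invented for the example, and it presumes the raw body is available, which is exactly what a Servlet does not get once the container has decoded the parameters.

```python
import codecs
from urllib.parse import parse_qsl


def decode_with_charset_field(body: str, fallback: str = "ISO-8859-1") -> dict:
    """Decode a urlencoded form body using its own '_charset_' field, if any."""
    # First pass: the field name '_charset_' and an encoding label are plain
    # ASCII, so a single-byte decoding is enough to locate them safely.
    first_pass = dict(parse_qsl(body, keep_blank_values=True, encoding="latin-1"))
    charset = first_pass.get("_charset_") or fallback
    try:
        codecs.lookup(charset)  # reject labels Python does not recognise
    except LookupError:
        charset = fallback
    # Second pass: decode every name and value with the chosen encoding.
    # Note that dict() keeps only the last value of a repeated field.
    return dict(
        parse_qsl(body, keep_blank_values=True, encoding=charset, errors="replace")
    )


# Body as a browser supporting the feature might send it (value is UTF-8).
fields = decode_with_charset_field("_charset_=UTF-8&name=%E6%97%A5%E6%9C%AC")
print(fields["name"])
```

A browser that leaves the hidden field empty gives the sketch nothing to go on, so it silently falls back to the assumed default.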
Also, I have confirmed the original observation that the encoding used
in 'post' _is_ affected by the user's menu selections if the
'accept-charset' is not present (in MSIE6).

The only other reliable transfer mechanism available would appear to be
the ENCTYPE="multipart/form-data" method being discussed in this same
thread, but this format is not decoded by standard Servlet containers,
so the convenient HttpServletRequest.getParameter() Servlet API could
not be used with this mechanism.

My tentative conclusion is that the only reliable, standards-based way
of receiving international form data in query-string format from user
agents supporting HTML 4+ is to:

1) Use POST
2) Specify one (and only one) character encoding in the form's
   'accept-charset' attribute
3) Use this character encoding to decode the request at the server.

Can anyone see any flaws in this conclusion?

Chris Haynes
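Step 3 of the conclusion above might look like the following on the server side. This is a minimal Python sketch, not a Servlet implementation: `FORM_CHARSET` and `decode_form` are names made up for the example, and it assumes the form was served with accept-charset='UTF-8' and that the raw request body is available as text.

```python
from urllib.parse import parse_qsl

# Assumption: must match the single value in the form's accept-charset.
FORM_CHARSET = "UTF-8"


def decode_form(body: str, charset: str = FORM_CHARSET) -> dict:
    """Decode an application/x-www-form-urlencoded body with a fixed charset."""
    # errors="strict" makes byte sequences that are invalid in the agreed
    # charset raise UnicodeDecodeError instead of being silently garbled.
    pairs = parse_qsl(body, keep_blank_values=True, encoding=charset, errors="strict")
    # dict() keeps only the last value of a repeated field name.
    return dict(pairs)


print(decode_form("city=M%C3%BCnchen"))

try:
    # The same word sent as ISO-8859-1 bytes is not valid UTF-8.
    decode_form("city=M%FCnchen")
except UnicodeDecodeError as error:
    print("rejected:", error.reason)
```

Strict decoding is a design choice here: since the server gets no positive confirmation of the encoding used, a decoding failure is the only signal that a user agent ignored 'accept-charset'. It will not catch every mismatch, because some byte sequences are valid in more than one encoding.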
Received on Saturday, 26 July 2003 04:23:15 UTC