Re: what should the charset be in the response to the server from Chris Haynes on 2003-07-26 (www-international@w3.org from July to September 2003)

From: Chris Haynes <chris@harvington.org.uk>
Date: Sat, 26 Jul 2003 09:14:13 +0100
To: <www-international@w3.org>
Message-ID: <006301c3534d$e66f8000$0200000a@ringo>
At Saturday, July 26, 2003 12:29 AM "Shigemichi Yazawa" wrote:

>
> At Fri, 25 Jul 2003 16:11:17 -0400 (EDT),
> Jungshik Shin wrote:
> >
> >   For a while (before 1.0), Mozilla added 'charset' parameter to
> > Content-Type header  with application/x-www-form-urlencoded, but
> > it broke a lot of CGI programs and was removed later.
> > (see http://bugzilla.mozilla.org/show_bug.cgi?id=18643)
>
> This is a very good example that shows that any standard must be
> created with i18n in mind. It's very hard to change it after it's
> out to the public.
>
> >   If you specify ENCTYPE="multipart/form-data" in FORM, you'll
> > get charset parameter specified in each part of
> > 'multipart/form-data' if necessary.
> > See http://www.w3.org/TR/html401/interact/forms.html#h-17.13.4.2
>
> I tried this, but neither Mozilla 1.0 nor IE 6.0 add content-type
> header in any part. I see something like this.
>
> -----------------------------75689853717345751981973594324
> Content-Disposition: form-data; name="i18n"
>
> test string
> -----------------------------75689853717345751981973594324--
>
> If browsers supported the charset parameter in multipart/form-data,
> it would disambiguate the character encoding, although getting input
> values becomes a little cumbersome (you can't use the convenient
> getParameter() method).
>
> -------------------
> Shigemichi Yazawa


I was the inadvertent initiator of this thread, and I was mortified to
find that the advice I gave was wrong - I had taken it on trust from
someone else that the Content-Encoding field is set by user agents
when sending a POST which includes encoded content (why isn't it?).

My own tests in the last couple of days have confirmed that I was
wrong.

The discussion contained in the Mozilla bug report referenced above
was helpful. Microsoft's (and then Mozilla's) use of the '_charset_'
parameter to carry the missing character encoding information from
user agent to server looked as if it could be the 'missing link' in
what I had thought was happening.

In brief, Microsoft's approach to this problem appears to be the
following: If a hidden field

    <input type='hidden' name='_charset_'>

is included in the form, and the form is to be submitted using POST,
the user agent will insert the indicator of the character encoding
being used as the value of this parameter.
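
As a rough illustration of how a server could consume such a field, here
is a minimal sketch in Python (not Servlet code). It assumes the server
can see the raw, still percent-encoded request body; the function name
decode_form and the UTF-8 fallback are hypothetical choices for this
example, not part of any browser or container API.

```python
import codecs
from urllib.parse import unquote_to_bytes


def decode_form(raw_body, fallback="utf-8"):
    """Decode an application/x-www-form-urlencoded body, honouring _charset_.

    Hypothetical helper: the name and the fallback charset are assumptions.
    """
    # First pass: undo '+' and %XX escapes but keep everything as bytes,
    # because the character encoding is not known yet.
    pairs = []
    for chunk in raw_body.split(b"&"):
        if not chunk:
            continue
        name, _, value = chunk.partition(b"=")
        pairs.append((unquote_to_bytes(name.replace(b"+", b" ")),
                      unquote_to_bytes(value.replace(b"+", b" "))))

    # Look for the hidden field that the user agent may have filled in.
    charset = fallback
    for name, value in pairs:
        if name == b"_charset_" and value:
            candidate = value.decode("ascii", errors="replace")
            try:
                codecs.lookup(candidate)  # raises LookupError if unknown
                charset = candidate
            except LookupError:
                pass  # unknown label: keep the fallback

    # Second pass: now that a charset is chosen, decode names and values.
    return {n.decode(charset): v.decode(charset) for n, v in pairs}
```

The two-pass structure is the point of the sketch: the bytes have to be
kept undecoded until the '_charset_' value has been read.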

However, this morning I have tested '_charset_' support on three
browsers running under W2K.

MSIE 6.0 does support this feature (i.e. the query string includes the
name-value pair "_charset_=UTF-8").
Netscape Navigator 6.2.2 does not.
Opera 7.11 does not.

The _charset_ mechanism cannot be used by Servlets running in a
standard Servlet container, as, by the time parameter values are made
available to the Servlet, the decision on the character decoding to
apply has already been committed by the container.

I conclude that the '_charset_' mechanism, although ingenious, is a
non-standard, proprietary distraction.

The only standards-based way of being _sure_ what character encoding
has been applied to form data appears to be to use

    <form action=... method='post' accept-charset='UTF-8'>

(or whatever the page author's chosen character encoding is).

Although the server is still not given any positive confirmation of
the encoding used, my initial tests on the above three browsers
suggest that:

1) They do indeed apply the UTF-8 encoding,

2) Attempts by the browser's user to change the encoding using the
menu control are ignored.

Interestingly, changing from 'post' to 'get' in MSIE6 re-enables the
user's control over the encoding used (i.e. over the values now
transmitted in the URI query string). I have not tested this with the
other two browsers.

Also, I have confirmed the original observation that the encoding used
in 'post' _is_ affected by the user's menu selections if the
'accept-charset' attribute is not present (in MSIE6).


The only other reliable transfer mechanism available would appear to
be the ENCTYPE="multipart/form-data" method being discussed in this
same thread, but this format is not decoded by standard Servlet
containers, so the convenient HttpServletRequest.getParameter() Servlet
API could not be used with this mechanism.
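
For illustration only, the following Python sketch shows how a
multipart/form-data body could be decoded part by part, using each
part's charset parameter when one is present. It relies on the standard
library's email parser; the function name parse_multipart_form and the
UTF-8 default are assumptions made for this example. As noted earlier in
the thread, the browsers tested did not send a per-part Content-Type, so
in practice the default would be what gets used.

```python
from email import message_from_bytes


def parse_multipart_form(content_type, body, default_charset="utf-8"):
    """Return {field name: decoded text} for a multipart/form-data body.

    Hypothetical helper; content_type is the request's Content-Type header
    value, including its boundary parameter.
    """
    # The email parser expects headers followed by a blank line and the body,
    # so prepend the request's Content-Type as a pseudo-header.
    header = b"Content-Type: " + content_type.encode("ascii") + b"\r\n\r\n"
    message = message_from_bytes(header + body)
    if not message.is_multipart():
        raise ValueError("body is not a well-formed multipart message")

    fields = {}
    for part in message.get_payload():
        name = part.get_param("name", header="content-disposition")
        # Use the part's own charset if it declares one, else the default.
        charset = part.get_content_charset(default_charset)
        fields[name] = part.get_payload(decode=True).decode(charset)
    return fields
```

This sketch treats every part as text; file uploads would need separate
handling.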


My tentative conclusion is that the only reliable, standards-based
way of receiving international form data in query-string format from
user agents supporting HTML 4+ is to:

1) Use POST
2) Specify one (and only one) character encoding in the form's
'accept-charset' attribute
3) Use this character encoding to decode the request at the server.
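
Step 3 above might look something like the following minimal Python
sketch. It assumes the form was served with accept-charset='UTF-8' and
that the server has the raw POST body as bytes; the names FORM_CHARSET
and decode_post_body are hypothetical. Strict error handling is a design
choice here, so that a body sent in some other encoding is rejected
rather than silently garbled.

```python
from urllib.parse import parse_qs

# Assumption: must match the single value given in the form's accept-charset.
FORM_CHARSET = "utf-8"


def decode_post_body(raw_body):
    """Decode an application/x-www-form-urlencoded POST body.

    Returns a dict mapping each field name to a list of its values.
    Raises UnicodeDecodeError if the bytes are not valid FORM_CHARSET.
    """
    # A correctly percent-encoded body contains only ASCII on the wire.
    text = raw_body.decode("ascii")
    return parse_qs(text, keep_blank_values=True,
                    encoding=FORM_CHARSET, errors="strict")
```

For example, decode_post_body(b"city=Z%C3%BCrich") yields the value
"Zürich", whereas the ISO-8859-1 form b"city=Z%FCrich" raises
UnicodeDecodeError.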

Can anyone see any flaws in this conclusion?

Chris Haynes
Received on Saturday, 26 July 2003 04:23:15 UTC