Re: browsers sending numeric character references.

 "Michael Monaghan" asked:.

> I've been testing what happens when you input say, Japanese text into a form
> on an ISO-1 encoded page.
> ...
> Is encoding the query string using character entities covered in some IETF/W3
> spec?


My understanding is that, if you just declare the character encoding of the page
in which the form is written, there are 'hints' in the W3C HTML 4.01 spec that
the same encoding should be used when submitting the form data but, to my
reading, it is not mandated.

However, I have concluded that you should never rely on the page encoding to
infer the encoding used when submitting form data, because users can use their
browser menu to change the encoding in use (in MSIE: View - Encoding - ...).

When using POST, there should be a record of the encoding that the browser used
in one of the HTTP or MIME headers, but with GET there is no mechanism by which
the browser can communicate the encoding it used.
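
For example, a browser submitting a POST body in UTF-8 could, in principle,
record that fact in the request's Content-Type header (though not every browser
actually adds the charset parameter):

    Content-Type: application/x-www-form-urlencoded; charset=UTF-8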

However, if you specify a single encoding in an ACCEPT-CHARSET attribute in the
declaration of the form itself, on all the browsers I had access to that
encoding was used - with NCRs where necessary.  Usefully, this applied to forms
submitted using POST _and_ using GET (i.e. the URL query part was encoded using
the specified encoding). I now always use UTF-8 here, even if the page
containing the form is encoded in some other encoding, so that I _know_ how to
decode the NCRs (and all the other octets).
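
By way of illustration (this sketch is mine, not from the original exchange),
here is roughly how a server might recover the characters from such a field in
Java, assuming the %xx octets are UTF-8 and that any NCRs the browser fell back
to are decimal; the class and method names are just placeholders:

import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FormFieldDecoder {

    // Placeholder helper, not a standard API: decode one raw query-string
    // value submitted from a form declared with accept-charset='UTF-8'.
    static String decodeField(String raw) throws UnsupportedEncodingException {
        // 1. Undo the %xx URL-encoding, interpreting the octets as UTF-8.
        String text = URLDecoder.decode(raw, "UTF-8");

        // 2. Expand any decimal NCRs, e.g. &#26085; -> the character U+65E5.
        Matcher m = Pattern.compile("&#([0-9]+);").matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            int codePoint = Integer.parseInt(m.group(1));
            m.appendReplacement(out,
                Matcher.quoteReplacement(new String(Character.toChars(codePoint))));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // The same character arriving two ways:
        System.out.println(decodeField("%E6%97%A5"));       // UTF-8 octets for U+65E5
        System.out.println(decodeField("%26%2326085%3B"));  // the NCR &#26085;, URL-encoded
    }
}

Note that the NCRs themselves arrive URL-encoded (&#26085; is sent as
%26%2326085%3B), which is why the URL-decoding has to happen before the NCRs
are expanded.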

The <form accept-charset='UTF-8' ... > approach appears to override the user's
"View - Encoding" setting and seems to provide the _only_ mechanism by which you
can be sure how the characters in the form data have been encoded, for both POST
and GET.

This worked on all the 6th/7th generation Win32 and Linux browsers I had access
to.

I was not able to test this on the Mac browsers; it would be interesting to know
if this works on them as well.

Chris Haynes

Received on Saturday, 3 April 2004 03:49:33 UTC