Re: charset issues

In regard to character sets I'd like to see the following happen:

In the long run, all browsers would be able to deal with a document tagged
with charset utf-8, along with a limited, well-known set of popular
charsets (listed in a new RFC-yet-to-be-written). In this world the browser
would not send Accept-Charset, because it can accept anything and deal
with it in some reasonable way.

In the shorter run, browsers that can deal with utf-8 would send
Accept-Charset with two charsets listed: 1) utf-8, 2) a character set that
the browser can deal with effectively. This should be interpreted as "I can
deal with utf-8 appropriately; lacking that, I like charset 2) the best; but
if all else fails, send me any encoding and I'll do my best." Of course, all
documents should be tagged with the actual charset.
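
To make the intended reading of the two-entry Accept-Charset concrete, here
is a minimal sketch in Python of how a server might pick a charset. The
function name and the "available" list are my own assumptions for
illustration, and q-values are ignored; listed order is simply taken as
preference order, as described above.

```python
def choose_charset(accept_charset, available):
    """Pick a charset for a response.

    accept_charset: raw Accept-Charset header value, or None if absent.
    available: charsets the server can produce for this document, in the
               server's own order of preference (hypothetical input).
    """
    if not accept_charset:
        # No header at all: treat the browser as able to cope with anything.
        return available[0]

    # Strip any parameters and normalise case; order in the header is
    # treated as preference order (an assumption of this sketch).
    wanted = [item.split(";")[0].strip().lower()
              for item in accept_charset.split(",") if item.strip()]
    have = {name.lower(): name for name in available}

    for name in wanted:
        if name in have:
            return have[name]

    # "If all else fails, send me any encoding and I'll do my best."
    return available[0]
```

Whatever is chosen, the response would still be tagged with the actual
charset, for example "Content-Type: text/html; charset=iso-2022-jp".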

In regard to POSTed data, there are good solutions and browser/server
vendors need to agree on one (or more) as soon as possible.

Lacking any new mechanisms, servers should assume the form data is in the
same encoding as the form document. Maybe this is the state of the art
anyway. In any case, it is inefficient: a server that serves documents in
different encodings has to go back to the original form document to learn
the encoding of each submission. It would be better to tag the returned
data so that the server does not need to look at the original document.
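
As a rough illustration of the "same encoding as the form document" rule
and the bookkeeping it implies, here is a sketch in Python. The table of
form paths and charsets is hypothetical; a real server would have to get
this information from wherever it keeps its documents.

```python
from urllib.parse import parse_qs

# Hypothetical table: the charset each form document was served in.
# The server has to consult something like this for every submission.
FORM_CHARSETS = {
    "/forms/order.html": "iso-8859-1",
    "/forms/order-ja.html": "shift_jis",
}

def decode_form(form_path, body):
    """Decode an untagged application/x-www-form-urlencoded body by
    assuming it uses the charset of the form document it came from."""
    charset = FORM_CHARSETS[form_path]
    return parse_qs(body, encoding=charset)
```

The same bytes decode differently depending on which form they came from,
which is why untagged data forces the lookup.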

I've read about or imagined several ways this can be done:

1) Use the mechanism proposed by Larry Masinter in multipart/form-data.

2) Use "application/x-www-form-urlencoded ; charset=<one of the well-known
encodings>"

3) Create a new Media Type that is identical to
application/x-www-form-urlencoded but allows a charset parameter

4) Use the "charset field" approach (a charset is sent back as the value
for a special field that is hidden from the user)

5) Send an Accept-Charset on the POST

All of these solutions pose difficulties in regard to compatibility. In the
longer run 1) seems like a very flexible solution. In the shorter run, my
hope would be that 2), 4), or 5) would work, in that servers would ignore
the parameter/field if they didn't know how to deal with it. If the server
did know about the parameter, it would also be able to deal with all the
popular charsets in the RFC-yet-to-be-written. 3) seems like a bad
idea--nobody needs a new Media Type.
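
Here is a sketch in Python of how a server might honor 2) and 4) while
falling back to the form document's charset when the tag is missing or
unknown. The hidden-field name "_charset_" and the function names are
assumptions made up for this example, not something anyone has agreed on.

```python
import codecs
from email.message import Message
from urllib.parse import parse_qs

CHARSET_FIELD = "_charset_"  # hypothetical hidden-field name for option 4

def known(charset):
    """True if this server has a decoder for the named charset."""
    try:
        codecs.lookup(charset)
        return True
    except LookupError:
        return False

def form_charset(content_type, body, form_document_charset):
    """Work out which charset the POSTed form data is in."""
    # Option 2: a charset parameter on the media type.
    msg = Message()
    msg["Content-Type"] = content_type
    declared = msg.get_param("charset")

    if declared is None:
        # Option 4: a hidden field carrying the charset name. Field names
        # and charset names are ASCII, so a first pass in iso-8859-1
        # (which maps every byte) is enough to read it.
        fields = parse_qs(body, encoding="iso-8859-1")
        declared = fields.get(CHARSET_FIELD, [None])[0]

    if declared and known(declared):
        return declared.lower()
    # Tag missing or not understood: behave as an untagged submission.
    return form_document_charset

def decode_post(content_type, body, form_document_charset):
    """Return (charset used, decoded fields without the hidden field)."""
    charset = form_charset(content_type, body, form_document_charset)
    fields = parse_qs(body, encoding=charset)
    fields.pop(CHARSET_FIELD, None)
    return charset, fields
```

A server that knows nothing about either tag would see an unrecognized
parameter or one extra field, which is the compatibility property hoped
for above.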

Servers should be updated to be able to deal with all the encodings listed
in the RFC-yet-to-be-written. (Until it is written, that means all the
charsets in the HTTP 1.1 appendix.) I'm against the server sending
Accept-Charset; instead, servers should be updated to handle the
encodings.

As a matter of style, I like 2) better than 4) or 5) because it seems more
consistent, but they are all functionally identical. I would hope we could
agree to use only one, but if quick agreement proves impractical it would
be best to support 2), 4), and 5).

In any case, properly-tagged utf-8 should be used if 8859-1 can't support
the characters needed. I think this is the best long-term solution.
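
The "8859-1 if it fits, otherwise tagged utf-8" rule can be sketched in a
few lines of Python. The function name is made up for illustration; the
returned charset name is what would go in the tag.

```python
def encode_document(text):
    """Return (charset tag, encoded bytes) for a document or form value."""
    try:
        # Use 8859-1 when it can represent every character needed.
        return "iso-8859-1", text.encode("iso-8859-1")
    except UnicodeEncodeError:
        # Otherwise fall back to utf-8, and tag it as such.
        return "utf-8", text.encode("utf-8")
```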

Received on Friday, 20 December 1996 14:47:12 UTC