Re: Using unicode or MBCS characters in forms from Gavin Nicol on 1996-06-21 (www-international@w3.org from April to June 1996)

From: Gavin Nicol <gtn@ebt.com>
Date: Fri, 21 Jun 1996 02:15:46 GMT
To: erik@netscape.com
Cc: JMHX.DSKPO33C@dskbgw1.itg.ti.com, www-international@w3.org
Message-Id: <199606210215.CAA02453@wiley.EBT.COM>

>We do send Accept-Language, if the user sets it.

I use 2.01 under Unix. How do I set it?
 
>Assuming "POSTed data" refers to forms, I haven't seen very many forms
>asking for an ENCTYPE of "multipart/form-data". The most common enctype
>appears to be "application/x-www-form-urlencoded", the default. I
>suppose we could add "; charset=xxx" to the content-type header, though
>things appear to work as they are (i.e. the client sends the results
>back in the same encoding as the original form). 

It would be *very* useful to have the client add a charset parameter
(see below). 
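
To make the request concrete, here is a minimal sketch (in Python) of what
a server-side script could do if the client did send the parameter. The
function name decode_form and the ISO-8859-1 fallback are assumptions made
for illustration, not the behaviour of any particular server or client.

```python
from email.message import Message
from urllib.parse import parse_qs, urlencode


def decode_form(content_type: str, body: str,
                default_charset: str = "iso-8859-1") -> dict:
    """Decode a urlencoded form body using the charset named in Content-Type.

    decode_form is a hypothetical helper; the fallback charset is an assumption.
    """
    msg = Message()
    msg["Content-Type"] = content_type
    # get_param returns the fallback when no charset parameter is present.
    charset = msg.get_param("charset", default_charset)
    return parse_qs(body, encoding=charset, errors="strict")


# Simulate a client submitting Japanese text encoded as EUC-JP.
body = urlencode({"q": "日本語"}, encoding="euc_jp")
fields = decode_form("application/x-www-form-urlencoded; charset=EUC-JP", body)
```

With the parameter present the script decodes the data directly; without
it, the script is back to guessing.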

>Can all servers deal with an appended charset parameter? We would
>need to investigate this before adding charset. Discussing these
>things on this mailing list is all very well, but actually making
>changes to our software requires a lot of care.

This "bugward compatibility" is one of the primary reasons things
haven't changed. This and the "well, it seems to work now" attitude. 
You could at least make it an option.

>> As it is now, any data received must be sniffed to figure out what it
>> is. Not very useful on a site that could get queries in both EUC-KR
>> and EUC-JP... even shift-jis and EUC can be mistaken.
> 
>I suppose it would be nice if people could submit EUC-KR data even if
>the original form is in EUC-JP or ISO-8859-1. Currently, people seem to
>get by with results sent back in the same encoding as the form itself.
>Haven't heard too many complaints about this.

What happens if you have, on a single site, many different forms in
many different encodings? What happens if the forms are dynamically
generated, where you do not know a priori what the encoding of the
form is/was? Then you have to rely on data sniffing, in which case it
is not easy to distinguish EUC-KR and EUC-JP. Data sniffing would also
be simplified if a single encoding were chosen for each language.
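
The ambiguity is easy to demonstrate. The following Python sketch (the
helper name plausible_encodings is hypothetical) simply tries each
candidate codec and keeps the ones that decode without error. Korean text
encoded as EUC-KR also passes as EUC-JP, because both encodings use byte
pairs in the same range.

```python
def plausible_encodings(data: bytes, candidates: list[str]) -> list[str]:
    """Return the candidate codecs that decode data without error.

    A naive sniffing sketch: it only checks validity, not plausibility.
    """
    ok = []
    for name in candidates:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue
        ok.append(name)
    return ok


korean = "한국어".encode("euc_kr")
matches = plausible_encodings(korean, ["euc_jp", "euc_kr"])
# Decoding as EUC-JP succeeds but yields unrelated kanji, not the Korean text.
misread = korean.decode("euc_jp")
```

Both codecs accept the data, so validity alone cannot tell the server
which encoding the client used.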

The current situation can be made to work if you assume a single
language, and a single primary encoding. It fails when you try to
create truly multilingual sites.

>> It is more than a year and a half since I pointed this out.
> 
>Right, so I guess this is not really a much demanded feature.

I don't think so. The I18N discussions are quite old now, and I18N is
probably the one area where software vendors are almost uniformly
poor. Until recently, the number of web sites in Japan was small, and
now it is exploding. The problems we have now will be magnified
many times over.

>Under Options -> Document Encoding, you will find a list of charsets.
>The actual spelling of the various charset names can be found in
> 
>  ftp://ftp.isi.edu/in-notes/iana/assignments/character-sets

So do you recognise EUC-JP (which I believe is not in the IANA list,
except as Extended_UNIX_Code_Fixed_Width_for_Japanese)?

Also, even when I send the parameter, why does the document info form
tell me the encoding is iso-2022-jp?

Received on Thursday, 20 June 1996 22:17:43 UTC