Re: [Moderator Action] Using unicode in CGI programs from Erik van der Poel on 2000-02-09 (www-international@w3.org from January to March 2000)

Forwarded message 1

From: Erik van der Poel <erik@netscape.com>
Date: Tue, 09 Nov 1999 11:11:34 -0800
Subject: Re: HTML forms and UTF-8
To: Glen Perkins <Glen.Perkins@nativeguide.com>
CC: Unicode List <unicode@unicode.org>
Message-ID: <382871E6.6BADF692@netscape.com>

Glen Perkins wrote:
> 
> I'd really like to take the Right Path of encoding the form in UTF-8 and
> having it return the form data in UTF-8, so I could have a generic solution
> of any language(s) going out and any language(s) coming back. It really does
> have to work, though, or else the people I do it for, who don't know much
> about i18n and therefore hate and oppose it, will say "See! We told you it
> was a bad idea!" Urrrgh.
> 
> Do you know under what circumstances this is likely to work? Would it work,
> say, for both IE and Netscape, versions 3 or later, on Win & Macs? I'd
> certainly prefer to be more generic than that (support for unix being
> particularly near to my heart), but current browser stats indicate that
> anything that works on the above (NS/IE 3+ on Win/Mac) would cover a large
> enough percentage of the market to be worth doing. Requiring version 4
> browsers might even be tolerable now in many cases. (And I'm talking about
> the Internet at large, not an intranet.)

Netscape started supporting Unicode in the Windows version of Navigator
3.0. However, the feature was disabled by default, and could be enabled
only through a special registry setting. (Mac/Unix Nav3 doesn't support
Unicode.)

Navigator 4.0 supports Unicode on all of the platforms (Windows, Mac,
Unix), except that the Win32 version does not support the crucial font
switching (font linking in MS-speak). This means that the Win32 version
of Nav4 will only use one font for Unicode documents. (Win16/Mac/Unix
Nav4 supports font switching.)

Moreover, in Win32, the font must be set manually by the user in the
font preferences dialog. The default fonts for Unicode documents are
Times and Courier, even in the Japanese version of Navigator. So
Japanese UTF-8 documents will not display correctly on the average
Japanese Win32 Nav4 user's machine, since most users do not fiddle with
font prefs, particularly the Unicode ones.

So I suppose you could take a look at Navigator's market share in
non-Times/Courier markets such as Japan, Korea, Taiwan, China, etc., and
if you think that market share is small enough to ignore those users,
you could choose to use UTF-8 in your application (HTML form + CGI).

If you decide that their market share is not small enough to ignore, you
could support them via multiple monolingual documents in traditional
charsets such as Shift_JIS, EUC-KR, Big5, GB2312, etc.
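
A minimal sketch of that fallback, in Python for illustration only: pick
one traditional charset per language, label it in both the HTTP header
and the HTML META element, and encode the page bytes to match. The
LEGACY_CHARSETS table, the render_form helper and the /cgi-bin/submit
action URL are hypothetical names, not taken from any real server.

```python
import html

# Hypothetical mapping from page language to a traditional charset.
LEGACY_CHARSETS = {
    "ja": "Shift_JIS",
    "ko": "EUC-KR",
    "zh-TW": "Big5",
    "zh-CN": "GB2312",
}


def render_form(lang, prompt):
    """Return (HTTP header line, encoded page bytes) for one monolingual form."""
    # Languages without a table entry fall back to UTF-8 (an assumption).
    charset = LEGACY_CHARSETS.get(lang, "UTF-8")
    header = f"Content-Type: text/html; charset={charset}"
    page = (
        "<html><head>"
        f'<meta http-equiv="Content-Type" content="text/html; charset={charset}">'
        "</head><body>"
        '<form method="post" action="/cgi-bin/submit">'
        f'{html.escape(prompt)} <input type="text" name="q">'
        '<input type="submit">'
        "</form></body></html>"
    )
    # Raises UnicodeEncodeError if the prompt cannot be written in the charset.
    return header, page.encode(charset)
```

The header label and the META label are generated from the same value,
so they cannot disagree with each other or with the bytes actually sent.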

> > > In theory, if you can reliably label the charset of the HTML document
> > > containing the form (via HTTP charset and HTML META charset), then the
> > > form submission should be in that charset too. You can then simply
> > > insert that charset label in the hidden input field too, and look at
> > > that when the form submission arrives.
> >
> > Doesn't work through transcoding (incl. translation) servers.  I've also
> > heard stories of old Japanese browsers that would transcode the input to
> > the platform encoding and then forget what the original was.  So forms are
> > submitted in the platform encoding, regardless.  Certainly broken,
> > probably mostly extinct by now, but still shows how a bad protocol can
> > come and bite you.
> 
> Yes, I obviously need to add to the above IE/NS on Win/Mac specification
> that it work on all major language versions of those browsers.

I think he may have been referring to old Japanese versions of Mosaic
and others, not the Japanese versions of Netscape.

> So, François, it sounds as though your hack -- returning known data from a
> hidden field to determine the encoding -- might be needed as a data
> integrity check at the very least.
> 
> Now I'm wondering what such data would look like and what could be learned
> from it. If I just put a bunch of bytes up there and they're echoed back at
> me verbatim, what would that tell me? I can imagine putting up a page
> encoded in Shift-JIS with a hidden field also in Shift-JIS, using the
> ACCEPT-CHARSET="UTF-8" technique, and then testing the result to see whether
> it came back as UTF-8, unchanged, or other. If unchanged, though, would that
> mean the returned data really was Shift-JIS? It seems to me it could also be
> Big-5, Latin-1, or any of several other encodings, returned by a browser
> that used the default system encoding to encode form data.

Most browsers submit forms in the same charset as the original form. So
if your form is in Shift_JIS, and the user can actually read it, then
the browser must know that it is in Shift_JIS, and will submit the form
in Shift_JIS.

On the other hand, if the user is viewing your form through a translator
that translates Japanese to traditional Chinese, then the form might be
in Big5 by the time it reaches the browser. The form submission will
then also be in Big5.
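
To make the known-data check concrete, here is a rough Python sketch:
the form carries a hidden field whose value is a known string, and the
CGI side compares the submitted bytes against that string encoded in
each candidate charset. The sentinel text, the candidate list and the
detect_form_charset name are assumptions made for illustration. The
sentinel has to contain non-ASCII characters, since pure ASCII bytes
come back identical in all of these charsets and tell you nothing.

```python
from urllib.parse import unquote_to_bytes

# Hypothetical sentinel: its byte sequence differs in every candidate charset.
SENTINEL = "日本語"
CANDIDATES = ["utf-8", "shift_jis", "euc_jp", "big5"]


def detect_form_charset(raw_value, sentinel=SENTINEL, candidates=CANDIDATES):
    """Guess the submission charset from a percent-encoded hidden field value.

    Returns the first candidate whose encoding of the sentinel matches the
    submitted bytes, or None if no candidate matches.
    """
    # In form-urlencoded data, "+" stands for a space.
    raw = unquote_to_bytes(raw_value.replace("+", " "))
    for charset in candidates:
        try:
            expected = sentinel.encode(charset)
        except UnicodeEncodeError:
            # The sentinel cannot be written in this charset at all.
            continue
        if raw == expected:
            return charset
    return None
```

A result of None means the bytes match none of the candidates, for
example because an intermediary rewrote the hidden field or the browser
used a charset that is not in the list.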

One question is whether the hidden field will also have been translated.
I conducted a little experiment with AltaVista yesterday, and found that
the text in a Spanish submit button was not translated to English, while
the rest of the document was translated. Future versions of the
translator(s) may be more aggressive, however, and actually translate
the text in HTML attribute values too, including hidden fields perhaps?
But they wouldn't want to translate *all* HTML attribute values (e.g.
align="right"), so perhaps they wouldn't translate hidden fields either.

Erik