Re: UTF-8 transfers from browser forms to servers from Martin Duerst on 2003-03-02 (www-international@w3.org from January to March 2003)

From: Martin Duerst <duerst@w3.org>
Date: Sun, 02 Mar 2003 09:41:42 -0500
To: "Chris Haynes" <chris@harvington.org.uk>, <www-international@w3.org>
Message-Id: <4.2.0.58.J.20030301183340.03ca2a58@localhost>
At 20:43 03/02/20 +0000, Chris Haynes wrote:

>Dennis,
>
>I'd be interested to hear if the following helps you (or adds to the
>confusion!):
>
>http://jetty.mortbay.com/jetty/doc/international.html

Hello Chris,

Some comments:
"The Internet was initially designed and constructed using basic English
characters encoded in the 7-bit US-ASCII character set."
    If 'Internet' means TCP/IP, then this is wrong, because TCP/IP is
    8-bit clean. If 'Internet' stands for WWW (which it does for many
    people), then this is wrong because the WWW was developed at CERN
    in Geneva starting with iso-8859-1.

"There is a default character set ISO-8859-1, which supports most western
European languages, and is currently the official 'default' content encoding
for content carried by HTTP"
    This 'default' is still in some specs, but in practice, it's utterly
    worthless. Please tell people to always specify the encoding.

"This mechanism can be unreliable; the browser's user can select the
encoding to be applied, which may be different from that intended by
the servlet designer."
    This is not wrong, but misleading. Users will change the encoding
    if they can't otherwise read the page, but won't want to read a
    nicely displaying page as garbage.

"Today the Internet is converging on a single, common encoding - Unicode -"
    Unicode isn't an encoding. UTF-8, UTF-16,... are encodings.
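
    To illustrate the distinction, here is a minimal sketch (the class
    and method names are just made up for this example): one Unicode
    character, U+00FC, comes out as different octets depending on which
    encoding is used to serialize it.

```java
import java.nio.charset.StandardCharsets;

final class EncodingForms {
    // Render a byte array as space-separated uppercase hex, e.g. "C3 BC".
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = "\u00FC"; // U+00FC, LATIN SMALL LETTER U WITH DIAERESIS
        System.out.println(hex(s.getBytes(StandardCharsets.UTF_8)));      // C3 BC
        System.out.println(hex(s.getBytes(StandardCharsets.UTF_16BE)));   // 00 FC
        System.out.println(hex(s.getBytes(StandardCharsets.ISO_8859_1))); // FC
    }
}
```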

"Unicode is the only character encoding used in XML and is now the default
in HTML, XHML and in most Java implementations."
    - It's not the default in HTML
    - XHML->XHTML
    - 'most Java implementations'? Do you know any Java implementation
      that is not using Unicode? My understanding is that if any such
      thing existed, it would not be Java.

Post vs. Get: This also depends on caching, side effects, bookmarking,...
    Please see http://www.w3.org/2001/tag/doc/get7.html#i18n (and the
    whole document) [I saw that you got to this later, but
    I thought it should come earlier.]

"Thus, although any desired octet sequence can be placed in a URL,
none of the standards tell the web server how to interpret that octet
sequence."
    There is a very widely respected convention: You get the octets
    back in the same encoding in which the page was sent out.
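
    A rough sketch of what this convention means for the server, using
    java.net.URLEncoder to stand in for the browser (the class name
    FormOctets and its helpers are invented for the example): the same
    form value arrives as different escaped octets depending on the
    encoding of the page that contained the form, so the server has to
    decode with that same encoding.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

final class FormOctets {
    // Stand-in for the browser: escape a form value using the page's encoding.
    static String encode(String value, String charsetName) {
        try {
            return URLEncoder.encode(value, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException(e);
        }
    }

    // Stand-in for the server: unescape the octets with an assumed encoding.
    static String decode(String escaped, String charsetName) {
        try {
            return URLDecoder.decode(escaped, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        String value = "\u00FC"; // u-umlaut
        System.out.println(encode(value, "UTF-8"));      // %C3%BC
        System.out.println(encode(value, "ISO-8859-1")); // %FC

        // Decoding with the page's encoding recovers the value;
        // decoding with a different one does not.
        System.out.println(decode("%C3%BC", "UTF-8").equals(value));      // true
        System.out.println(decode("%C3%BC", "ISO-8859-1").equals(value)); // false
    }
}
```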

"By the time the characters are made available to the Servlet as a
String it is in the Unicode encoding used by Java."
    This is complicated by Java servlets' problems in this area.
    Although in newer versions, there are some calls to tell the
    servlet logic in which encoding to interpret the URI,
    (request.setCharacterEncoding()), that just didn't work for me.
    In my servlet code, I have had to use things similar to:
    String field = new String(
        request.getParameter("fieldname").getBytes("iso-8859-1"), "utf-8");
    But that may be different for different servlet implementations.
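
    Why that workaround works can be shown without a servlet container.
    The sketch below is only a simulation under the assumption that the
    container decoded UTF-8 octets as iso-8859-1; the class and method
    names are made up. Because iso-8859-1 maps every octet to exactly
    one character, getting the bytes back loses nothing, and they can
    then be decoded as the UTF-8 they really were.

```java
import java.nio.charset.StandardCharsets;

final class Redecode {
    // Undo a wrong iso-8859-1 decoding of octets that were actually UTF-8.
    static String latin1ToUtf8(String misdecoded) {
        byte[] originalOctets = misdecoded.getBytes(StandardCharsets.ISO_8859_1);
        return new String(originalOctets, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String sent = "\u00FCber"; // what the user typed into the form

        // Octets on the wire, as submitted from a UTF-8 page.
        byte[] wire = sent.getBytes(StandardCharsets.UTF_8);

        // Assumed container behaviour: decode the octets as iso-8859-1.
        String fromContainer = new String(wire, StandardCharsets.ISO_8859_1);

        System.out.println(fromContainer.equals(sent));               // false
        System.out.println(latin1ToUtf8(fromContainer).equals(sent)); // true
    }
}
```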

"There is an Internet draft (which expires Oct. 2002)"
    That has been updated. And it will be updated again very soon.

"Accordingly, Jetty 4.1 has reverted to a default encoding of ISO-8859-1."
    The right thing to do here is not to go back and forth with defaults,
    but to make it easy and straightforward for servlet programmers to
    get the data in the encoding they expect.

"The first example, with the literal u", should only be used if the
character encoding can be relied upon, and if support for 'legacy'
browsers (those not understanding the &...; encoding) is essential."
    No. If you know the encoding (and on the server, you definitely
    should know the encoding of your pages), then using that encoding
    is the best thing to do. The &...; variants are just fallbacks
    for the case that you cannot directly encode a character.
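
    As a sketch of that rule (the helper name forCharset is invented,
    and escaping of &, < and > is deliberately left out): write each
    character directly when the page encoding can represent it, and
    fall back to a numeric character reference only when it cannot.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

final class NcrFallback {
    // Keep every character the page encoding can represent; replace the
    // rest with hexadecimal numeric character references.
    static String forCharset(String text, Charset pageEncoding) {
        CharsetEncoder encoder = pageEncoding.newEncoder();
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            String ch = new String(Character.toChars(cp));
            if (encoder.canEncode(ch)) {
                out.append(ch);
            } else {
                out.append("&#x")
                   .append(Integer.toHexString(cp).toUpperCase())
                   .append(';');
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String text = "\u00FC\u0391"; // u-umlaut, Greek capital alpha
        // In US-ASCII neither character can be written directly.
        System.out.println(forCharset(text, StandardCharsets.US_ASCII)); // &#xFC;&#x391;
        // In UTF-8 both can, so no reference is produced.
        System.out.println(forCharset(text, StandardCharsets.UTF_8).equals(text)); // true
    }
}
```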

"Where the character has a defined abbreviation (such as &uuml;
for u-umlaut)" (in XML)
    In XML, no such thing as &uuml; is predefined.
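
    This can be checked with the XML parser that ships with Java. The
    sketch below (class and method names are hypothetical) parses small
    documents without a DTD: a reference to &uuml; is a well-formedness
    error, while the predefined &amp; and a numeric character reference
    are fine.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

final class XmlEntityCheck {
    // Returns true if the given text parses as a well-formed XML document.
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without printing them.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new ByteArrayInputStream(
                xml.getBytes(StandardCharsets.UTF_8)));
            return true;
        } catch (SAXException e) {
            return false; // not well-formed
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isWellFormed("<p>&uuml;</p>"));  // false
        System.out.println(isWellFormed("<p>&amp;</p>"));   // true
        System.out.println(isWellFormed("<p>&#xFC;</p>"));  // true
    }
}
```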

"Use of the decimal form (&#913;) seems now to be unfashionable
in W3C circles."
    Well, yes, but it's not a matter of fashion. The Unicode standard
    is all hex, and not having to do conversions is a big win.
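
    For example, the decimal reference &#913; quoted above is the same
    character as &#x391;, which can be looked up directly as U+0391 in
    the Unicode charts. A small sketch of the conversion (the class and
    method names are invented):

```java
final class NcrForms {
    // Decimal numeric character reference, e.g. &#913;
    static String decimalRef(int codePoint) {
        return "&#" + codePoint + ";";
    }

    // Hexadecimal numeric character reference, e.g. &#x391;
    static String hexRef(int codePoint) {
        return "&#x" + Integer.toHexString(codePoint).toUpperCase() + ";";
    }

    public static void main(String[] args) {
        int alpha = 0x0391; // U+0391, GREEK CAPITAL LETTER ALPHA
        System.out.println(decimalRef(alpha)); // &#913;
        System.out.println(hexRef(alpha));     // &#x391;
    }
}
```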

XForms: You may want to point out that XForms requires the use of UTF-8
    for GET. Using UTF-8 pages and servlets that process the results
    now can make moving over to XForms easier.

"Character Model (currently a working draft)"
    You may want to say that that, too, is in last call.


Hope this helps.    Regards,    Martin.
Received on Sunday, 2 March 2003 11:39:44 UTC