- From: Martin Duerst <duerst@w3.org>
- Date: Sun, 02 Mar 2003 09:41:42 -0500
- To: "Chris Haynes" <chris@harvington.org.uk>, <www-international@w3.org>
At 20:43 03/02/20 +0000, Chris Haynes wrote:
>Dennis,
>
>I'd be interested to hear if the following helps you (or adds to the
>confusion!):
>
>http://jetty.mortbay.com/jetty/doc/international.html

Hello Chris,

Some comments:

"The Internet was initially designed and constructed using basic English characters encoded in the 7-bit US-ASCII character set."

If 'Internet' means TCP/IP, then this is wrong, because TCP/IP is 8-bit clean. If 'Internet' stands for WWW (which it does for many people), then this is wrong because the WWW was developed at CERN in Geneva starting with iso-8859-1.

"There is a default character set ISO-8859-1, which supports most western European languages, and is currently the official 'default' content encoding for content carried by HTTP"

This 'default' is still in some specs, but in practice, it's utterly worthless. Please tell people to always specify the encoding.

"This mechanism can be unreliable; the browser's user can select the encoding to be applied, which may be different from that intended by the servlet designer."

This is not wrong, but misleading. Users will change the encoding if they can't otherwise read the page, but won't want to read a nicely displaying page as garbage.

"Today the Internet is converging on a single, common encoding - Unicode -"

Unicode isn't an encoding. UTF-8, UTF-16,... are encodings.

"Unicode is the only character encoding used in XML and is now the default in HTML, XHML and in most Java implementations."

- It's not the default in HTML
- XHML->XHTML
- 'most Java implementations'? Do you know any Java implementation that is not using Unicode? My understanding is that if any such thing existed, it would not be Java.

Post vs. Get: This also depends on caching, side effects, bookmarking,... Please see http://www.w3.org/2001/tag/doc/get7.html#i18n (and the whole document). [I saw that you got to this later, but I thought it should come earlier.]
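The distinction above between Unicode (a character set) and UTF-8, UTF-16, etc. (encodings of it) can be illustrated with a minimal Java sketch. It takes one Unicode character, U+00FC (u-umlaut), and prints the different byte sequences that three encodings produce for it. The class name UnicodeVsEncoding and the helper hex are hypothetical names chosen for this illustration only.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: one Unicode character, three different encodings.
class UnicodeVsEncoding {
    // Render a byte array as space-separated uppercase hex pairs.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = "\u00FC"; // U+00FC, LATIN SMALL LETTER U WITH DIAERESIS
        System.out.println("ISO-8859-1: " + hex(s.getBytes(StandardCharsets.ISO_8859_1))); // FC
        System.out.println("UTF-8:      " + hex(s.getBytes(StandardCharsets.UTF_8)));      // C3 BC
        System.out.println("UTF-16BE:   " + hex(s.getBytes(StandardCharsets.UTF_16BE)));   // 00 FC
    }
}
```

The character is the same in all three cases; only the octets on the wire differ, which is why a page or request has to state which encoding it uses.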
"Thus, although any desired octet sequence can be placed in a URL, none of the standards tell the web server how to interpret that octet sequence." There is a very widely respected convention: You get the octets back in the same encoding that the page was sent out. "By the time the characters are made available to the Servlet as a String it is in the Unicode encoding used by Java." This is complicated by Java servlet's problems in this area. Although in newer versions, there are some calls to tell the servlet logic in which encoding to interpret the URI, (request.setCharacterEncoding()), that just didn't work for me. In my servlet code, I have had to use things similar to: String field = new String (request.getParameter("fieldname").getBytes("iso-8859-1"), "utf-8"); But that may be different for different servlet implementations. "There is an Internet draft (which expires Oct. 2002)" That has been updated. And it will be updated again very soon. "Accordingly, Jetty 4.1 has reverted to a default encoding of ISO-8859-1." The right thing to do here is to not to go back and forth with defaults, but to make it easy and straightforward for servlet programmers to get the data in the encoding they expect. "The first example, with the literal u", should only be used if the character encoding can be relied upon, and if support for 'legacy' browsers (those not understanding the &...; encoding) is essential." No. If you know the encoding (and on the server, you definitely should know the encoding of your pages), then using that encoding is the best thing to do. The &...; variants are just fallbacks for the case that you cannot directly encode a character. "Where the character has a defined abbreviation (such as ü for u-umlaut)" (in XML) In XML, no such things as ü is predefined. "Use of the decimal form (Α) seems now to be unfashionable in W3C circles." Well, yes, but it's not a matter of fashion. The Unicode standard is all hex, and not having to do conversions is a big win. 
XForms: You may want to point out that XForms requires the use of UTF-8 for GET. Using UTF-8 pages and servlets that process the results now can make moving over to XForms easier.

"Character Model (currently a working draft)"

You may want to say that that, too, is in last call.

Hope this helps.

Regards, Martin.
Received on Sunday, 2 March 2003 11:39:44 UTC