- From: Paul Deuter <Paul.Deuter@plumtree.com>
- Date: Mon, 22 Oct 2001 08:27:10 -0700
- To: "Yves Arrouye" <yves@realnames.com>, "Shigemichi Yazawa" <yazawa@globalsight.com>, <www-international@w3.org>
- Cc: "souravm" <souravm@infy.com>
We have been struggling with this same problem here too. Let me clarify a couple of points, however:

1. Both encodings, CP1252 and 8859-1, have "holes". For 8859-1 the range 80-9F is invalid. For CP1252, the values 80, 81, 8D, 8E, 8F, 90, 9D, and 9E are invalid (according to Kano's book).

2. Roundtripping with Unicode works with both CP1252 and 8859-1 because all the valid characters of both encodings are also in Unicode. If you start with a valid character in CP1252, you can roundtrip that character (i.e. convert it to Unicode and then back to CP1252) without loss of data.

3. The Servlet 2.3 spec has new features that allow a programmer to set the character set on requests and responses. But there appears to be no way to do this reliably with earlier versions. The response.setContentType method will set the HTTP header and also cause Unicode strings to be converted as they are sent to the browser - this is fine for outputting text data to the browser. But there is no reliable way to convert characters that are sent to the server in the request.

The often-suggested method for converting characters in the request is a line of code that looks like this:

String strParam = new String(request.getParameter("SomeName").getBytes("8859_1"), "UTF8");

What this code is supposed to do is "undo" the improper default conversion that occurs in getParameter. Supposedly, by calling getBytes with 8859_1 you convert your Unicode back into bytes and then re-interpret those bytes correctly - in this example, as UTF-8.

This is where the "holes" become problematic. If your incoming request data really is UTF-8, then there may be octets whose values are in the invalid range for 8859-1 or CP1252. If that is the case, then the improper conversion in getParameter will cause these octets to be mapped to 0xFFFD, and the subsequent getBytes will in turn convert the 0xFFFD to 3F. The value of the original octet will be lost forever.
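To make the failure mode concrete, here is a minimal, standalone sketch of the roundtrip described above. It assumes a container that decodes request bytes with Java's ISO-8859-1 converter (which happens to map every byte 0x00-0xFF to a character, so the "undo" idiom succeeds there); the last two lines show what happens once a converter has already substituted U+FFFD - getBytes() replaces the unmappable character with '?' (0x3F) and the original octet is gone. The class name is illustrative only.

```java
public class RoundTripDemo {
    public static void main(String[] args) throws Exception {
        // UTF-8 bytes for U+20AC (EURO SIGN) as a browser would send them: E2 82 AC.
        byte[] utf8FromBrowser = {(byte) 0xE2, (byte) 0x82, (byte) 0xAC};

        // A pre-2.3 container misdecodes the octets as 8859-1 in getParameter.
        String misdecoded = new String(utf8FromBrowser, "ISO-8859-1");

        // The "undo" idiom: re-encode as 8859-1, then re-interpret as UTF-8.
        String repaired = new String(misdecoded.getBytes("ISO-8859-1"), "UTF-8");
        System.out.println(repaired); // U+20AC, recovered

        // The failure mode: if the container's converter had already replaced
        // an undecodable octet with U+FFFD, getBytes() maps the unmappable
        // U+FFFD to '?' (0x3F), so the original octet is lost forever.
        byte[] lost = "\uFFFD".getBytes("ISO-8859-1");
        System.out.printf("%02x%n", lost[0]); // 3f
    }
}
```

Whether the roundtrip survives thus depends entirely on whether the converter the container picked preserves all 256 byte values or substitutes U+FFFD for the "holes".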
Furthermore, the input stream is completely consumed by the request object, so the original data is unavailable for further processing.

It seems that the only solution for pre-2.3 servlet code is to mark the form data as binary so that the request object will not try to read it at all. Then write your own code to read the form data: convert all the %HH values to octets, interpret the stream as the appropriate character set, and then parse out the form data.

If anyone knows a better solution, I would be very glad to hear it. We cannot depend upon Servlet 2.3 yet because it is too new and not widely installed.

-Paul

Paul Deuter
Internationalization Manager
Plumtree Software
paul.deuter@plumtree.com

-----Original Message-----
From: Yves Arrouye [mailto:yves@realnames.com]
Sent: Monday, October 22, 2001 12:11 AM
To: 'Shigemichi Yazawa'; www-international@w3.org
Subject: RE: Servlet question

> Yes, two wrong conversions make a right result. However, Cp1252
> doesn't always work this way. The Cp1252 <-> Unicode mapping table
> includes 5 undefined entries. If you pass 0x81, for example, to the
> byte-to-char converter, it is converted to U+FFFD (REPLACEMENT
> CHARACTER) and the round trip is not possible. Only ISO-8859-1 is the
> safe, round-trippable encoding as far as I know.

Isn't ISO-8859-1 actually the one that has "holes" in C0/C1 that exhibit this very behavior? I thought that was the case, and windows-1252 was the one that used C1 for platform-specific characters (see http://www-124.ibm.com/cvs/icu/charset/data/xml/windows-1252-2000.xml?rev=1.1&content-type=text/x-cvsweb-markup where apparently U+0081 is mapped to 0x81 in windows-1252).

YA
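The manual form-data parsing Paul suggests (reading the raw body yourself, converting %HH escapes to octets, then decoding with an explicit character set) could be sketched roughly as follows. The class and method names are illustrative, not part of the servlet API, and error handling is elided.

```java
import java.io.ByteArrayOutputStream;
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch: decode application/x-www-form-urlencoded data with an
// explicit charset, bypassing the container's default conversion.
public class FormDecoder {
    public static Map<String, String> parse(String raw, String charset) throws Exception {
        Map<String, String> params = new LinkedHashMap<>();
        for (String pair : raw.split("&")) {
            int eq = pair.indexOf('=');
            if (eq < 0) continue;
            params.put(decode(pair.substring(0, eq), charset),
                       decode(pair.substring(eq + 1), charset));
        }
        return params;
    }

    // Convert '+' and %HH escapes back to octets, then interpret the
    // octets in the caller-specified character set.
    static String decode(String s, String charset) throws Exception {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '+') {
                out.write(' ');
            } else if (c == '%' && i + 2 < s.length()) {
                out.write(Integer.parseInt(s.substring(i + 1, i + 3), 16));
                i += 2;
            } else {
                out.write(c);
            }
        }
        return new String(out.toByteArray(), charset);
    }

    public static void main(String[] args) throws Exception {
        // %E2%82%AC is the UTF-8 encoding of U+20AC (EURO SIGN).
        System.out.println(parse("price=%E2%82%AC100&q=a+b", "UTF-8"));
    }
}
```

Because the octets never pass through the container's 8859-1/Cp1252 converter, none of them can be clobbered to U+FFFD along the way.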
Received on Monday, 22 October 2001 11:25:41 UTC