- From: Tex Texin <texin@progress.com>
- Date: Mon, 22 Oct 2001 12:30:37 -0400
- To: Paul Deuter <Paul.Deuter@plumtree.com>
- CC: Yves Arrouye <yves@realnames.com>, Shigemichi Yazawa <yazawa@globalsight.com>, www-international@w3.org, souravm <souravm@infy.com>
Kano's book is outdated. In 1252, 80 is the Euro. 8E and 9E are also
assigned now to the Z with Caron (or Hacek). See:
http://www.microsoft.com/globaldev/reference/sbcs/1252.htm

This doesn't detract from your other points. The fact that Microsoft
code pages are moving targets is a related issue.

tex

Paul Deuter wrote:
>
> We have been struggling with this same problem here too. Let me
> clarify a couple of points however:
>
> 1. Both encodings CP1252 and 8859-1 have "holes". For 8859-1
> the range 80-9F is invalid. For CP1252, the values 80, 81, 8D,
> 8E, 8F, 90, 9D, and 9E are invalid (according to Kano's book).
>
> 2. Roundtripping with Unicode works with both CP1252 and 8859-1
> because all the valid characters of both these encodings are also
> in Unicode. If you start with a valid character in CP1252, you can
> roundtrip that character (i.e. convert it to Unicode and then back to
> CP1252) without loss of data.
>
> 3. The Servlet 2.3 spec has new features that allow a programmer
> to set the character set on requests and responses. But there appears
> to be no way to do this reliably with earlier versions.
>
> The response.setContentType method will set the HTTP header and also
> cause Unicode strings to be converted as they are sent to the
> browser - this is fine for outputting text data to the browser. But
> there is no reliable way to convert characters that are sent to the
> server in the request. The often suggested method for converting
> characters in the request is to use a line of code that looks like
> this:
>
>     String strParam = new
>         String(request.getParameter("SomeName").getBytes("8859_1"), "UTF8");
>
> What this code is supposed to do is "undo" the improper default
> conversion that occurs in getParameter. Supposedly by calling getBytes
> with 8859_1, you will convert your Unicode back into bytes and then
> re-interpret those bytes correctly, in this example, as UTF-8.
>
> This is where the "holes" become problematic. If your incoming request
> data really is UTF-8 then there may be octets whose values are in the
> invalid range for 8859-1 or CP1252. If that is the case, then the
> improper conversion in getParameter will cause these octets to be
> mapped to 0xFFFD and the subsequent getBytes will in turn convert the
> 0xFFFD to 3F. The value of the original octet will be lost forever.
> Furthermore, the input stream is completely consumed by the request
> object, so the original data is unavailable for further processing.
>
> It seems that the only solution for pre-2.3 Servlet code is to mark
> the form data as binary so that the request object will not try to
> read it at all. Then write your own code to read the form data, which
> should convert all the %HH values to octets, then interpret the stream
> as the appropriate character set, and then parse out the form data.
>
> If anyone knows a better solution, I would be very glad to hear it. We
> cannot depend upon the Servlet 2.3 version yet because it is too new
> and not widely installed.
>
> -Paul
>
> Paul Deuter
> Internationalization Manager
> Plumtree Software
> paul.deuter@plumtree.com
>
> -----Original Message-----
> From: Yves Arrouye [mailto:yves@realnames.com]
> Sent: Monday, October 22, 2001 12:11 AM
> To: 'Shigemichi Yazawa'; www-international@w3.org
> Subject: RE: Servlet question
>
> > Yes, two wrong conversions make a right result. However, Cp1252
> > doesn't always work this way. Cp1252 <-> Unicode mapping table
> > includes 5 undefined entries. If you pass 0x81, for example, to byte
> > to char converter, it is converted to U+fffd (REPLACEMENT CHARACTER)
> > and the round trip is not possible. Only ISO-8859-1 is the safe, round
> > trippable encoding as far as I know.
>
> Isn't ISO-8859-1 actually the one that has "holes" in C0/C1 that exhibit
> this very behavior? I thought that was the case, and windows-1252 was
> the one that used C1 for platform-specific characters (see
> http://www-124.ibm.com/cvs/icu/charset/data/xml/windows-1252-2000.xml?rev=1.1&content-type=text/x-cvsweb-markup
> where apparently U+0081 is mapped to 0x81 in windows-1252).
>
> YA

--
-------------------------------------------------------------
Tex Texin                    Director, International Business
mailto:Texin@Progress.com    Tel: +1-781-280-4271
the Progress Company         Fax: +1-781-280-4655
-------------------------------------------------------------
Received on Monday, 22 October 2001 12:30:43 UTC
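
The round-trip problem and the manual workaround discussed in this thread can be illustrated with a minimal Java sketch. It is an illustration only, not code from any of the posters. The class and method names (`CharsetRoundTrip`, `misdecodeThenRepair`, `decodeFormValue`) are hypothetical, the `java.nio.charset` calls postdate the thread (the 2001 idiom used string names such as "8859_1"), and the sketch assumes the JDK provides a `windows-1252` charset whose decoder maps unassigned bytes such as 0x81 to U+FFFD, as described above.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; none of these names come from the Servlet API.
final class CharsetRoundTrip {

    // Simulates a container decoding request octets with the wrong charset,
    // then applies the "undo" idiom: re-encode with that same charset and
    // decode the resulting bytes as UTF-8.
    static String misdecodeThenRepair(byte[] wireBytes, Charset assumed) {
        String wrong = new String(wireBytes, assumed);
        byte[] recovered = wrong.getBytes(assumed);
        return new String(recovered, StandardCharsets.UTF_8);
    }

    // Decodes one application/x-www-form-urlencoded value by hand:
    // %HH escapes become octets, '+' becomes a space, and only then are
    // the octets interpreted in the given charset.
    static String decodeFormValue(String raw, Charset charset) {
        ByteArrayOutputStream octets = new ByteArrayOutputStream();
        for (int i = 0; i < raw.length(); i++) {
            char c = raw.charAt(i);
            if (c == '%') {
                if (i + 2 >= raw.length()) {
                    throw new IllegalArgumentException("truncated escape at index " + i);
                }
                int hi = Character.digit(raw.charAt(i + 1), 16);
                int lo = Character.digit(raw.charAt(i + 2), 16);
                if (hi < 0 || lo < 0) {
                    throw new IllegalArgumentException("bad escape at index " + i);
                }
                octets.write((hi << 4) | lo);
                i += 2; // skip the two hex digits just consumed
            } else if (c == '+') {
                octets.write(' ');
            } else {
                // Assumption: unescaped characters in a form value are ASCII.
                octets.write(c);
            }
        }
        return new String(octets.toByteArray(), charset);
    }

    public static void main(String[] args) {
        // U+00C1 encodes in UTF-8 as the octets C3 81; 0x81 is one of the
        // bytes that windows-1252 leaves unassigned.
        byte[] wire = "\u00C1".getBytes(StandardCharsets.UTF_8);
        Charset latin1 = StandardCharsets.ISO_8859_1;
        Charset cp1252 = Charset.forName("windows-1252");

        System.out.println("ISO-8859-1 round trip intact: "
                + "\u00C1".equals(misdecodeThenRepair(wire, latin1)));
        System.out.println("windows-1252 round trip intact: "
                + "\u00C1".equals(misdecodeThenRepair(wire, cp1252)));
        System.out.println(decodeFormValue("caf%C3%A9+au+lait", StandardCharsets.UTF_8));
    }
}
```

In Java, ISO-8859-1 maps every byte value to the code point of the same number, so the first round trip is lossless, while the windows-1252 variant loses the 0x81 octet under the assumption stated above. `decodeFormValue` sidesteps the issue by never passing the octets through an intermediate charset; reading the raw request body and splitting it into name/value pairs is left out of the sketch.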