- From: Tex Texin <texin@progress.com>
- Date: Mon, 22 Oct 2001 12:30:37 -0400
- To: Paul Deuter <Paul.Deuter@plumtree.com>
- CC: Yves Arrouye <yves@realnames.com>, Shigemichi Yazawa <yazawa@globalsight.com>, www-international@w3.org, souravm <souravm@infy.com>
Kano's book is outdated.
In 1252, 0x80 is the Euro sign.
0x8E and 0x9E are also assigned now, to the capital and small Z with Caron (or Hacek).
See:
http://www.microsoft.com/globaldev/reference/sbcs/1252.htm
This doesn't detract from your other points.
The fact that Microsoft code pages are moving targets is a related
issue.
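
A quick way to check what a given runtime does with these octets is to decode them one at a time. The following is only a sketch: the class name Cp1252Holes and the helper decode are made up for illustration, and it assumes the JDK in use ships a "windows-1252" charset. On such a JDK I would expect 0x80, 0x8E and 0x9E to come back as U+20AC, U+017D and U+017E, and the five still-unassigned octets to come back as U+FFFD, but other converters (ICU, Windows itself) may map the unassigned ones differently.

```java
import java.nio.charset.Charset;

public class Cp1252Holes {
    // Decode a single octet with the named charset and return the resulting code point.
    static int decode(String charsetName, int octet) {
        String s = new String(new byte[] { (byte) octet }, Charset.forName(charsetName));
        return s.codePointAt(0);
    }

    public static void main(String[] args) {
        // The eight octets listed as invalid in the quoted message below.
        int[] octets = { 0x80, 0x81, 0x8D, 0x8E, 0x8F, 0x90, 0x9D, 0x9E };
        for (int octet : octets) {
            // Assumption: "windows-1252" is available on this JDK.
            System.out.printf("0x%02X -> U+%04X%n", octet, decode("windows-1252", octet));
        }
    }
}
```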
tex
Paul Deuter wrote:
>
> We have been struggling with this same problem here too. Let me
> clarify a couple of points, however:
>
> 1. Both encodings CP1252 and 8859-1 have "holes". For 8859-1
> the range 80-9F is invalid. For CP1252, the values 80, 81, 8D,
> 8E, 8F, 90, 9D, and 9E are invalid (according to Kano's book).
>
> 2. Roundtripping with Unicode works with both CP1252 and 8859-1
> because all the valid characters of both these encodings are also
> in Unicode. If you start with a valid character in CP1252, you can
> roundtrip that character (i.e. convert it to Unicode and then back to
> CP1252) without loss of data.
>
> 3. The Servlet 2.3 spec has new features that allow a programmer
> to set character set on requests and responses. But there appears
> to be no way to do this reliably with earlier versions.
>
> The response.setContentType method will set the HTTP header and also
> cause Unicode strings to be converted as they are sent to the
> browser. This is fine for outputting text data to the browser, but
> there is no reliable way to convert characters that are sent to the
> server in the request. The often-suggested method for converting
> characters in the request is to use a line of code that looks like this:
>
> String strParam = new
> String(request.getParameter("SomeName").getBytes("8859_1"), "UTF8");
>
> What this code is supposed to do is "undo" the improper default
> conversion that occurs in getParameter. Supposedly, by calling getBytes
> with 8859_1, you will convert your Unicode back into bytes and then
> re-interpret those bytes correctly, in this example, as UTF-8.
>
> This is where the "holes" become problematic. If your incoming request
> data really is UTF-8, then there may be octets whose values are in the
> invalid range for 8859-1 or CP1252. If that is the case, then the
> improper conversion in getParameter will cause these octets to be mapped
> to 0xFFFD, and the subsequent getBytes will in turn convert the 0xFFFD
> to 3F. The value of the original octet will be lost forever.
> Furthermore, the input stream is completely consumed by the request
> object, so the original data is unavailable for further processing.
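
The loss described above can be reproduced outside a servlet container. This is a sketch under assumptions, not a statement about any particular container: the class DoubleConversionDemo and the method undo are hypothetical, and the first decode stands in for whatever default conversion getParameter applies. It takes the UTF-8 octets C3 81 (U+00C1) and runs the decode, getBytes, re-decode sequence with two different assumed charsets.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class DoubleConversionDemo {
    // Simulates the workaround: the octets are first decoded with the wrong
    // charset `assumed`, then re-encoded with `assumed` and decoded as UTF-8.
    static String undo(byte[] utf8Octets, Charset assumed) {
        String wronglyDecoded = new String(utf8Octets, assumed); // stand-in for getParameter
        byte[] recovered = wronglyDecoded.getBytes(assumed);     // the getBytes(...) step
        return new String(recovered, StandardCharsets.UTF_8);    // re-interpret as UTF-8
    }

    public static void main(String[] args) {
        String original = "\u00C1"; // encodes to the UTF-8 octets C3 81
        byte[] octets = original.getBytes(StandardCharsets.UTF_8);

        // Java's ISO-8859-1 converter maps every octet to a code point.
        System.out.println(original.equals(undo(octets, StandardCharsets.ISO_8859_1)));

        // Assumption: "windows-1252" is available and leaves 0x81 unassigned,
        // so it decodes to U+FFFD and is re-encoded as 3F.
        System.out.println(original.equals(undo(octets, Charset.forName("windows-1252"))));
    }
}
```

Whether the round trip survives therefore depends on how the converter in use treats the octets that the charset leaves unassigned, which is worth testing on the actual runtime rather than taking from a table.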
>
> It seems that the only solution for pre-2.3 Servlet code is to mark the
> form data as binary so that the request object will not try to read it
> at all. Then write your own code to read the form data: convert all the
> %HH values to octets, interpret the stream as the appropriate character
> set, and then parse out the form data.
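
The do-it-yourself decoding described above might look roughly like this. It is a minimal sketch with hypothetical names (FormDecoder, decodeToken, parse), it assumes the raw form body has already been read into an ASCII string, and it does no error handling for malformed %HH sequences. One deliberate difference in ordering: it splits on "&" and "=" before percent-decoding, so that an encoded "&" or "=" inside a value is not mistaken for a delimiter.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

public class FormDecoder {
    // Turn one URL-encoded token into raw octets, then decode them with `charset`.
    static String decodeToken(String token, Charset charset) {
        ByteArrayOutputStream octets = new ByteArrayOutputStream();
        for (int i = 0; i < token.length(); i++) {
            char c = token.charAt(i);
            if (c == '%' && i + 2 < token.length()) {
                // %HH becomes a single octet; parseInt throws on non-hex digits.
                octets.write(Integer.parseInt(token.substring(i + 1, i + 3), 16));
                i += 2;
            } else if (c == '+') {
                octets.write(' ');
            } else {
                octets.write(c); // assumes the remaining characters are plain ASCII
            }
        }
        return new String(octets.toByteArray(), charset);
    }

    // Split the body into name/value pairs first, then decode each part.
    static Map<String, String> parse(String body, Charset charset) {
        Map<String, String> params = new LinkedHashMap<>();
        for (String pair : body.split("&")) {
            if (pair.isEmpty()) continue;
            int eq = pair.indexOf('=');
            String name = eq < 0 ? pair : pair.substring(0, eq);
            String value = eq < 0 ? "" : pair.substring(eq + 1);
            params.put(decodeToken(name, charset), decodeToken(value, charset));
        }
        return params;
    }

    public static void main(String[] args) {
        Map<String, String> params = parse("SomeName=%C3%81+b%26c", StandardCharsets.UTF_8);
        System.out.println(params.get("SomeName"));
    }
}
```

The charset is passed in explicitly because the request itself usually does not say which one the browser used; the caller has to know it from the page that served the form.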
>
> If anyone knows a better solution, I would be very glad to hear it. We
> cannot depend upon the Servlet 2.3 version yet because it is too new and
> not widely installed.
>
> -Paul
>
> Paul Deuter
> Internationalization Manager
> Plumtree Software
> paul.deuter@plumtree.com
>
>
> -----Original Message-----
> From: Yves Arrouye [mailto:yves@realnames.com]
> Sent: Monday, October 22, 2001 12:11 AM
> To: 'Shigemichi Yazawa'; www-international@w3.org
> Subject: RE: Servlet question
>
> > Yes, two wrong conversions make a right result. However, Cp1252
> > doesn't always work this way. Cp1252 <-> Unicode mapping table
> > includes 5 undefined entries. If you pass 0x81, for example, to byte
> > to char converter, it is converted to U+fffd (REPLACEMENT CHARACTER)
> > and the round trip is not possible. Only ISO-8859-1 is the safe, round
> > trippable encoding as far as I know.
>
> Isn't ISO-8859-1 actually the one that has "holes" in C0/C1 that exhibit
> this very behavior? I thought that was the case, and windows-1252 was
> the one that used C1 for platform-specific characters (see
> http://www-124.ibm.com/cvs/icu/charset/data/xml/windows-1252-2000.xml?rev=1.1&content-type=text/x-cvsweb-markup
> where apparently U+0081 is mapped to 0x81 in windows-1252).
>
> YA
--
-------------------------------------------------------------
Tex Texin Director, International Business
mailto:Texin@Progress.com Tel: +1-781-280-4271
the Progress Company Fax: +1-781-280-4655
-------------------------------------------------------------
Received on Monday, 22 October 2001 12:30:43 UTC