Re: Servlet question from Tex Texin on 2001-10-22 (www-international@w3.org from October to December 2001)

From: Tex Texin <texin@progress.com>
Date: Mon, 22 Oct 2001 12:30:37 -0400
To: Paul Deuter <Paul.Deuter@plumtree.com>
CC: Yves Arrouye <yves@realnames.com>, Shigemichi Yazawa <yazawa@globalsight.com>, www-international@w3.org, souravm <souravm@infy.com>
Message-ID: <3BD449AD.5B47375D@progress.com>
Kano's book is outdated.

In 1252, 80 is the Euro.
8E and 9E are also assigned now to the Z with Caron (or Hacek).
See:
http://www.microsoft.com/globaldev/reference/sbcs/1252.htm
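
For anyone who wants to check the current assignments on their own JVM rather than against a book, here is a minimal sketch. It assumes the JVM ships a converter under the name "windows-1252"; the class name Cp1252Probe is made up for illustration.

```java
import java.nio.charset.Charset;

class Cp1252Probe {
    // Decode a single octet with the JVM's windows-1252 converter
    // and return the resulting UTF-16 code unit.
    static char decode(int octet) {
        Charset cp1252 = Charset.forName("windows-1252");
        return new String(new byte[] {(byte) octet}, cp1252).charAt(0);
    }

    public static void main(String[] args) {
        // The three octets discussed above: Euro and the two Z-with-caron forms.
        for (int octet : new int[] {0x80, 0x8E, 0x9E}) {
            System.out.printf("0x%02X -> U+%04X%n", octet, (int) decode(octet));
        }
    }
}
```

An older converter table would report U+FFFD for these octets, which is one way the "moving target" shows up in practice.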

This doesn't detract from your other points. 

The fact that Microsoft code pages are moving targets is a related
issue.

tex

Paul Deuter wrote:
> 
> We have been struggling with this same problem here too.  Let me
> clarify a couple of points, however:
> 
> 1.  Both encodings CP1252 and 8859-1 have "holes".  For 8859-1
> the range 80-9F is invalid.  For CP1252, the values 80, 81, 8D,
> 8E, 8F, 90, 9D, and 9E are invalid (according to Kano's book).
> 
> 2.  Roundtripping with Unicode works with both CP1252 and 8859-1
> because all the valid characters of both these encodings are also
> in Unicode.  If you start with a valid character in CP1252, you can
> roundtrip that character (i.e. convert it to Unicode and then back to
> CP1252) without loss of data.
> 
> 3.  The Servlet 2.3 spec has new features that allow a programmer
> to set the character set on requests and responses.  But there appears
> to be no way to do this reliably with earlier versions.
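
The claims in points 1 and 2 can be measured instead of looked up. The sketch below pushes every octet value through bytes -> String -> bytes and lists the ones that do not come back unchanged. The class and method names are hypothetical, and the result depends on the converter tables of the JVM it runs on.

```java
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.List;

class RoundTripCheck {
    // Return every octet value that does not survive a
    // bytes -> String -> bytes round trip in the named charset.
    static List<Integer> failing(String charsetName) {
        Charset cs = Charset.forName(charsetName);
        List<Integer> bad = new ArrayList<>();
        for (int octet = 0; octet < 256; octet++) {
            byte[] in = {(byte) octet};
            byte[] out = new String(in, cs).getBytes(cs);
            if (out.length != 1 || out[0] != in[0]) {
                bad.add(octet);
            }
        }
        return bad;
    }

    public static void main(String[] args) {
        for (String name : new String[] {"ISO-8859-1", "windows-1252"}) {
            System.out.println(name + " lossy octets: " + failing(name));
        }
    }
}
```

Any octet that appears in the list is a "hole" as far as that converter is concerned, whatever a given reference says.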
> 
> The response.setContentType method will set the HTTP header and also
> cause Unicode strings to be converted as they are sent to the
> browser - this is fine for outputting text data to the browser.  But
> there is no reliable way to convert characters that
> are sent to the server in the request.  The often suggested method
> for converting characters in the request is to use a line of code
> that looks like this:
> 
> String strParam = new String(
>     request.getParameter("SomeName").getBytes("8859_1"), "UTF8");
> 
> What this code is supposed to do is "undo" the improper default
> conversion that occurs in getParameter.  Supposedly, by calling getBytes
> with 8859_1, you will convert your Unicode back into bytes and then
> re-interpret those bytes correctly, in this example, as UTF-8.
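
The re-interpretation trick can be tried outside a servlet container. This is only a sketch: containerDecode is a hypothetical stand-in for what getParameter does, and it assumes the container's default conversion is ISO-8859-1.

```java
import java.nio.charset.StandardCharsets;

class ReinterpretDemo {
    // Hypothetical stand-in for the container: decode the raw request
    // octets with the (assumed) default of ISO-8859-1.
    static String containerDecode(byte[] rawOctets) {
        return new String(rawOctets, StandardCharsets.ISO_8859_1);
    }

    // The workaround: turn the mis-decoded string back into octets,
    // then read those octets as UTF-8.
    static String reinterpretAsUtf8(String misdecoded) {
        byte[] octets = misdecoded.getBytes(StandardCharsets.ISO_8859_1);
        return new String(octets, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "caf\u00E9 \u20AC"; // e-acute and the Euro sign
        byte[] onTheWire = original.getBytes(StandardCharsets.UTF_8);
        String repaired = reinterpretAsUtf8(containerDecode(onTheWire));
        System.out.println(repaired.equals(original));
    }
}
```

The trick only holds when the first, wrong conversion loses nothing, so it stands or falls with the container's default charset.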
> 
> This is where the "holes" become problematic.  If your incoming request
> data really is UTF-8, then there may be octets whose values are in the
> invalid range for 8859-1 or CP1252.  If that is the case, then the
> improper conversion in getParameter will cause these octets to be mapped
> to 0xFFFD, and the subsequent getBytes will in turn convert the 0xFFFD
> to 3F.  The value of the original octet will be lost forever.
> Furthermore, the input stream is completely consumed by the request
> object, so the original data is unavailable for further processing.
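
The failure is easy to reproduce with a single character. U+00C1 is C3 81 in UTF-8, and 0x81 is one of the octets a windows-1252 converter may leave unassigned. In this sketch the container's default charset is passed in as a parameter, which is an assumption made for illustration. In the JVMs I am aware of, the ISO-8859-1 converter maps 80-9F straight to U+0080-U+009F, so the loss appears with windows-1252 but not with ISO-8859-1.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class LossyDefaultDemo {
    // Decode with the container's assumed default, then apply the
    // getBytes / new String workaround using that same charset.
    static String decodeThenReinterpret(byte[] rawOctets, Charset containerDefault) {
        String misdecoded = new String(rawOctets, containerDefault);
        return new String(misdecoded.getBytes(containerDefault), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "\u00C1"; // UTF-8 octets: C3 81
        byte[] wire = original.getBytes(StandardCharsets.UTF_8);
        for (String name : new String[] {"ISO-8859-1", "windows-1252"}) {
            String result = decodeThenReinterpret(wire, Charset.forName(name));
            System.out.println(name + " recovers the original: " + result.equals(original));
        }
    }
}
```

Once the octet has been replaced, nothing in the resulting string records what it was.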
> 
> It seems that the only solution for pre-2.3 Servlet code is to mark the
> form data as binary so that the request object will not try to read it
> at all.  Then write your own code to read the form data: convert all the
> %HH values to octets, interpret the stream as the appropriate character
> set, and then parse out the form data.
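
A rough sketch of that do-it-yourself parsing, assuming the raw body has already been read as an ASCII string in application/x-www-form-urlencoded form. The class FormBodyParser and its methods are hypothetical. It splits on '&' and '=' before unescaping, so escaped delimiters inside a value stay intact; it does not validate malformed escapes, and a repeated name overwrites the earlier value.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

class FormBodyParser {
    // Turn one urlencoded token into raw octets: %HH -> octet, '+' -> space.
    // Unescaped characters are assumed to be ASCII.
    static byte[] toOctets(String token) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < token.length(); i++) {
            char c = token.charAt(i);
            if (c == '%' && i + 2 < token.length() + 1) {
                out.write(Integer.parseInt(token.substring(i + 1, i + 3), 16));
                i += 2;
            } else if (c == '+') {
                out.write(' ');
            } else {
                out.write((byte) c);
            }
        }
        return out.toByteArray();
    }

    // Split the body into name/value pairs, then interpret the octets
    // of each part in the charset the page is known to use.
    static Map<String, String> parse(String body, Charset cs) {
        Map<String, String> params = new LinkedHashMap<>();
        if (body.isEmpty()) {
            return params;
        }
        for (String pair : body.split("&")) {
            int eq = pair.indexOf('=');
            String name = eq < 0 ? pair : pair.substring(0, eq);
            String value = eq < 0 ? "" : pair.substring(eq + 1);
            params.put(new String(toOctets(name), cs), new String(toOctets(value), cs));
        }
        return params;
    }

    public static void main(String[] args) {
        Map<String, String> p = parse("SomeName=%C3%81+b&x=%E2%82%AC", StandardCharsets.UTF_8);
        System.out.println(p.get("SomeName").equals("\u00C1 b"));
    }
}
```

Because the octets never pass through a wrong intermediate charset, no "hole" can swallow them.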
> 
> If anyone knows a better solution, I would be very glad to hear it.  We
> cannot depend upon the Servlet 2.3 version yet because it is too new and
> not widely installed.
> 
> -Paul
> 
> Paul Deuter
> Internationalization Manager
> Plumtree Software
> paul.deuter@plumtree.com
> 
> 
> -----Original Message-----
> From: Yves Arrouye [mailto:yves@realnames.com]
> Sent: Monday, October 22, 2001 12:11 AM
> To: 'Shigemichi Yazawa'; www-international@w3.org
> Subject: RE: Servlet question
> 
> > Yes, two wrong conversions make a right result. However, Cp1252
> > doesn't always work this way. Cp1252 <-> Unicode mapping table
> > includes 5 undefined entries. If you pass 0x81, for example, to byte
> > to char converter, it is converted to U+fffd (REPLACEMENT CHARACTER)
> > and the round trip is not possible. Only ISO-8859-1 is the safe, round
> > trippable encoding as far as I know.
> 
> Isn't ISO-8859-1 actually the one that has "holes" in C0/C1 that exhibit
> this very behavior? I thought that was the case, and windows-1252 was
> the one that used C1 for platform-specific characters (see
> http://www-124.ibm.com/cvs/icu/charset/data/xml/windows-1252-2000.xml?rev=1.1&content-type=text/x-cvsweb-markup
> where apparently U+0081 is mapped to 0x81 in windows-1252).
> 
> YA

-- 
-------------------------------------------------------------
Tex Texin                    Director, International Business
mailto:Texin@Progress.com    Tel: +1-781-280-4271
the Progress Company         Fax: +1-781-280-4655
-------------------------------------------------------------
Received on Monday, 22 October 2001 12:30:43 UTC