RE: Servlet question

We have been struggling with this same problem here too.  Let me
clarify a couple of points, however:

1.  Both encodings CP1252 and 8859-1 have "holes".  For 8859-1
the range 80-9F is invalid.  For CP1252, the values 80, 81, 8D,
8E, 8F, 90, 9D, and 9E are invalid (according to Kano's book).

2.  Roundtripping with Unicode works with both CP1252 and 8859-1 
because all the valid characters of both these encodings are also 
in Unicode.  If you start with a valid character in CP1252, you can
roundtrip that character (i.e. convert it to Unicode and then back to
CP1252) without loss of data.
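A quick standalone sanity check of this round trip (pure Java, no
servlet API; "windows-1252" is the canonical Java name for CP1252,
and 0x93 is the left double quotation mark in that encoding):

```java
import java.nio.charset.Charset;

public class RoundTrip {
    public static void main(String[] args) {
        // 0x93 is a valid CP1252 byte: the left double quotation mark.
        byte[] original = { (byte) 0x93 };

        // To Unicode: 0x93 maps to U+201C.
        String unicode = new String(original, Charset.forName("windows-1252"));
        System.out.println(Integer.toHexString(unicode.charAt(0))); // 201c

        // And back: the original byte value is recovered intact.
        byte[] back = unicode.getBytes(Charset.forName("windows-1252"));
        System.out.println(Integer.toHexString(back[0] & 0xFF)); // 93
    }
}
```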

3.  The Servlet 2.3 spec has new features that allow a programmer
to set character set on requests and responses.  But there appears
to be no way to do this reliably with earlier versions.
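For what it's worth, the 2.3-style call looks roughly like this (a
sketch against the Servlet 2.3 API, not tested in a container; the
key point is that setCharacterEncoding must be called before any
parameter is read):

```java
import java.io.IOException;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

public class FormServlet extends HttpServlet {
    protected void doPost(HttpServletRequest request,
                          HttpServletResponse response)
            throws ServletException, IOException {
        // New in Servlet 2.3: tell the container how the request
        // body is encoded BEFORE calling getParameter.
        request.setCharacterEncoding("UTF-8");
        String value = request.getParameter("SomeName"); // decoded as UTF-8

        // Outbound direction has always worked via setContentType.
        response.setContentType("text/html; charset=UTF-8");
        response.getWriter().println(value);
    }
}
```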

The response.setContentType method will set the HTTP header and also
cause Unicode strings to be converted as they are sent to the
browser - this is fine for outputting text data to the browser.  But 
there is no reliable way to convert characters that
are sent to the server in the request.  The often suggested method
for converting characters in the request is to use a line of code
that looks like this:

String strParam = new String(
    request.getParameter("SomeName").getBytes("8859_1"), "UTF8");

What this code is supposed to do is "undo" the improper default
conversion that occurs in getParameter.  Supposedly by calling
getBytes with 8859_1, you convert the Unicode back into bytes and
then re-interpret those bytes correctly - in this example, as UTF-8.
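Here is the trick in isolation, in the benign case where it does
work (pure Java, no servlet API; the two literal bytes are the UTF-8
encoding of e-acute, and both happen to be valid in 8859-1):

```java
import java.nio.charset.Charset;

public class UndoConversion {
    public static void main(String[] args) {
        // UTF-8 bytes for U+00E9 (e with acute accent): 0xC3 0xA9.
        byte[] utf8Bytes = { (byte) 0xC3, (byte) 0xA9 };

        // What getParameter effectively does: decode as 8859-1,
        // yielding the two characters "\u00C3\u00A9".
        String garbled = new String(utf8Bytes, Charset.forName("ISO-8859-1"));

        // The "undo": recover the original bytes, then decode as UTF-8.
        String fixed = new String(
            garbled.getBytes(Charset.forName("ISO-8859-1")),
            Charset.forName("UTF-8"));
        System.out.println(fixed); // é
    }
}
```

This only works because 0xC3 and 0xA9 both survive the 8859-1 round
trip; the next paragraph is about the octets that do not.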

This is where the "holes" become problematic.  If your incoming
request data really is UTF-8, then there may be octets whose values
are in the invalid range for 8859-1 or CP1252.  In that case, the
improper conversion in getParameter will map these octets to U+FFFD,
and the subsequent getBytes will in turn convert the U+FFFD to 0x3F.
The value of the original octet is lost forever.  Furthermore, the
input stream is completely consumed by the request object, so the
original data is unavailable for further processing.
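The second half of that loss is easy to demonstrate.  Exactly which
octets a given converter maps to U+FFFD varies by implementation, so
this sketch simply starts from the point where the damage has already
been done - the container has handed you a U+FFFD - and shows that
getBytes then substitutes 0x3F:

```java
import java.nio.charset.Charset;

public class LostOctet {
    public static void main(String[] args) {
        // Suppose the container has already (improperly) decoded an
        // octet in one of the "holes" to U+FFFD, as described above.
        String damaged = "\uFFFD";

        // U+FFFD has no encoding in 8859-1, so the encoder
        // substitutes '?' (0x3F).  The original octet is gone.
        byte[] bytes = damaged.getBytes(Charset.forName("ISO-8859-1"));
        System.out.println(Integer.toHexString(bytes[0] & 0xFF)); // 3f
    }
}
```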

It seems that the only solution for pre-2.3 Servlet code is to mark
the form data as binary so that the request object will not try to
read it at all.  Then write your own code to read the form data:
convert all the %HH values to octets, interpret the resulting octet
stream as the appropriate character set, and parse out the form
fields.
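A rough sketch of the decoding step (plain Java; the class and
method names are mine, and real code would first split the raw body
on '&' and '=' before decoding each token this way):

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;

public class FormDecoder {
    // Decode one x-www-form-urlencoded token: turn %HH escapes into
    // raw octets FIRST, and only then interpret the whole octet
    // stream in the desired charset (e.g. UTF-8).
    public static String decode(String token, String charset) {
        ByteArrayOutputStream octets = new ByteArrayOutputStream();
        for (int i = 0; i < token.length(); i++) {
            char c = token.charAt(i);
            if (c == '%' && i + 2 < token.length()) {
                octets.write(Integer.parseInt(token.substring(i + 1, i + 3), 16));
                i += 2;
            } else if (c == '+') {
                octets.write(' '); // '+' means space in form data
            } else {
                octets.write(c);
            }
        }
        return new String(octets.toByteArray(), Charset.forName(charset));
    }

    public static void main(String[] args) {
        // %C3%A9 is the UTF-8 encoding of e-acute.
        System.out.println(decode("caf%C3%A9", "UTF-8")); // café
    }
}
```

The crucial difference from getParameter is that the octets never
pass through an 8859-1 (or CP1252) decoder at all, so the "holes"
never come into play.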

If anyone knows a better solution, I would be very glad to hear it.
We cannot depend upon the Servlet 2.3 version yet because it is too
new and not widely installed.

-Paul

Paul Deuter
Internationalization Manager
Plumtree Software
paul.deuter@plumtree.com 
 


-----Original Message-----
From: Yves Arrouye [mailto:yves@realnames.com]
Sent: Monday, October 22, 2001 12:11 AM
To: 'Shigemichi Yazawa'; www-international@w3.org
Subject: RE: Servlet question


> Yes, two wrong conversions make a right result, However, Cp1252
> doesn't always work this way. Cp1252 <-> Unicode mapping table
> includes 5 undefined entries. If you pass 0x81, for example, to byte
> to char converter, it is converted to U+fffd (REPLACEMENT CHARACTER)
> and the round trip is not possible. Only ISO-8859-1 is the safe, round
> trippable encoding as far as I know.

Isn't ISO-8859-1 actually the one that has "holes" in C0/C1 that exhibit
this very behavior? I thought that was the case, and windows-1252 was
the one that used C1 for platform-specific characters (see
http://www-124.ibm.com/cvs/icu/charset/data/xml/windows-1252-2000.xml?rev=1.1&content-type=text/x-cvsweb-markup
where apparently U+0081 is mapped to 0x81 in windows-1252).

YA

Received on Monday, 22 October 2001 11:25:41 UTC