RE: JSP containers and default charset (was: Re DefaultCharset considered harmful)

Hi Jungshik,

I think you're forgetting what happens to a JSP page. You can't just have
the JSP container omit to mention the charset in the HTTP header because the
encoding is important in the lifecycle of a JSP page: JSP pages are compiled
into servlets. That is, they are turned into a small Java program that
generates the HTML. The HTML content in a JSP page just becomes a series of
Java String objects inside the servlet. The Strings must be converted back
to bytes using some encoding (since Strings are UTF-16 based Java objects).
You set the encoding used for this conversion with the page contentType
directive.

So you see: if you don't tell JSP container's servlet generate what encoding
you want, it has to make some assumption.

There is another part to this process as well: you have to generate the
servlet (i.e. those String objects) to begin with. Parsing the HTML to look
for a META tag is not a really reliable solution. So unless the file is
entirely ASCII to begin with, you have to tell the JSP container how to read
the bytes in the file to generate the servlet. This is what the pageEncoding
directive does.

One could argue with the choice of Latin-1 as the JSP default, but any other
encoding, including UTF-8, wouldn't really solve the problem: people want to
use legacy encodings from time to time as well. Setting the page encoding
and content type should just be automatically part of what a JSP developer
does in every page.

Best Regards,

Addison

Addison P. Phillips
Director, Globalization Architecture
webMethods | Delivering Global Business Visibility
http://www.webMethods.com
Chair, W3C Internationalization (I18N) Working Group
Chair, W3C-I18N-WG, Web Services Task Force
http://www.w3.org/International

Internationalization is an architecture.
It is not a feature.

> -----Original Message-----
> From: public-i18n-geo-request@w3.org
> [mailto:public-i18n-geo-request@w3.org]On Behalf Of Jungshik Shin
> Sent: dimanche 30 novembre 2003 08:49
> To: Martin Duerst
> Cc: public-i18n-geo@w3.org
> Subject: JSP containers and default charset (was: Re DefaultCharset
> considered harmful)
>
>
>
> On Thu, 25 Sep 2003, Martin Duerst wrote:
>
> > Regarding the configuration problems with Apache, I
> > think the main culprit is the configuration file httpd.conf,
> > as shipped with the distribution. This contains:
> >
> > # Specify a default charset for all pages sent out. This is
> > # always a good idea and opens the door for future internationalisation
>
> Recently I found that Apache-tomcat always adds 'charset=ISO-8859-1'
> (to virtually all Content-Type headers whether textual or not) unless
> it's explicitly overriden in the JSP with either of the following lines.
>
> <%@ page pageEncoding="CHARSET" %>
>
> <%@ page contentType="CONTENT-TYPE; charset=CHARSET" %>
>
> So, including the following line in a JSP without either of the above
> leads to a conflict:
>
> <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
>
>
> It seems like the JSP specification mandates this behavior so that it's
> not just Jakarta Tomcat. I'm not sure whether this is good or bad.
> Certainly JSP offers a  way to specify the character encoding of
> individual pages (it took me a while to find that out [1]), but
> someone who
> doesn't know that (and who believes that adding meta tag would work)
> may be taken by surprise.
>
> Do we have to consider asking those in charge of the JSP specification
> to change it so that by default NO charset parameter is added?
>
>
> Jungshik
>
> [1] I should have turned to
> http://www.w3.org/International/O-HTTP-charset.html
> BTW, the second method of setting the page encoding in JSP is not
> mentioned
> in the above tip.
>
> <%@ page pageEncoding="CHARSET" %>

Received on Sunday, 30 November 2003 13:01:36 UTC