JSP page directive contentType overridden by Apache Tomcat?

Hi,

I've been wrestling with a mysterious problem for the last few hours. I
made a patch to the web search front-end of 'Nutch' (http://www.nutch.org,
an open source search engine that strives to be an open source Google [1])
so that query strings containing characters outside the ISO-8859-1
character repertoire work.
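
To make the failure mode concrete, here is a minimal sketch in plain Java.
It is not code from Nutch or Tomcat; the class name QueryDecodeDemo and the
sample value are only illustrative. It decodes the same percent-encoded
query value once as UTF-8 and once as ISO-8859-1, which is roughly what
happens when the browser sends UTF-8 and the server assumes ISO-8859-1.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

class QueryDecodeDemo {
    // Decode a percent-encoded query value using the named charset.
    // The checked exception is wrapped so callers need no throws clause.
    static String decode(String rawValue, String charsetName) {
        try {
            return URLDecoder.decode(rawValue, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charsetName, e);
        }
    }

    public static void main(String[] args) {
        // "%ED%95%9C" is the UTF-8 percent-encoding of U+D55C (one Hangul syllable).
        String raw = "%ED%95%9C";
        String asUtf8 = decode(raw, "UTF-8");
        String asLatin1 = decode(raw, "ISO-8859-1");
        System.out.println(asUtf8.length());   // 1: the intended character
        System.out.println(asLatin1.length()); // 3: one character per byte
    }
}
```

With the wrong charset, each of the three bytes becomes its own character,
so a search for the original term presumably cannot match.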

I followed the standard steps of adding contentType and pageEncoding
directives at the beginning of the JSP files. I also added
request.setCharacterEncoding("UTF-8"); and made sure that it is honored,
because recent versions of Apache Tomcat ignore it for GET by default.
I expected everything to work. To my great surprise, all the JSP files
with the 'contentType="text/html; charset=UTF-8"' directive still emit
'Content-Type: text/html; charset=ISO-8859-1' in the HTTP header. Even
more surprising, the cached Java source files translated from those JSP
files contain the following line:

response.setContentType("text/html; charset=UTF-8");

It's completely beyond me how I keep getting 'text/html;
charset=ISO-8859-1' despite that.
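
For anyone who wants to check which charset a response actually declares,
here is a small sketch in plain Java. The class ContentTypeCharset and the
method charsetOf are hypothetical helpers written for this post, not part
of the Servlet API or Tomcat. It pulls the charset parameter out of a
Content-Type header value.

```java
import java.util.Locale;

class ContentTypeCharset {
    // Return the charset parameter of a Content-Type value, or null if none.
    static String charsetOf(String contentType) {
        if (contentType == null) {
            return null;
        }
        // parts[0] is the media type; the rest are parameters.
        String[] parts = contentType.split(";");
        for (int i = 1; i < parts.length; i++) {
            String param = parts[i].trim();
            if (param.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String value = param.substring("charset=".length()).trim();
                // Strip optional surrounding double quotes.
                if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                    value = value.substring(1, value.length() - 1);
                }
                return value;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // The value reported above versus the one the directive asks for.
        System.out.println(charsetOf("text/html; charset=ISO-8859-1")); // ISO-8859-1
        System.out.println(charsetOf("text/html; charset=UTF-8"));      // UTF-8
    }
}
```

The header value itself could be obtained with, for example,
java.net.URLConnection.getContentType() against the running server.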

You can try it at http://pippin.kaist.ac.kr:8080. I ran the Nutch crawler
to fetch a small number (about 4000) of pages in several different scripts
(if you give '1234' as a query, you'll get 4 hits). The search result page
(handled by search.jsp) is supposed to be in UTF-8, with the correct
Content-Type header emitted in the HTTP response.

Has anyone else been bitten by this bizarre problem? It'd be great
to know how it was solved.

Thank you tons in advance,

Jungshik


[1] Needless to say, there are a number of things to improve in I18N as 
well as in other aspects before Nutch can compete with Google.

Received on Friday, 16 July 2004 17:25:17 UTC