- From: Jungshik Shin <jshin@i18nl10n.com>
- Date: Sat, 17 Jul 2004 06:25:13 +0900
- To: www-international@w3.org
Hi,

I've been wrestling with a mysterious problem for the last few hours. I made a patch to the web search front-end of 'Nutch' (http://www.nutch.org, an open source search engine that strives to be an open source Google [1]) so that query strings containing characters outside the ISO-8859-1 repertoire work.

I followed the standard steps of adding contentType and pageEncoding directives at the beginning of the JSP files. I also added request.setCharacterEncoding("UTF-8"); and made sure that it is honored, because recent versions of Apache Tomcat ignore it for GET requests by default. With that, I expected everything to work.

To my great surprise, all the JSP files with the 'contentType="text/html; charset=UTF-8"' directive still emit 'Content-Type:text/html; charset=ISO-8859-1' in the HTTP header. Even more surprising, the cached Java source files translated from those JSP files contain the following line:

response.setContentType("text/html; charset=UTF-8");

It's completely beyond me how I've been getting 'text/html; charset=ISO-8859-1' despite that.

You can try it at http://pippin.kaist.ac.kr:8080. I ran the Nutch crawler to fetch a small number (about 4000) of pages in several different scripts (if you give '1234' as a query, you'll get 4 hits). The search result page (handled by search.jsp) is supposed to be in UTF-8, with the correct Content-Type emitted in the HTTP header.

Has anyone else been bitten by this bizarre problem? It'd be great to know how it was solved.

Thank you tons in advance,

Jungshik

[1] Needless to say, there are a number of things to improve in I18N as well as in other aspects before Nutch can compete with Google.
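
For reference, a minimal sketch of the kind of JSP setup described in this message. It is not the actual search.jsp from Nutch, and the parameter name "query" is an assumption made only for illustration. The fragment shows the page directive carrying contentType and pageEncoding, followed by the request.setCharacterEncoding("UTF-8") call, which has to run before any request parameter is read.

```jsp
<%-- Sketch only: not the actual Nutch search.jsp. --%>
<%-- Declare both the response Content-Type and the encoding of this source file. --%>
<%@ page contentType="text/html; charset=UTF-8" pageEncoding="UTF-8" %>
<%
  // Must be called before the first getParameter() call, or it has no effect.
  request.setCharacterEncoding("UTF-8");

  // "query" is a hypothetical parameter name used for illustration.
  String query = request.getParameter("query");
%>
```

Note that in Tomcat this call alone may not affect GET query strings; the HTTP Connector's useBodyEncodingForURI or URIEncoding attribute is one way to make it apply, though the message does not say which mechanism was used here.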
Received on Friday, 16 July 2004 17:25:17 UTC