RE: URL-encode international characters in Java?

From: Chris Wendt <christw@MICROSOFT.com>
Date: Thu, 6 Jul 2000 15:10:20 -0700
Message-ID: <C58E83D08FE21041A2E65B0F4051B9DB3AE08A@RED-MSG-19.redmond.corp.microsoft.com>
To: "'Vinod Balakrishnan'" <vinod@filemaker.com>, Lenny Turetsky <LTuretsky@salesforce.com>, "'www-international@w3c.org'" <www-international@w3c.org>, "'servlet-interest@java.sun.com'" <servlet-interest@java.sun.com>
URL encoding encodes bytes, not characters. The character encoding is a
separate, independent layer.
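
To illustrate the two layers, here is a minimal Java sketch: the string is first turned into bytes in a chosen charset, and only then are the bytes percent-escaped. The class name, the method name, and the choice of which bytes to leave unescaped (letters, digits and "-._~") are assumptions made for this example, not something taken from java.net.URLEncoder or any servlet API.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class PercentEncodeSketch {
    // Assumption for this sketch: only these ASCII characters stay unescaped.
    private static final String UNRESERVED =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~";

    // Layer 1: characters -> bytes (depends on the charset).
    // Layer 2: bytes -> %XX escapes (independent of the charset).
    static String percentEncode(String text, Charset charset) {
        StringBuilder out = new StringBuilder();
        for (byte b : text.getBytes(charset)) {
            int v = b & 0xFF; // treat the byte as unsigned
            if (v < 0x80 && UNRESERVED.indexOf((char) v) >= 0) {
                out.append((char) v);
            } else {
                out.append('%').append(String.format("%02X", v));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String s = "\u00E9"; // LATIN SMALL LETTER E WITH ACUTE
        System.out.println(percentEncode(s, StandardCharsets.UTF_8));      // %C3%A9
        System.out.println(percentEncode(s, StandardCharsets.ISO_8859_1)); // %E9
    }
}
```

The same character yields different escapes depending on the charset, which is why the receiver has to know which charset the sender used. Passing Charset.forName("Big5") should work the same way, assuming the Java runtime ships that charset.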

Vinod is probably referring to the ECMAScript escape() function, which
encodes every non-Latin-1 character as %uxxxx, where xxxx is the Unicode
code point in hexadecimal.
http://msdn.microsoft.com/scripting/JScript/doc/jsglobalescape.htm
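
For comparison, a rough Java re-creation of what escape() produces might look like the sketch below. The class and method names are made up for this example. The set of characters left unescaped follows the ECMAScript definition of escape(), and the function works on UTF-16 code units, so a character outside the Basic Multilingual Plane would come out as two %uxxxx sequences.

```java
public class EscapeSketch {
    // Characters that ECMAScript escape() leaves untouched.
    private static final String UNESCAPED =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789@*_+-./";

    static String escape(String s) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i); // one UTF-16 code unit
            if (UNESCAPED.indexOf(c) >= 0) {
                out.append(c);
            } else if (c < 256) {
                // Latin-1 range: two hex digits
                out.append(String.format("%%%02X", (int) c));
            } else {
                // everything else: %u plus four hex digits
                out.append(String.format("%%u%04X", (int) c));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape("\u4E2D")); // %u4E2D
        System.out.println(escape("\u00E9")); // %E9
    }
}
```

Note that the %uxxxx form escapes character codes rather than bytes, so it does not fit the byte-oriented model of URL encoding described above.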

I don't consider the ECMAScript method a valid, recognized URL encoding and
as far as I know, ECMAScript is the only service where this escaping method
is implemented.

IE5 and later will submit characters that don't fit the form document
charset as HTML numeric character references of the form &#nnnnn;. The bytes
with the us-ascii representations &, # and ; are URL reserved bytes, so they
will be URL escaped as %26, %23 and %3B respectively.
Characters that do fit the form document charset undergo simple URL encoding
per byte.
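
The submission behavior described above could be approximated in Java as follows. This is only a sketch of the description in this message, not a verified reproduction of what IE does: the class and method names are hypothetical, the unescaped set is an assumption, and the special handling of spaces in form data is ignored.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

public class FormFallbackSketch {
    // Assumption for this sketch: only these ASCII characters stay unescaped.
    private static final String UNRESERVED =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~";

    static void appendEscaped(StringBuilder out, byte[] bytes) {
        for (byte b : bytes) {
            int v = b & 0xFF;
            if (v < 0x80 && UNRESERVED.indexOf((char) v) >= 0) {
                out.append((char) v);
            } else {
                out.append('%').append(String.format("%02X", v));
            }
        }
    }

    static String encodeFormValue(String value, Charset formCharset) {
        CharsetEncoder encoder = formCharset.newEncoder();
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < value.length(); ) {
            int cp = value.codePointAt(i);
            String ch = new String(Character.toChars(cp));
            byte[] bytes = encoder.canEncode(ch)
                ? ch.getBytes(formCharset)                    // fits the form charset
                : ("&#" + cp + ";").getBytes(StandardCharsets.US_ASCII); // decimal NCR fallback
            appendEscaped(out, bytes);
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // U+4E2D (decimal 20013) does not fit ISO-8859-1, U+00E9 does.
        System.out.println(encodeFormValue("\u4E2D", StandardCharsets.ISO_8859_1)); // %26%2320013%3B
        System.out.println(encodeFormValue("\u00E9", StandardCharsets.ISO_8859_1)); // %E9
    }
}
```

A server receiving such data cannot tell a fallback reference apart from a user who literally typed the characters "&#20013;", which is one reason this scheme is ambiguous.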


-----Original Message-----
From: Vinod Balakrishnan [mailto:vinod@filemaker.com]
Sent: Thursday, July 06, 2000 1:52 PM
To: Lenny Turetsky; 'www-international@w3c.org';
'servlet-interest@java.sun.com'
Subject: Re: URL-encode international characters in Java?


You can encode Big-5 and other double-byte script characters in UTF-16. I
have seen IE5 encode URLs with a "%u" prefix for UTF-16. But in the
case of UTF-8 we don't have any standard prefix for representing that yet.

-Vinod

>Hi all,
> 
>Is there a standard way to URL-encode non-English characters in Java? For
>example, I know that '?' is URL-encoded as '%3F', but I don't know how or if
>Big-5 characters can be URL-encoded. I've experimented a bit, and found that
>IE will encode things differently based on the charset of the HTML doc which
>contains the form.
> 
>Ideally, I'd like to use functionality available in Java Servlets, or
>another Java code library, but any solutions would be much appreciated. I've
>looked at Java's java.net.URLEncoder class, but its encode() method won't
>do it, as documented in the JDC's bug database
>(http://developer.java.sun.com/developer/bugParade/bugs/4257115.html).
> 
>Is the only known solution to write my own encoder? If so, where can I find
>a list of the characters that *don't* need to be encoded? Is it just
>[A-Za-z0-9_]?
> 
>Thanks,
>Lenny Turetsky
>
Received on Thursday, 6 July 2000 20:05:59 GMT