RE: URL-encode international characters in Java? from Martin J. Duerst on 2000-07-07 (www-international@w3.org from July to September 2000)

From: Martin J. Duerst <duerst@w3.org>
Date: Fri, 07 Jul 2000 15:52:48 +0900
To: Chris Wendt <christw@MICROSOFT.com>, "'Vinod Balakrishnan'" <vinod@filemaker.com>, Lenny Turetsky<LTuretsky@salesforce.com>, "'www-international@w3c.org'"<www-international@w3c.org>, "'servlet-interest@java.sun.com'"<servlet-interest@java.sun.com>
Message-Id: <4.2.0.58.J.20000707153856.00b55540@sh.w3.mag.keio.ac.jp>

At 00/07/06 15:10 -0700, Chris Wendt wrote:
>URL encoding encodes bytes, not characters. The character encoding is a
>separate, independent layer.
>
>Vinod is probably referring to the ECMAScript escape() function which
>encodes every non-Latin1 character as %uxxxx where xxxx is the Unicode
>code point in hex characters.
>http://msdn.microsoft.com/scripting/JScript/doc/jsglobalescape.htm
>
>I don't consider the ECMAScript method a valid, recognized URL encoding and
>as far as I know, ECMAScript is the only service where this escaping method
>is implemented.

True. ECMAScript went in a direction that nothing else followed.
An update of the ECMAScript standard contains a new function that
encodes all non-ASCII characters (plus some ASCII characters that
are not allowed in URIs) by first using UTF-8 and then encoding
the resulting bytes with %hh.
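
As a rough illustration of that two-step scheme (characters to bytes, then
bytes to %hh), here is a minimal Java sketch. The class and method names are
hypothetical, and the set of characters left unescaped is an assumption taken
from the "unreserved" set of RFC 2396 (alphanumerics plus -_.!~*'()); it is
not the exact behavior of any particular browser or of the ECMAScript
function mentioned above. The charset is a parameter only to show that the
%hh layer works on bytes, whatever encoding produced them.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of any standard library.
class PercentEncoder {
    // Assumption: "mark" characters from the RFC 2396 unreserved set.
    private static final String UNRESERVED_MARKS = "-_.!~*'()";
    private static final char[] HEX = "0123456789ABCDEF".toCharArray();

    // Step 1: convert characters to bytes in the given charset.
    // Step 2: write each byte as itself if unreserved, otherwise as %hh.
    static String encode(String text, Charset charset) {
        StringBuilder out = new StringBuilder();
        for (byte b : text.getBytes(charset)) {
            int v = b & 0xFF; // treat the byte as unsigned
            boolean alnum = (v >= 'A' && v <= 'Z')
                    || (v >= 'a' && v <= 'z')
                    || (v >= '0' && v <= '9');
            if (alnum || UNRESERVED_MARKS.indexOf(v) >= 0) {
                out.append((char) v);
            } else {
                out.append('%').append(HEX[v >> 4]).append(HEX[v & 0x0F]);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // U+4E2D encoded via UTF-8, then the same e-acute via two charsets.
        System.out.println(encode("\u4E2D", StandardCharsets.UTF_8));
        System.out.println(encode("\u00E9", StandardCharsets.UTF_8));
        System.out.println(encode("\u00E9", StandardCharsets.ISO_8859_1));
    }
}
```

With UTF-8 every byte of a non-ASCII character is 0x80 or above, so it is
always escaped. With some legacy double-byte charsets a trail byte can fall
in the ASCII range, and this sketch would then emit it unescaped.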

Using UTF-8 is recommended for all new URI schemes, for URIs in XML,
and so on. Please see http://www.w3.org/International/O-URL-and-ident.html.

>IE5 and later will submit characters that don't fit the form document
>charset as HTML numeric character references &#nnnnn;. The bytes with the
>us-ascii representations &, # and ; are URL reserved bytes so they will be
>URL escaped as %26, %23 and %3B resp.

If UTF-8 is used for the page, of course, there won't be any such
characters.
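
To make the difference concrete, here is a small Java sketch that imitates
the fallback described in the quoted text. It is only a model of that
description, not IE's actual code: the class and method names are
hypothetical, and java.net.URLEncoder.encode(String, String) stands in for
the form-submission escaping. A character that the form charset cannot
represent is turned into a decimal numeric character reference first, and
the result is then URL-escaped, so "&" comes out as %26, "#" as %23 and ";"
as %3B.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical model of the submission behavior described above.
class NcrFallback {
    // Replace each code point the charset cannot encode with &#nnnnn;.
    static String withNcrFallback(String text, Charset formCharset) {
        CharsetEncoder encoder = formCharset.newEncoder();
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < text.length()) {
            int codePoint = text.codePointAt(i);
            int len = Character.charCount(codePoint);
            CharSequence piece = text.subSequence(i, i + len);
            if (encoder.canEncode(piece)) {
                out.append(piece);
            } else {
                out.append("&#").append(codePoint).append(';');
            }
            i += len;
        }
        return out.toString();
    }

    // Apply the fallback, then URL-escape the bytes in the form charset.
    static String submit(String text, Charset formCharset) {
        try {
            return URLEncoder.encode(withNcrFallback(text, formCharset),
                                     formCharset.name());
        } catch (UnsupportedEncodingException e) {
            // Cannot happen for a Charset object the JVM already provides.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // U+4E2D does not fit ISO-8859-1 but does fit UTF-8.
        System.out.println(submit("\u4E2D", StandardCharsets.ISO_8859_1));
        System.out.println(submit("\u4E2D", StandardCharsets.UTF_8));
    }
}
```

With a UTF-8 form charset the fallback branch is never taken, which is the
point made above: the receiver sees only %hh-escaped UTF-8 bytes and never
has to guess whether "&#20013;" was typed literally or substituted.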

>Characters that do fit the form document charset undergo simple URL encoding
>per byte.

Does IE support the 'accept-charset' parameter on FORM?

Regards,   Martin.




>-----Original Message-----
>From: Vinod Balakrishnan [mailto:vinod@filemaker.com]
>Sent: Thursday, July 06, 2000 1:52 PM
>To: Lenny Turetsky; 'www-international@w3c.org';
>'servlet-interest@java.sun.com'
>Subject: Re: URL-encode international characters in Java?
>
>
>You can encode Big-5 and other double-byte script characters in UTF-16. I
>have seen IE5 encode URLs with a "%u" prefix for UTF-16. But in the
>case of UTF-8 we don't have any standard prefix for representing that yet.
>
>-Vinod
>
> >Hi all,
> >
> >Is there a standard way to URL-encode non-English characters in Java? For
> >example, I know that '?' is URL-encoded as '%3F', but I don't know how or if
> >Big-5 characters can be URL-encoded. I've experimented a bit, and found that
> >IE will encode things differently based on the charset of the HTML doc which
> >contains the form.
> >
> >Ideally, I'd like to use functionality available in Java Servlets, or
> >another Java code library, but any solutions would be much appreciated. I've
> >looked at Java's java.net.URLEncoder class, but its encode() method won't
> >do it, as documented in the JDC's bug database (
> >http://developer.java.sun.com/developer/bugParade/bugs/4257115.html ).
> >
> >Is the only known solution to write my own encoder? If so, where can I find
> >a list of the characters that *don't* need to be encoded? Is it just
> >[A-Za-z0-9_]?
> >
> >Thanks,
> >Lenny Turetsky
> >

Received on Friday, 7 July 2000 03:27:52 UTC