
RE: Encoding URI From/To UTF-16 Questions

From: McDonald, Ira <imcdonald@sharplabs.com>
Date: Sun, 27 Mar 2005 09:05:44 -0800
Message-ID: <CFEE79A465B35C4385389BA5866BEDF00C7AFD@mailsrvnt02.enet.sharplabs.com>
To: "'James Cerra'" <jfcst24_public@yahoo.com>, Mike Brown <mike@skew.org>
Cc: uri@w3.org

Hi James,

Please don't percent encode UTF-16 in a URI.  RFC 3986 "URI Generic
Syntax" (which obsoletes RFC 2396) says on page 21:

  "The reg-name syntax allows percent-encoded octets in order to
   represent non-ASCII registered names in a uniform way that is
   independent of the underlying name resolution technology.  Non-ASCII
   characters must first be encoded according to UTF-8 [STD63], and then
   each octet of the corresponding UTF-8 sequence must be percent-
>  encoded to be represented as URI characters.  URI producing
>  applications must not use percent-encoding in host unless it is used
>  to represent a UTF-8 character sequence.  When a non-ASCII registered
   name represents an internationalized domain name intended for
   resolution via the DNS, the name must be transformed to the IDNA
   encoding [RFC3490] prior to name lookup.  URI producers should
   provide these registered names in the IDNA encoding, rather than a
   percent-encoding, if they wish to maximize interoperability with
   legacy URI resolvers."

Now that applies only to the domain name part of a URI, but the point
is that it's impossible to mix two different 'native' encodings in a 
single URI - because the receiver couldn't possibly know how to
decode them.
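
A minimal sketch of the point above, in Python, using the U+4F5B character from the quoted question below. The variable names are illustrative and not from the thread, and UTF-16 is assumed to mean big-endian without a byte order mark. It shows that the UTF-8 form round-trips, while the percent-encoded UTF-16 octets are read by a UTF-8-assuming receiver as two unrelated ASCII characters:

```python
from urllib.parse import quote, unquote

CHAR = "\u4f5b"  # U+4F5B, the example character from the question

# RFC 3986 approach: encode to UTF-8 first, then percent-encode each octet.
utf8_form = quote(CHAR, safe="", encoding="utf-8")

# What percent-encoding the UTF-16 octets would produce instead
# (assumption: big-endian, no byte order mark).
utf16_form = "".join(f"%{octet:02X}" for octet in CHAR.encode("utf-16-be"))

print(utf8_form)            # %E4%BD%9B
print(utf16_form)           # %4F%5B

# A receiver that assumes UTF-8 (the default for unquote) recovers the
# original character only from the UTF-8 form.
print(unquote(utf8_form) == CHAR)   # True
print(unquote(utf16_form))          # O[
```

The octets 0x4F and 0x5B are valid ASCII on their own, so nothing in the URI tells the receiver that UTF-16 was intended.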

- Ira

Ira McDonald (Musician / Software Architect)
Blue Roof Music / High North Inc
PO Box 221  Grand Marais, MI  49839
phone: +1-906-494-2434
email: imcdonald@sharplabs.com

-----Original Message-----
From: uri-request@w3.org [mailto:uri-request@w3.org]On Behalf Of James
Sent: Sunday, March 27, 2005 4:46 AM
To: Mike Brown
Cc: uri@w3.org
Subject: Re: Encoding URI From/To UTF-16 Questions


Thanks for the explanation.  It was incredibly enlightening, but I'm still a
little confused.


I think I understand.  So say the program got the U+4F5B HAN IDEOGRAPH
character, and the user wants to use UTF-16 as the character encoding for
bytes.  Then the program should:


I appreciate your help.  Thanks!

Jimmy Cerra

P.S.  I rewrote this response several times as I came to understand you.
Please excuse (or point out) any incongruities.

[1] http://www.wiwiss.fu-berlin.de/suhl/bizer/d2rmap/D2Rmap.htm

[2] RFC 3986, Sections 2.3 and 2.4.

[3] http://skew.org/xml/stylesheets/url-encode/

[4] http://www.w3.org/International/O-URL-code.html

[5] java.net.URLEncoder and java.net.URLDecoder

Received on Sunday, 27 March 2005 17:06:36 UTC
