Re: URL character set

Sam X. Sun (ssun@CNRI.Reston.VA.US)
Fri, 6 Mar 1998 06:14:31 -0500

Message-ID: <00ca01bd48f1$0d3a88a0$d7019784@ssun2.CNRI.Reston.Va.US>
From: "Sam X. Sun" <ssun@CNRI.Reston.VA.US>
To: "Larry Masinter" <>,
Cc: <uri@Bunyip.Com>
Date: Fri, 6 Mar 1998 06:14:31 -0500
Subject: Re: URL character set

Hi, Larry,

I'm assuming the word "URL" in your draft is going to be replaced by the
word "URI". I have some questions about it.

> Section 2. Syntax.
> ... For 8-bit-URLs, it is necessary to hex-encode reserved
> characters, delims, unwise special characters, white space,
> and characters that might otherwise be confusing when
> printed and then typed. (See RFC-DURST for details.)

Does the 8-bit-URL have to be UTF-8 encoded?

Is the RFC-DURST reference available somewhere, Martin?

>  3.1 Requirements for URL entry
>  ... However, for all other sequences of characters,
>  entry should result in characters, in logical order,
>  from the ISO 10646 character repertoire, encoded using
>  the UTF-8 method [RFC 2044], and then subsequently
>  encoded using the URL escape mechanism [RFC-URL-SYNTAX].

Where does the translation from the local encoding, say JIS, to UTF-8, and
then to the URL hex escape take place?

Does this mean that, in the HTML document, the URL still has to be a
hex-encoded ASCII string?
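
To make the question concrete, here is a minimal sketch of the two-step
conversion as I read section 3.1, assuming it happens in whatever software
accepts the typed URL. The function name local_to_escaped and the choice of
Shift_JIS as the "local" encoding are my own assumptions for illustration,
not anything taken from the draft.

```python
from urllib.parse import quote

def local_to_escaped(raw: bytes, local_codec: str) -> str:
    """Re-encode locally encoded bytes as UTF-8, then hex-escape them."""
    text = raw.decode(local_codec)   # local encoding -> abstract characters
    utf8 = text.encode("utf-8")      # characters -> UTF-8 octets
    return quote(utf8, safe="")      # octets -> %XX escapes (plain ASCII)

# Assumed input: the two characters U+65E5 U+672C as a Shift_JIS
# input method might deliver them.
typed = "\u65e5\u672c".encode("shift_jis")
escaped = local_to_escaped(typed, "shift_jis")
print(escaped)  # prints %E6%97%A5%E6%9C%AC
```

The result is a pure ASCII string, which is what I take "hex encoded ASCII
string" to mean above.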

>3.3 Requirements for display of URLs
>   Software that displays URLs to users (or other kinds of
>   transcription, e.g., deciding what to print in your
>   magazine) should of course be robust: don't tell users
>   about a URL they can't type!

This is where I'm not sure what we can enforce. Let's say we have a
community that doesn't speak English, and their computing environment
supports their own encoding, JIS for example. They are very likely to use
JIS to type in or define URI names that are intended for use within that
community. My feeling is that we can't force everyone to use ASCII or UTF-8
to encode all URIs, just as we can't force all HTML documents to be encoded
in ASCII or UTF-8. So a better way might be to default to UTF-8, but allow a
native encoding as long as that encoding is specified somewhere in the
name reference.

It seems more natural to consider that URI syntax governs how a URI is
specified in the HTML document, not necessarily what gets sent over the wire.
It is under this assumption that, when we defined the "hdl:" syntax
(, many updates
due though.), we specified that it defaults to UTF-8 encoding, but when
preceded with, say, "jis@", what follows will be in JIS encoding. I know
Martin is going to beat me up since it's "too" cumbersome, but I can't think
of a better solution :). I don't think every URI needs to be readable or
typable by everyone in the world, and we can't force everyone to make it so.
It's a decision to be made by whoever assigns the URI name. I guess this is
the fundamental difference from what's specified in RFC1630. On the other
hand, if we don't define a way to specify the native encoding used for a
particular URI, we might end up with the current HTML mess, where many
natively encoded documents don't specify their encoding and users have to
guess which encoding is used when looking at non-ASCII documents.
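
A rough sketch of how a resolver might interpret such a prefix follows. It
is only an illustration under my own assumptions: the tag table, the mapping
of "jis" to ISO-2022-JP, and the function name decode_name are hypothetical,
and the real "hdl:" syntax may differ.

```python
from urllib.parse import quote, unquote_to_bytes

# Hypothetical tag -> codec table. Mapping "jis" to ISO-2022-JP is an
# assumption; "JIS" could equally well mean another Japanese encoding.
CODECS = {"jis": "iso2022_jp"}

def decode_name(name: str) -> str:
    """Decode a hex-escaped name, honouring an optional "<tag>@" prefix."""
    tag, sep, rest = name.partition("@")
    if sep and tag.lower() in CODECS:
        codec, body = CODECS[tag.lower()], rest   # declared native encoding
    else:
        codec, body = "utf-8", name               # no known tag: UTF-8 default
    return unquote_to_bytes(body).decode(codec)

# The same two characters, once in the default encoding and once tagged.
chars = "\u65e5\u672c"
utf8_name = quote(chars.encode("utf-8"), safe="")
jis_name = "jis@" + quote(chars.encode("iso2022_jp"), safe="")
same = decode_name(utf8_name) == decode_name(jis_name) == chars
```

Only tags found in the table are treated as encoding prefixes, so a name
that merely contains "@" still falls through to the UTF-8 default.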

Again, I'm confused about what URI syntax is for. Is it for what is
specified in the HTML document, or is it for the URI that gets transferred
over the wire? Could we make this clearer in the draft? But even if it is
the syntax for what gets transferred over the wire, I think it's still up to
the scheme-specific service to decide what the syntax is, not the URI itself.


-----Original Message-----
From: Larry Masinter <>
To: Al Gilman <>
Cc: uri@Bunyip.Com <uri@Bunyip.Com>
Date: Tuesday, March 03, 1998 4:51 PM
Subject: URL character set

>Al Gilman
># The restriction to the current RFC-822-header-safe subset of
># ASCII is temporary under the plans as I hear them.  But it does
># not make sense to open this up to a schemewise free-for-all or the
># clients will choke on the necessary library.  Saying that some
># clients will support some schemes defeats the purpose.  The point
># of URIs is so that more clients can support more schemes.
># I think that
># "Character Set" Considered Harmful
># may be relevant here.
>Check out
>I think we might want to remove the idea of a 'new kind of URL', though,
>and call it an EURI. If I get comments this week, I'll try to incorporate
>them in a revised version.