Re: revised "generic syntax" internet draft

Chris Newman
Tue, 15 Apr 1997 13:07:23 -0700 (PDT)

Date: Tue, 15 Apr 1997 13:07:23 -0700 (PDT)
From: Chris Newman <>
Subject: Re: revised "generic syntax" internet draft
In-Reply-To: <>
To: IETF URI list <>
Message-Id: <>

Here are the approaches to i18n I've seen:

(1) US-ASCII only

(2) ISO-8859-1 only

(3) whatever localized character set is in use

(4) Explicit labelling of character set

(5) Unicode derivative

(1) Never works because it doesn't satisfy demand.

(2) Never works and is even worse than (1) because not only does it fail
to satisfy demand, but it uses up the "undefined" codepoints in such a way
that an interoperable solution *can't* be deployed.

(3) Never works, because it doesn't interoperate.  It results in a bunch
of islands which can't communicate, except via US-ASCII.

(4) Works fine, but is very hard to support for ideographic characters,
because it requires mapping tables between ISO-2022, Unicode and whatever
character set is supported by the display system.

(5) Works fine, and has the potential to be easier to support than (4).

The status quo in URLs is a mixture of (1), (2), and (3).  This is
completely unacceptable for an interoperable solution.  We *MUST* move
towards (4) or (5).  Given that I've heard no proposals along the lines of
MIME header encoded words, the only solution on the table is (5).
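
For comparison, here is roughly what approach (4) looks like in MIME
header encoded words, where each word carries its own charset label. This
is only a sketch using Python's standard email.header module; the sample
header values are made up for illustration, and nothing here is a proposal
for URL syntax.

```python
from email.header import decode_header, make_header

# Two hypothetical encoded words for the same text, labelled with
# different character sets (approach 4: explicit labelling).
latin1_word = "=?ISO-8859-1?Q?caf=E9?="
utf8_word = "=?UTF-8?Q?caf=C3=A9?="

def decode_word(word: str) -> str:
    """Decode a MIME encoded word using the charset label it carries."""
    return str(make_header(decode_header(word)))

# The octets differ, but the labels make both interpretations unambiguous.
text_from_latin1 = decode_word(latin1_word)
text_from_utf8 = decode_word(utf8_word)
```

Both words decode to the same text because the receiver is told which
character set to use; the cost is that the receiver must support every
character set a sender might label.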

I will also point out that when a URL contains unencoded 8-bit characters
and is embedded in a properly charset-labelled document, there are no
problems, as the interpretation is clear.  We do need to deal with the
interpretation of %-encoded 8-bit characters.  If we're ambitious, we can
also address the issue of unlabelled unencoded 8-bit characters, but I'd
be tempted to avoid that rathole.
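
To illustrate why %-encoded 8-bit characters need a defined
interpretation, here is a small sketch using Python's urllib.parse. The
sample character and the choice of ISO-8859-1 and UTF-8 as the two
candidate charsets are my assumptions for the example; the point is only
that the octets are well defined while the characters are not.

```python
from urllib.parse import quote, unquote, unquote_to_bytes

# The same character (U+00E9, e with acute) percent-encoded two ways.
latin1_form = quote("\u00e9", encoding="iso-8859-1")  # a single octet
utf8_form = quote("\u00e9", encoding="utf-8")         # two octets

# %-decoding always recovers the octets without ambiguity ...
raw = unquote_to_bytes(utf8_form)

# ... but mapping octets to characters depends on an assumed charset.
# Reading the UTF-8 form as ISO-8859-1 silently yields the wrong text.
misread = unquote(utf8_form, encoding="iso-8859-1")

# Reading the ISO-8859-1 form as UTF-8 fails, which is at least detectable.
try:
    unquote(latin1_form, encoding="utf-8", errors="strict")
    detectable = False
except UnicodeDecodeError:
    detectable = True
```

With a single Unicode-derived encoding fixed for %-encoded octets, the
decoding side would not have to guess.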

The biggest failure of HTTP/HTML was choosing (2) above when MIME already
had a perfectly functional solution (4).