Re: Globalizing URIs

Martin J Duerst (mduerst@ifi.unizh.ch)
Thu, 10 Aug 1995 21:35:44 +0200 (MET DST)


Message-Id: <9508101935.AA08382@mocha.bunyip.com>
Subject: Re: Globalizing URIs
To: ietf-lists@proper.com (Paul Hoffman)
Date: Thu, 10 Aug 1995 21:35:44 +0200 (MET DST)
Cc: mduerst@ifi.unizh.ch, uri@bunyip.com
In-Reply-To: <v0213050fac4fe0d02bbc@[165.227.40.30]> from "Paul Hoffman" at Aug 10, 95 09:19:36 am
From: Martin J Duerst <mduerst@ifi.unizh.ch>


A side note: I have been made aware that the uri-wg more or less no longer
exists. Still, this seems to be the only place where this problem
can be discussed. What is the correct procedure?



>I suggest that those most interested in making the stuff to the right of
>the domain name more readable simply propose a good way to do it and
>publicize it. It doesn't have to be an IETF standard: if everyone on the
>server side is doing it, the clients will soon follow. One of the original
>suggestions (adding a [XYZ] to the end of the URL) sounds OK to me, as
>does putting the [XYZ] after the first / after the domain name. There will
>be heavy technical problems with either of these for many existing servers
>of many protocols, but they can be gotten around if the market demands it.

So here are some proposals (this is a collection of what has appeared
before on the html-wg list). After some discussion and refinement, I could
very well imagine that some of the proposals in this mail become part
of an internet-draft. For easier presentation, I assume that
uppercase letters are not part of ASCII, but have to be encoded specially:


A) Add information about the character set used somewhere in the URL

The concrete forms that have surfaced were (in their "nice" form):

A1)	<[ISO-8859-1]http://xxx.yyy.zz/AA/BB/CC.html>
A2)	<http:[ISO-8859-1]//xxx.yyy.zz/AA/BB/CC.html>
A3)	<http://xxx.yyy.zz/[ISO-8859-1]AA/BB/CC.html>
A4)	<http://xxx.yyy.zz/AA/BB/CC.html[ISO-8859-1]>
A5)	<http://xxx.yyy.zz/AA/BB/CC.html;ISO-8859-1>
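
As a rough illustration of how a client could handle any of the five placements, the following Python sketch peels the charset label off a labelled URL. The function name split_charset_label and both regular expressions are assumptions made for this illustration only; the sketch also assumes that the bare URL contains no square brackets of its own and does not end in a ';' parameter.

```python
import re

# Hypothetical patterns: a "[charset]" label anywhere (forms A1 to A4),
# or a ";charset" suffix at the very end (form A5).
_BRACKET = re.compile(r'\[([^\]]+)\]')
_SUFFIX = re.compile(r';([A-Za-z0-9._-]+)$')

def split_charset_label(labelled):
    """Return (charset, bare_url); charset is None if no label is found."""
    m = _BRACKET.search(labelled)
    if m:
        return m.group(1), labelled[:m.start()] + labelled[m.end():]
    m = _SUFFIX.search(labelled)
    if m:
        return m.group(1), labelled[:m.start()]
    return None, labelled

forms = [
    "[ISO-8859-1]http://xxx.yyy.zz/AA/BB/CC.html",   # A1
    "http:[ISO-8859-1]//xxx.yyy.zz/AA/BB/CC.html",   # A2
    "http://xxx.yyy.zz/[ISO-8859-1]AA/BB/CC.html",   # A3
    "http://xxx.yyy.zz/AA/BB/CC.html[ISO-8859-1]",   # A4
    "http://xxx.yyy.zz/AA/BB/CC.html;ISO-8859-1",    # A5
]
for form in forms:
    print(split_charset_label(form))
```

All five inputs yield the same pair, which shows that the label carries the same information wherever it is placed; only the ease of separating it differs.
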

These cases differ in that A1), A2), and probably A4) and A5)
show that they apply (potentially) to the whole URL. A2) and
especially A1) are easily separable from the URL without confusion,
and can be understood as an addition outside the URL proper on
how to input the URL (e.g. by selecting from a menu in the
"enter URL" dialog). A3) has a particularly easy implementation
in most cases, in that a link from directory "[ISO-8859-1]AA"
to "AA" is created. I.e. even if the client-side tool has no
idea about this scheme, and the user doesn't realize that
the [] part is not really part of the URL proper, this will work.
Of course, there has to be some mechanism that translates
between the nice form, e.g.
	<http://xxx.yyy.zz/[ISO-8859-1]AA/BB/CC.html>
and the plain form
	<http://xxx.yyy.zz/[ISO-8859-1]%41%41/%42%42/%43%43.html>
but this is only possible with the information between [ and ].
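
The translation between nice and plain form for A3) might look roughly like the Python sketch below. It uses real ISO-8859-1 letters (e-acute, u-umlaut) instead of the uppercase stand-ins of this mail, because actual uppercase letters are ASCII and would not be escaped. The names nice_to_plain and plain_to_nice and the regular expression are hypothetical; urllib.parse.quote and unquote are the standard library calls.

```python
import re
from urllib.parse import quote, unquote

# Hypothetical pattern for form A3): scheme://host/[charset]path
_A3 = re.compile(
    r'^(?P<prefix>[a-z]+://[^/]+/)\[(?P<charset>[^\]]+)\](?P<path>.*)$')

def _split(url):
    m = _A3.match(url)
    if m is None:
        raise ValueError("no [charset] label after the host")
    return m.group('prefix'), m.group('charset'), m.group('path')

def nice_to_plain(nice_url):
    """Escape every non-ASCII character as %HH octets of the named charset."""
    prefix, charset, path = _split(nice_url)
    return f"{prefix}[{charset}]{quote(path, safe='/', encoding=charset)}"

def plain_to_nice(plain_url):
    """Decode the %HH octets back to characters using the named charset."""
    prefix, charset, path = _split(plain_url)
    return f"{prefix}[{charset}]{unquote(path, encoding=charset)}"

nice = "http://xxx.yyy.zz/[ISO-8859-1]caf\u00e9/men\u00fc.html"
plain = nice_to_plain(nice)
print(plain)   # http://xxx.yyy.zz/[ISO-8859-1]caf%E9/men%FC.html
```

Without the label, plain_to_nice would have no way of knowing that %E9 means e-acute rather than a character of some other 8-bit set.
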



B) Specifying a single character set

This has already surfaced in this group. Unicode/ISO 10646 is
clearly the only candidate. The problem is that raw UCS-2
Unicode is not really suitable: it would make the plain form of
our example
	<http://xxx.yyy.zz/%00%41%00%41/%00%42%00%42/%00%43%00%43.html>
whereas in the fragment <aa.html> (plain), aa would denote the Han ideogram
U+6161, a serious backwards incompatibility.
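
Both effects can be reproduced with a few lines of Python. This is only a sketch: it assumes big-endian UCS-2, for which Python's utf-16-be codec is used as a stand-in (the two agree for 16-bit characters), and the helper name percent_encode_all is made up for the illustration.

```python
def percent_encode_all(octets):
    """Write every octet as %HH, as the plain form above does."""
    return ''.join(f'%{b:02X}' for b in octets)

# "AA" in raw UCS-2: each character becomes two octets, the first being 00.
print(percent_encode_all('AA'.encode('utf-16-be')))   # %00%41%00%41

# The two ASCII octets of "aa" (0x61 0x61), read as one UCS-2 character.
han = b'aa'.decode('utf-16-be')
print(f'U+{ord(han):04X}')                            # U+6161
```
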

But there are other encodings of Unicode. In particular,
UTF-7 and UTF-8 have to be examined. UTF-8 leaves ASCII untouched,
and encodes the rest of Unicode with two or three octets that have their
MSB set. Thus it can be directly recommended. The main disadvantage
is that e.g. a single Japanese character expands to nine ASCII characters
in the plain representation (three octets with MSB set -> three triples
of the form '%HH').
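
A short Python check of the UTF-8 case, using the standard urllib.parse.quote; the helper name plain_utf8 and the sample characters (U+00E9, and U+65E5 as the Japanese example) are chosen for illustration only.

```python
from urllib.parse import quote

def plain_utf8(nice):
    """Plain form under UTF-8: ASCII stays, other characters become %HH triples."""
    return quote(nice, safe='/', encoding='utf-8')

print(plain_utf8('abc.html'))        # abc.html   (ASCII untouched)
print(plain_utf8('\u00e9'))          # %C3%A9     (two octets, six characters)
print(plain_utf8('\u65e5'))          # %E6%97%A5  (three octets, nine characters)
```
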

UTF-7 is similar to BASE64, but designed for short stretches. It uses
the '+' to start a sequence of characters outside ASCII. This sequence
is terminated with '-', and a single '+' is represented as '+-'.
For using UTF-7 directly, it is now too late. A possibility is to combine
it with the '%' URL escape, i.e. to start with "%+" instead of '+', with
no need for "%+-". This has the advantage that the use of UTF-7 is
very clearly identified, and that there is no danger of confusion.
The problem is that clients not aware of the new "%+" syntax will
produce an error. The plain form will be shorter than for UTF-8,
especially if several successive characters are encoded
together, because no additional %HH-encoding is necessary.
In the limit, it is about  N*8/3+3  instead of  N*9  (each 16-bit character
takes 16/6 = 8/3 Base64 characters, plus three for "%+" and '-').
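
To make the comparison concrete, here is a minimal Python sketch of the "%+" variant. It is not a full UTF-7 implementation: it handles only 16-bit characters, always writes the closing '-', and passes ASCII through unchanged without any further %HH escaping. The function name encode_percent_plus is hypothetical. The Base64 part is built by hand from the big-endian 16-bit values, with the '=' padding dropped as UTF-7 does.

```python
import base64
from itertools import groupby
from urllib.parse import quote

def encode_percent_plus(text):
    """Encode each run of non-ASCII characters as %+<base64>- (sketch only)."""
    out = []
    for is_ascii, run in groupby(text, key=lambda ch: ord(ch) < 128):
        chunk = ''.join(run)
        if is_ascii:
            out.append(chunk)
        else:
            raw = chunk.encode('utf-16-be')          # 16 bits per character
            b64 = base64.b64encode(raw).decode('ascii').rstrip('=')
            out.append('%+' + b64 + '-')
    return ''.join(out)

name = '\u65e5\u672c\u8a9e'                          # three Japanese characters
print(encode_percent_plus(name + '.html'))           # %+ZeVnLIqe-.html
print(len(encode_percent_plus(name)))                # 11
print(len(quote(name, safe='', encoding='utf-8')))   # 27
```

For a single character the gain is smaller: '\u65e5' becomes "%+ZeU-", six characters against nine for UTF-8.
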

These proposals, whether UTF-7 or UTF-8 or something else,
are very useful as a direction to head in, but unfortunately
they are not sufficient for the present-day reality, because most
file systems do not use these encodings (yet), and servers are not
(yet) constructed to do the necessary translations. So a proposal
from B) has to go together with a proposal from A) or C).


C) Server serving information on plain <-> nice mapping

An analysis of the URL schemes most in need of a solution
has shown that they coincide with the schemes that describe
document-serving protocols (ftp, http, gopher). Other schemes
are less needy ("mailto" has RFC1522, "news" explicitly only
allows ASCII in group names).

So some kind of protocol extension may be a solution.
If you have a plain form of an URL, and want to find its nice
form, or vice versa, just ask the server (which is defined
by the scheme and the host info, which is itself ASCII only)
about the encoding (MIME-registered "charset" parameter)
that defines the mapping between these two URL forms.

It looks like we need difficult-to-introduce protocol
extensions for this to work, but there is a simpler
solution. We define a single filename, for example
"filename.enc" (note the 8.3 scheme). The only thing the
server administrator has to do is create such a file in
the central directory, consisting of a single line such as

ISO-8859-1

or whatever appropriate. Only the client has to
implement additional functionality, which it needs anyway.
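
The client side of this could be sketched in Python as follows. Several things here are assumptions: that the "central directory" is the server root, that the file is really called "filename.enc", and the function names themselves. The actual retrieval of the file is left out; only the URL construction, the parsing of the one-line file, and the plain-to-nice mapping are shown.

```python
from urllib.parse import urlsplit, urlunsplit, unquote

def encoding_file_url(url, name="filename.enc"):
    """Where to look for the encoding file: same scheme and host, root directory."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, "/" + name, "", ""))

def parse_encoding_file(body):
    """Return the charset named on the first non-empty line of the file."""
    for line in body.splitlines():
        line = line.strip()
        if line:
            return line
    raise ValueError("encoding file is empty")

def plain_to_nice(plain_url, charset):
    """Decode the %HH octets of the path using the server's charset."""
    parts = urlsplit(plain_url)
    nice_path = unquote(parts.path, encoding=charset)
    return urlunsplit((parts.scheme, parts.netloc, nice_path,
                       parts.query, parts.fragment))

plain = "http://xxx.yyy.zz/caf%E9/index.html"
print(encoding_file_url(plain))          # http://xxx.yyy.zz/filename.enc
charset = parse_encoding_file("ISO-8859-1\n")
print(plain_to_nice(plain, charset))     # path becomes /caf\u00e9/index.html
```
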

Besides the small risk of a filename conflict, which can
be further reduced by choosing a different name, the
main problem is that there are cases where different resource
names on the same server use different encodings (e.g.
because the same server contains UNIX files, as well as
backups for a Mac or so). For this, the file can be changed
to a multiline format, or each directory can contain a file
of that name (with an entry in the parent directory that
indicates that the subdirectory should be examined).

Also, the proposals A) and C) do not provide for the possibility
that different parts of an URL use different character encodings,
but this would cause big confusion even locally.

To ease the transition between the use of e.g. C) and the use
of a solution from B), the convention could be introduced
that servers that use B) change the conventional first
component of the domain name from "www" or "ftp"
to e.g. "utf7-www".


This has been a long list of proposals, and I would greatly
appreciate any comments regarding their feasibility, not
so much in general as on specific points such as errors and
oversights.

Many thanks in advance, 	Martin.