URL internationalization! from Martin J. Duerst on 1997-02-18 (uri@w3.org from February 1997)

From: Martin J. Duerst <mduerst@ifi.unizh.ch>
Date: Tue, 18 Feb 1997 16:09:03 +0100 (MET)
To: URI mailing list <uri@bunyip.com>
Message-Id: <Pine.SUN.3.95q.970218160814.245I-100000@enoshima>
Hello everybody,

[My apologies if you receive this mail twice.]

In the next mail to this list and its counterpart, I will make
proposals for changes/additions to the syntax or the process draft.
These changes are designed to set the direction for consistent
internationalization (i18n) of URLs in the future.
[The current state is best characterized by the word "chaos".]

As is customary for standards, language is limited to the facts,
without much background. To give you this background, and show
that the direction pointed to is very reasonable, I am sending
this mail with detailed explanations to show that it actually
works. If you have any further questions, I will be glad to
answer them. If the changes proposed are approved, I will
convert and expand the main part of this mail into an internet
draft.

My proposals to the URL process and syntax draft are based on
many discussions in conferences and on IETF mailing lists, and
talks on this issue in the last Unicode conference and at the
Multilinguism workshop in Sevilla.




Overview
--------

0. The Current State
1. The Solution
2. The Proposals
3. The Second Step
4. Backwards Compatibility and Upgrade Path



0. The Current State
--------------------

URLs currently are sequences of a limited subset of ASCII characters.
To encode other characters, arbitrary and unspecified character
encodings ("charset"s) are used, and the resulting octets are encoded
using the %HH escaping mechanism.

The first step of this is highly unsatisfactory because
it cannot be inverted, i.e. you cannot derive the characters from
the octets, and because even forward conversion cannot be done without
some additional information, i.e. if you know the characters of a file
name, you can't construct the URL, because you don't know the encoding
that the server will expect.
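
The ambiguity described above can be made concrete with a small
Python sketch. It uses only the standard urllib.parse module; the
sample name "Dürst" and the two charsets are illustrative assumptions,
not something any particular server is known to use.

```python
from urllib.parse import quote, unquote_to_bytes

name = "Dürst"  # illustrative file name with one non-ASCII character

# Forward direction: the same characters yield different URLs,
# depending on which charset the server happens to expect.
as_latin1 = quote(name, encoding="iso-8859-1")
as_utf8 = quote(name, encoding="utf-8")

# Reverse direction: the octets alone do not identify the characters.
octets = unquote_to_bytes(as_latin1)
read_as_latin1 = octets.decode("iso-8859-1")
read_as_cyrillic = octets.decode("iso-8859-5")

print(as_latin1, as_utf8)
print(read_as_latin1, read_as_cyrillic)
```

Both decodings in the last step succeed, which is exactly the problem:
nothing in the URL says which one was meant.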

The above is current practice, and can't be changed overnight,
especially because a lot of protocols and software are affected.
But it is possible to specify the direction to go, and to start
moving in that direction.

You may ask why it is necessary to care about the internationalization
of URLs at all. Well, for some parts of URLs (in particular
the query part), i18n is undeniably necessary, and for others,
not doing it puts people and communities using something
other than basic Latin at an unnecessary disadvantage.
Of course, no information provider will be forced to use these
features. But as an example, take the Hong Kong Tourist Association
advertising Hong Kong in the US and in Japan. It will only help
them if the URL to their Japanese site is as understandable,
memorizable, transcribable, or guessable for Japanese speakers
as the ASCII URL is for English speakers.

The arguments that i18n URLs cannot be keyboarded, transcribed,
or mailed may have had some significance in the past, but are
getting less and less valid. On the Web, with forms and Java,
it's easy to offer a keyboarding service for the whole of Unicode.

Anyway, the proposals I am making here are not at all going that
far. I am just mentioning these things above and below to show
that it's actually workable and desirable. Discussions on the
URN list have shown that some people are still very sceptical
about transcribability. The work on HTML 2.0 and HTML i18n
has shown that it is possible to move towards i18n in a
well-defined way even at a rather late stage.



1. The Solution
---------------

To solve the problem that character encodings are not known,
there could be different solutions. Using tags (a la RFC 1522)
is ugly because these tags would have to be carried around
on paper and everywhere. Having every scheme and mechanism use
different conventions is also rather impracticable because this
leads to confusion (e.g. if the DNS part of a URL is different
from the path part because DNS and HTTP use different conventions).
Inventing new schemes for internationalization (e.g. ftp->iftp)
is likewise too clumsy.

The solution that remains is to specify preference for a single
character encoding. Because this has to encompass a very wide
range of characters, the Universal Character Set (UCS) of
Unicode/ISO 10646 is the only solution. To be compatible with
ASCII, there exist two UCS Transform Formats, UTF-7 and UTF-8.
UTF-7, however, has various disadvantages. Some of the characters
it uses can have special functions inside URLs. It is not completely
compatible with ASCII. And it has some potential limitations with
regard to the full space of UCS (although the latter point is rather
theoretical).

UTF-8, designed by Rob Pike and others at Bell Labs, is not perfect
for URLs either. But it is fully ASCII-compatible, and doesn't
interfere with URL syntax. Also, the problem that it uses all 8 bits
per octet is solved by the widely deployed %HH-encoding of URLs.
The problem that some URLs can get rather long with %HH-encoding
will hopefully be addressed by the direct use of the characters
themselves sometime in the future (see below).

UTF-8 has an additional advantage, namely that it is
with high probability easily distinguishable from legacy
encodings. This was one of the reasons it was adopted for
internationalizing FTP (see draft-ietf-ftpext-intl-ftp-00.txt).
URNs also have adopted UTF-8 and %HH-encoding (see
draft-ietf-urn-syntax-02.txt). This is important because
URLs and URNs are both part of URIs, and are otherwise
kept in sync syntactically. Also, the general recommendation
from the Internet Architecture Board is to use UTF-8 in
situations where ASCII-compatibility is necessary
(see draft-weider-iab-char-wrkshop-00.txt).
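
The claim that UTF-8 is distinguishable from legacy encodings with
high probability can be illustrated with a minimal Python sketch.
The function name looks_like_utf8 is hypothetical, and the check is
only a heuristic: it asks whether octets containing non-ASCII values
happen to form a strictly valid UTF-8 sequence.

```python
def looks_like_utf8(octets: bytes) -> bool:
    """Heuristic: non-ASCII octets that decode strictly as UTF-8."""
    if octets.isascii():
        return False  # pure ASCII carries no evidence either way
    try:
        octets.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


print(looks_like_utf8("Dürst".encode("utf-8")))
print(looks_like_utf8("Dürst".encode("iso-8859-1")))
# A rare false positive: the Latin-1 text "Ã¼" is also valid UTF-8.
print(looks_like_utf8("Ã¼".encode("iso-8859-1")))
```

The last case shows why this is a matter of probability and not of
certainty: a legacy string can be valid UTF-8 by accident, but such
sequences are unusual in real names.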

This has led to the conclusion, in the various discussions I have
been involved in, that UTF-8 is the solution. Good documentation
on this can be found at http://www.alis.com:8085/~yergeau/url-00.html.
Actually, Francois Yergeau was faster than me to realize that UTF-8
is the way to go, and we wouldn't be where we are now if it weren't
for his efforts.



2. The Proposals
----------------

In Sevilla, I proposed two steps to move towards the solution.
The first step is to specify UTF-8 as the preferred character
encoding for URLs, while leaving URLs unchanged otherwise.
This brings the target for URLs in sync with the spec for URNs.
For reasons already explained, the proposals for changes to the
syntax and process draft I make cover only the first step.
For most people caring about i18n, this looks like "not enough".
But it is important to set the basic infrastructure first.
Once we have that, the rest will come almost without effort.

The two important points in the first step are "no enforcement"
and "canonical form" (a better name for the latter might be
lowest common denominator).

Canonical form means that %HH-escaping is compulsory, and no 8-bit
or native encoding is allowed. This is important to stay in sync
with URNs, to allow UTF-8 URL software and actual URLs to deploy
before raising the expectations of users too much, and to address
the concerns about transcribability of those in the URI community
not very familiar with i18n issues.
Of course, as with URNs, even with the canonical form only, it
will be possible rather quickly to offer nice user interfaces
hiding the %HH (once you know how the conversion is done!).
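
As a rough illustration of the canonical form and of a user interface
that hides the %HH, here is a minimal Python sketch. The function
names to_canonical and to_display are hypothetical; the only thing
taken from the proposal is "UTF-8, then %HH-escaping".

```python
from urllib.parse import quote, unquote


def to_canonical(path: str) -> str:
    """Characters -> UTF-8 octets -> %HH; the result is pure ASCII."""
    return quote(path, safe="/", encoding="utf-8")


def to_display(canonical: str) -> str:
    """Undo the escaping for display, if the octets are valid UTF-8."""
    try:
        return unquote(canonical, encoding="utf-8", errors="strict")
    except UnicodeDecodeError:
        return canonical  # legacy URL: show it unchanged


url = to_canonical("/日本語/index.html")
print(url)
print(to_display(url))
print(to_display("/D%FCrst"))  # not UTF-8, so it is left escaped
```

Only the escaped form travels on paper and over the wire; the
conversion back to characters is purely a presentation matter.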

Due to current use, it is impossible to mandate any character
encoding for URLs; a lot of URLs would immediately become illegal.
Also, URLs are a "hat" over many very different protocols and
mechanisms, which all have their own communities and special
needs. Some URLs, such as the "data:" URL, do not encode characters
at all (but most actually do :-). Others might have very specific
requirements that preclude the use of UTF-8, or may not need
i18n at all for some good reason. However, please note that
there is no need for the protocol or mechanism itself to use
UTF-8 for internationalization. Assuming the protocol or mechanism
is internationalized, with a well-defined way to encode the characters
of UCS, a mapping from UTF-8 to that specific encoding can be defined.
This is preferable to exposing the scheme-specific i18n in URLs,
because it greatly simplifies URL handling and user interface
issues. For examples of scheme- or mechanism-specific i18n,
please see IMAP [RFC 2060, section 5.1.3], defining "modified UTF-7",
and the "UTF-5" from my proposal for domain name i18n
(draft-duerst-dns-i18n-00.txt).
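
To show what such a mapping could look like, the following Python
sketch converts the %HH-escaped UTF-8 form of a name into IMAP's
modified UTF-7. The function names are hypothetical, and the encoder
follows the modified UTF-7 rules as I understand them from the IMAP
specification ("&" shifts in, "-" shifts out, "," replaces "/" in the
base64 part, "&" itself becomes "&-"); treat it as a sketch and not as
a reference implementation.

```python
import base64
from urllib.parse import unquote


def modified_utf7_encode(text: str) -> str:
    out = []
    run = []  # pending characters outside printable ASCII

    def flush() -> None:
        if run:
            raw = "".join(run).encode("utf-16-be")
            b64 = base64.b64encode(raw).decode("ascii").rstrip("=")
            out.append("&" + b64.replace("/", ",") + "-")
            run.clear()

    for ch in text:
        if ch == "&":
            flush()
            out.append("&-")
        elif 0x20 <= ord(ch) <= 0x7E:
            flush()
            out.append(ch)
        else:
            run.append(ch)
    flush()
    return "".join(out)


def url_name_to_imap(escaped: str) -> str:
    """URL side uses UTF-8 + %HH; the protocol side keeps its own form."""
    return modified_utf7_encode(unquote(escaped, encoding="utf-8"))


print(url_name_to_imap("%E6%97%A5%E6%9C%AC%E8%AA%9E"))
```

The URL itself never exposes the "&...-" form; only the gateway
between URL and protocol needs to know about it.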



3. The Second Step
------------------

Once the direction towards using UTF-8 as the character encoding
for URLs is firmly set, and other i18n aspects are nicely covered
by software such as browsers and servers, it will be time for
step two, I hope. This means that %HH-escaping will not be
mandatory anymore. Note that this does not mean that binary
UTF-8 will be placed in the middle of an ASCII-only document,
or in a Latin-X document, or in a UCS2/UTF-16 document.

In these cases, and in all others where the character encoding
is implicitly or explicitly known, those characters that can
be represented natively will be encoded using the native encoding.
For the rest of the characters, UTF-8 and %HH-encoding will
be used. This is crucial to make transcoding, for example
in response to an "Accept-Charset:" header in HTTP, as well
as e.g. when cutting/pasting text in a GUI interface, work
correctly. It also conforms to what is done currently with the
basic URL characters e.g. in a document encoded in EBCDIC.
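
The rule for step two can be sketched in a few lines of Python. The
function name url_for_document is hypothetical; the sketch assumes the
URL is held as a sequence of characters and that the charset of the
enclosing document is known.

```python
from urllib.parse import quote


def url_for_document(url_chars: str, doc_charset: str) -> str:
    """Keep characters the document can represent natively;
    fall back to UTF-8 plus %HH for the rest."""
    parts = []
    for ch in url_chars:
        if ord(ch) < 0x80:
            parts.append(ch)  # basic URL characters stay as they are
            continue
        try:
            ch.encode(doc_charset)
            parts.append(ch)  # representable: use the native encoding
        except UnicodeEncodeError:
            parts.append(quote(ch, safe="", encoding="utf-8"))
    return "".join(parts)


print(url_for_document("/Dürst/日本", "iso-8859-1"))
print(url_for_document("/Dürst/日本", "ascii"))
```

Because the result is still a sequence of characters, transcoding the
document later transcodes the URL along with it, and the escaped
remainder is unaffected.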



4. Backwards Compatibility and Upgrade Path
-------------------------------------------

Backwards compatibility is fully guaranteed. No URL that was
legal up to now is made illegal. No URL that works now will
stop working. And it is actually possible to convert
to using UTF-8 gradually, without the servers depending on
all clients to be upgraded or vice versa. And even those
cases that already use native document encoding for their
URLs, and rely on them being interpreted octet-wise, can
be dealt with, to the extent they have been working up to
now.

If a server e.g. has a Latin-X filesystem locally, and gets a URL
that could be UTF-8, it converts it to Latin-X and checks whether the
resource is there. If the resource is not found, it tests using
the raw octets directly. Because of the heuristic separation between
UTF-8 and other encodings, and the usually extremely sparse use of
namespace on a server (except for forms input, see below), chances
for collision are extremely small. If somebody is not satisfied
with this, it's easy to add an extension that checks all current
resource names and assures that there are no conflicting resource
names, or that in case of conflict, the user gets a choice.
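
The server-side lookup described above might be sketched like this in
Python. The function name resolve and the set local_names (standing in
for the filesystem, with names stored as raw octets) are assumptions
made for the example.

```python
from urllib.parse import unquote_to_bytes


def resolve(url_path, local_names, local_charset="iso-8859-1"):
    """Try the UTF-8 interpretation first, then the raw octets."""
    octets = unquote_to_bytes(url_path)
    candidates = []
    try:
        # Could be UTF-8: convert it to the local (Latin-X) encoding.
        candidates.append(octets.decode("utf-8").encode(local_charset))
    except (UnicodeDecodeError, UnicodeEncodeError):
        pass  # not UTF-8, or not representable locally
    candidates.append(octets)  # legacy behaviour: raw octets
    for name in candidates:
        if name in local_names:
            return name
    return None


names = {"/Dürst.html".encode("iso-8859-1")}
print(resolve("/D%C3%BCrst.html", names))  # new-style UTF-8 URL
print(resolve("/D%FCrst.html", names))     # old-style Latin-1 URL
```

Both URLs reach the same file, so old links keep working while new
ones can already use UTF-8.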

A similar thing can happen on the browser side. A native URL,
in whatever encoding, is converted to UTF-8 and used to probe the
server. If it fails, another probe is made with the raw octet form
of the native URL. Initially, it could be done the other way round,
as long as that can be expected to yield more hits. Caching of the
encoding that led to the last hit, on whatever granularity desirable,
can help to reduce unnecessary connections. And the time penalty is
not that significant, because the current use of native character
representation with the expectation of binary interpretation has
never been guaranteed to work anyway.
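
A browser-side version of the same idea, including the cache, could
look roughly as follows in Python. The function probe, the callable
fetch (which returns True if the server has the resource), and the
per-host cache are all hypothetical; a real browser would of course
issue HTTP requests here.

```python
from urllib.parse import quote


def probe(host, path, native_charset, fetch, cache):
    """Try the UTF-8 form and the raw native form, best guess first."""
    forms = {
        "utf-8": quote(path, safe="/", encoding="utf-8"),
        "native": quote(path, safe="/", encoding=native_charset),
    }
    first = cache.get(host, "utf-8")  # encoding of the last hit, if any
    second = "native" if first == "utf-8" else "utf-8"
    tried = set()
    for label in (first, second):
        url_path = forms[label]
        if url_path in tried:
            continue  # pure ASCII paths need only one probe
        tried.add(url_path)
        if fetch(host, url_path):
            cache[host] = label
            return url_path
    return None


served = {"/D%FCrst"}  # a legacy server expecting Latin-1 octets
cache = {}
print(probe("example.org", "/Dürst", "iso-8859-1",
            lambda host, p: p in served, cache))
print(cache)
```

After the first hit the cache makes the legacy form the first probe
for that host, so the extra connection is paid only once.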

Now for those URLs that the server creates on the fly, in particular
the query part that results from form input. This currently is the
biggest headache to people actually working on i18n of browsers and
servers, and a uniform convention to use UTF-8 would improve the
situation a lot. Unfortunately, however, the namespace in this case
is more tightly populated. A server has a reasonable chance, depending
on the character encodings it is expecting, to detect UTF-8. But a
client doesn't have many clues about the server.

Because this is a rather old problem, several solutions have been
proposed and some of them are (partially) implemented. The first
is (in the context of HTTP) to use a GET with a multipart body,
where each of the parts has an appropriate "charset" tag. The second
is to incorporate a hidden field into the FORM, and guess the encoding
by reasonably assuming that the client is sending the same characters
back, although maybe in a different encoding, and that all fields
use the same character encoding. The third is to use the encoding
of the document that was sent, with UTF-8 also in the case of
UCS2/UTF-16 or raw UCS4. However, it is not certain that on a server
that can transcode to Unicode, all the scripts will be able to
accept Unicode. The fourth is to use the "Accept-Charset=" attribute
defined in RFC 2070 for the <INPUT> and <TEXTAREA> elements in
HTML <FORM>s to signal to the client that the server can handle
UTF-8 forms input. If the legacy text encoding for query parts is
known, this could even be handled transparently by the server.
Because these features are only defined in HTML, it is important
to agree on a single encoding to not bother other formats with
similar hassles.
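
The second of these solutions, the hidden field, can be sketched in
Python as follows. The function name guess_charset, the sample text
and the list of candidate charsets are assumptions for the example;
the idea is only that the server knows which characters it put into
the hidden field and tests which encoding reproduces them.

```python
def guess_charset(hidden_octets, expected_text,
                  candidates=("utf-8", "shift_jis", "euc_jp")):
    """Return the first candidate charset under which the hidden
    field decodes to the text the server originally sent."""
    for charset in candidates:
        try:
            if hidden_octets.decode(charset) == expected_text:
                return charset
        except UnicodeDecodeError:
            continue
    return None


sent = "日本語"  # text the server placed in the hidden field
print(guess_charset(sent.encode("shift_jis"), sent))
print(guess_charset(sent.encode("utf-8"), sent))
```

The charset found this way is then applied to all other fields of the
same submission, which is exactly the assumption stated above. With a
single agreed encoding, none of this guessing would be needed.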




Again, if you have any questions or comments, please contact me.


With kind regards,	Martin.

----
Dr.sc.  Martin J. Du"rst			    ' , . p y f g c R l / =
Institut fu"r Informatik			     a o e U i D h T n S -
der Universita"t Zu"rich			      ; q j k x b m w v z
Winterthurerstrasse  190			     (the Dvorak keyboard)
CH-8057   Zu"rich-Irchel   Tel: +41 1 257 43 16
 S w i t z e r l a n d	   Fax: +41 1 363 00 35   Email: mduerst@ifi.unizh.ch
----
Received on Tuesday, 18 February 1997 10:08:38 UTC