Internationalization of URLs

Hello everybody,

This is about INTERNATIONALIZATION of URLs!

[first, apologies if you receive this mail more than once]

As many of you are probably aware, action is needed to get
URL internationalization moving seriously in the right direction.

As I have participated in many discussions and have given talks
on this issue at the last Unicode conference and at the Multilingualism
workshop in Sevilla, I feel some responsibility to move things
ahead. However, I think I need some support, because the people
concerned with URLs in general are not overly friendly to
the idea of internationalizing URLs.

Nevertheless, I think we have a strong case, because for some
parts of URLs (the query part), i18n is undeniably necessary, and
for others, not doing it puts people and communities using
scripts other than basic Latin at an unnecessary disadvantage.
The concerns that i18n URLs cannot be keyboarded, transcribed,
or mailed may have had some significance in the past, but they
are no longer valid. For example, it is easy to offer a keyboarding
service for the whole of Unicode on the web.

This mail is sent to some lists that deal with internationalization,
and outlines the solution that others and I think is most feasible.
It then describes the proposals I will make around Feb 18th on the
relevant lists (unfortunately there are two of them; see below).

If you think that this issue is important, and the solution taken
makes sense, I would greatly appreciate it if you could have a look
at these lists and contribute your opinion. See the end of this
mail for instructions on how to subscribe to the lists. If you
have any questions, I will also gladly answer them.



Overview
--------

0. The Current State
1. The Solution
2. How to Get There
3. The Process
4. The Proposals
5. What You Can Do



0. The Current State
--------------------

URLs are currently sequences of a limited subset of ASCII characters.
To encode other characters, arbitrary character encodings ("charset"s)
are used, and the resulting octets are encoded using the %HH escaping
mechanism.

The first step of this is highly unsatisfactory because
it cannot be inverted (you cannot recover the characters from
the octets), and because it cannot even be carried out without
additional information: if you know the characters of a file name,
you still cannot construct the URL, because you do not know which
encoding the server expects.
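Both directions of this problem can be demonstrated in a few lines. A minimal sketch in Python (the octet value 0xE4 and the two charsets are just illustrative choices): the same %HH escape yields different characters under different charsets, and the same character maps to the same octet in different charsets, so neither direction is determined without out-of-band knowledge.

```python
from urllib.parse import unquote

# The escaped octet %E4 is ambiguous without knowing the charset:
# the same URL fragment yields different characters.
escaped = "%E4"
print(unquote(escaped, encoding="latin-1"))  # 'ä' under ISO-8859-1
print(unquote(escaped, encoding="koi8_r"))   # 'Д' under KOI8-R

# The other direction is just as underdetermined: knowing the
# character does not tell you which octets the server expects.
assert "ä".encode("latin-1") == b"\xe4"
assert "Д".encode("koi8_r") == b"\xe4"
```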

The above is current practice, and cannot be changed overnight,
especially because a lot of protocols and software are affected.
But it is possible to specify the direction to go, and it's
possible to arrive at the goal sooner or later.



1. The Solution
---------------

To solve the problem that character encodings are not known,
several solutions are conceivable. Using tags (a la RFC 1522)
is ugly because these tags would have to be carried around
on paper and everywhere else. Having every scheme and mechanism use
different conventions is also rather impracticable, because it
leads to confusion (e.g. if the DNS part of a URL is treated
differently from the path part because DNS and HTTP use different
conventions). Inventing new schemes for internationalization
(e.g. ftp -> iftp) is likewise too clumsy.

The solution that remains is to specify a preference for a single
character encoding. Because this encoding has to encompass UCS and
be compatible with ASCII, there are mainly two candidates, UTF-7
and UTF-8. UTF-7, however, has various disadvantages: some
of the characters it uses can have special functions inside
URLs, it is not completely compatible with ASCII, and it
has some potential limitations with regard to UCS-4
(although this last point is rather theoretical).

UTF-8 is not perfect either. But the problem that it uses
8 bits is solved by the already existing %HH-encoding.
The problem that, with %HH-encoding, some URLs can get rather
long will be addressed by the direct use of the characters
themselves where possible (see below).
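The intended convention is easy to sketch in Python (quote/unquote here stand in for whatever a client or server would actually implement; the path "Zürich" is just an example): characters are encoded as UTF-8 octets, the octets are %HH-escaped, and, unlike with unknown legacy charsets, the whole process inverts cleanly.

```python
from urllib.parse import quote, unquote

path = "Zürich"          # a hypothetical path component

# Step 1: characters -> UTF-8 octets; step 2: octets -> %HH escapes.
escaped = quote(path)    # quote() uses UTF-8 by default
print(escaped)           # Z%C3%BCrich

# The mapping inverts unambiguously, with no extra information needed.
assert unquote(escaped) == path
```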

UTF-8 has an additional advantage, namely that it is
with high probability easily distinguishable from legacy
encodings. This was one of the reasons it was adopted for
internationalizing FTP (see draft-ietf-ftpext-intl-ftp-00.txt).
URNs have also adopted UTF-8 and %HH-encoding (see
draft-ietf-urn-syntax-02.txt). This is important because
URLs and URNs are both part of URIs, and are otherwise
kept in sync syntactically. Also, the general recommendation
from the Internet Architecture Board is to use UTF-8 in
situations where ASCII compatibility is necessary
(see draft-weider-iab-char-wrkshop-00.txt).
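The distinguishability claim can be checked mechanically: the multi-octet patterns of UTF-8 are strict enough that text in a legacy 8-bit encoding almost never happens to validate as UTF-8. A minimal sketch in Python (the helper name and sample strings are mine, not from any draft):

```python
def looks_like_utf8(octets: bytes) -> bool:
    """Return True if the octets form a valid UTF-8 sequence."""
    try:
        octets.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# 'Zürich' encoded as UTF-8 validates; the same name in Latin-1
# does not, because a lone 0xFC octet is illegal in UTF-8.
assert looks_like_utf8("Zürich".encode("utf-8"))        # b'Z\xc3\xbcrich'
assert not looks_like_utf8("Zürich".encode("latin-1"))  # b'Z\xfcrich'
```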

This has led to the conclusion, in the various discussions I have
been involved in, that UTF-8 is the solution. A good write-up
can be found at http://www.alis.com:8085/~yergeau/url-00.html.
Actually, Francois Yergeau realized before I did that UTF-8
is the way to go, and we would not be where we are now if it
were not for his efforts.



2. How to Get There
-------------------

In Sevilla, I proposed two steps to move towards the solution.
The FIRST STEP is to specify UTF-8 as the preferred character
encoding for URLs, while leaving URLs unchanged otherwise.
The two important points here are "no enforcement" and
"canonical form".

Given current use, it is impossible to enforce anything anyway.
And some Unicode critics react allergically to being forced to use
Unicode. Also, URLs are a "hat" over many very different protocols
and mechanisms, each with its own community and special
needs. Note that if a protocol or mechanism does not use UTF-8 for
i18n, it is still possible (and preferable) to use UTF-8 in URLs
and to define a scheme-specific mapping from UTF-8 to that specific
encoding. Examples are the "modified UTF-7" defined for IMAP
[RFC 2060] and the "UTF-5" defined in my proposal for domain name
i18n (draft-duerst-dns-i18n-00.txt).

Canonical form means that only %HH-escaping is used, and no 8-bit
or native encoding. This is important tactically, because most people
concerned with URLs in general are very much afraid that anything
else cannot be handled (typed/transcribed/mailed). Although they
are mostly wrong about this, it is difficult to convince them in a
short time.
Of course, even with canonical form only, it's always possible
to offer a nice interface hiding the %HH (once you know how the
conversion is done!).


The SECOND STEP is of course that the characters can be
used directly where appropriate, e.g. in HTML documents.
It is important to note here that when URLs are carried in
such documents, it is not raw UTF-8 octets that are used
(unless the whole document is UTF-8, and except for
those parts of the URL that remain %HH-escaped).
The characters are carried in the native encoding of
the document! This is important to make transcoding
and editing by the user easy and natural, and
it conforms to what is done with the ASCII characters,
e.g. in a document encoded in EBCDIC.
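A sketch of what a browser might do with such a document (the helper name, the document charsets, and the example.org URL are all hypothetical, Python): the URL's characters arrive in the document's own encoding and are decoded along with the rest of the document; only when the request is actually made are they converted to UTF-8 and %HH-escaped, so differently encoded documents yield one canonical form.

```python
from urllib.parse import quote

def url_from_document(raw_href: bytes, document_charset: str) -> str:
    """Decode a href using the document's own charset, then produce
    the canonical UTF-8 %HH-escaped form for use on the wire."""
    characters = raw_href.decode(document_charset)
    return quote(characters, safe=":/")  # keep URL delimiters unescaped

# The same logical URL, carried in two differently encoded documents,
# canonicalizes to the same escaped form.
latin1_href = "http://example.org/Zürich".encode("latin-1")
utf8_href = "http://example.org/Zürich".encode("utf-8")

assert url_from_document(latin1_href, "latin-1") == \
       url_from_document(utf8_href, "utf-8") == \
       "http://example.org/Z%C3%BCrich"
```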



3. The Process
--------------

The two important documents on URLs currently being worked
on are the Syntax Draft and the Process Draft. The
Syntax Draft is an improvement and regrouping of earlier
RFCs on URLs (mainly RFC 1738 and RFC 1630). The current
version is draft-fielding-url-syntax-03.txt. This work is
carried out without an IETF working group; discussion takes
place on the otherwise defunct URI mailing list.
The main function of the Syntax Draft is to define URL
syntax, including the encoding of octets/characters and
the handling of relative URL references.

The Process Draft is the responsibility of a new IETF
working group. The mailing list of this working group
is the URL mailing list. The main purpose of this work
is to define conditions and a process for creating new
URL schemes.



4. The Proposals
----------------

On Feb 18th I will propose changes/additions to both drafts
on the respective lists. These changes, if accepted, will
correspond to STEP ONE (UTF-8 as the preferred character
encoding, with %HH escaping). My current guess is that there
is a good chance of these changes being accepted, IF there
is enough support for them. My proposal last December
for a similar change/addition to the Syntax Draft was
ignored because nobody seconded it, although there was
no opposition to it either.
I am sorry to bother you and ask for your help,
but past experience has shown that wide support
is crucial for i18n issues to get accepted.



5. What You Can Do
------------------

If you agree that we need to move towards consistent
internationalization of URLs, and that what I described
above is the right direction to go, you should subscribe
to the two mailing lists in question (traffic there is
currently low), and contribute to the discussion.
A lot of short contributions from many people are better
than a few long contributions from a small group.


To subscribe to the URI list (Syntax Draft), send a mail
with the text
	subscribe uri <your mail address>
to
	majordomo@services.bunyip.com
The archives for this list are at
	http://www.acl.lanl.gov/URI/archive/archives.html


To subscribe to the URL list (Process Draft), send a mail
with the text
	subscribe
to
	ietf-url-request@imc.org
The mailing list archive of this list is at
	http://www.imc.org/ietf-url/

To unsubscribe later, follow the instructions that the
mailing list software will send you when you subscribe.


To get the RFCs and internet-drafts mentioned above,
contact your nearest repository. As an example,
RFCs can be found as
	ftp://nic.nordu.net/rfc/rfcNNNN.txt
where NNNN is the RFC number. Internet-drafts can be found as
	ftp://nic.nordu.net/internet-drafts/<draft-name>
Please note that RFCs and internet-drafts can have a trailing .Z
if they are compressed, so you have to try with and without .Z.



With kind regards,	Martin Dürst.

----
Dr.sc.  Martin J. Dürst				    ' , . p y f g c R l / =
Institut für Informatik				     a o e U i D h T n S -
der Universität Zürich				      ; q j k x b m w v z
Winterthurerstrasse  190			     (the Dvorak keyboard)
CH-8057   Zürich-Irchel   Tel: +41 1 257 43 16
 S w i t z e r l a n d	   Fax: +41 1 363 00 35   Email: mduerst@ifi.unizh.ch
----

Received on Friday, 14 February 1997 08:20:53 UTC