Re: URL internationalization! from Francois Yergeau on 1997-02-21 (uri@w3.org from February 1997)

From: Francois Yergeau <yergeau@alis.com>
Date: Thu, 20 Feb 1997 23:11:04 -0500
To: "Roy T. Fielding" <fielding@kiwi.ICS.UCI.EDU>
Cc: URI mailing list <uri@bunyip.com>
Message-Id: <3.0.1.32.19970220231104.00a6fe44@genstar.alis.ca>
At 11:54 20-02-97 -0800, Roy T. Fielding wrote:
>It would help a great deal if advocates of localization did not use
>the term internationalization; you are just creating unnecessary heat
>instead of solving the problem at hand.

In current terminology, as testified by a great many books and other
documents on the subject, internationalization is a process by which an
application, system, etc. is modified so that it can later support
localization by specialization instead of by extension.  You make the
system support multiple charsets (or a universal one), and use one (or part
of the universal one) in each locale.  You establish a generic message
mechanism, and use it with specific messages in each language.  And so on...

And I think this is very much the thrust of Martin's proposals:
internationalize URLs, so that they can reliably and deterministically be
localized anywhere.

>What Martin (and others) have suggested is that the existing requirements
>on internationalization are too severe.  In essence, he wants to make it
>legitimate for URLs to be localized (or lingua-centric), based on the
>conjecture that it is more important for locals to be able to use the
>most meaningful of names within a URL than it is that non-locals be able
>to use those same URLs at all.

Names within URLs are of relatively minor importance, and I wouldn't care
very much if it were the sole issue.  But URLs are also used for form
submission, may contain the subject of a (potential) mail message, and
possibly other such strings in present and future URL schemes, where it is
much more important, even crucial, to be able to represent non-ASCII
characters unambiguously.

>It is my opinion that URLs are, first and foremost, a uniform method of
>describing resource addresses such that they are usable by anyone in
>the world.  In my opinion, an address which has been localized at the
>expense of international usage is not a URL, or at least should be
>strongly discouraged.

I would agree with a somewhat modified version of the above: a pure address
(not a form submission or other such URL) which has been localized *is* a
URL, but should be discouraged on the grounds of its loss of universality.
Discouraged, not forbidden.

Not everyone cares about universality all the time.  There are cases where
actually using the non-ASCII characters you need is much more important
than allowing anyone in the world to type your URL.  The URL syntax should
facilitate that and make it as deterministic as possible, not forbid it or
make it available only by jumping through hoops.

>It is therefore my opinion that any attempt to increase the scope of
>the URL character set to include non-ASCII characters is a bad idea.
>This does not in any way restrict the nature of resources that can
>be addressed by a URL; it just means that the URL chosen should be an
>ASCII mapping, either one chosen by the user or one chosen automatically
>using the %xx encoding.  Yes, this is an inconvenience for non-English-
>based filesystems and resources, but that is the price to pay for true
>internationalization of resource access.

This is more than an inconvenience if the underlying character encoding is
left unspecified.  It is nondeterministic.
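
A minimal sketch of that ambiguity, in Python.  The sample name and the
choice of Latin-1 as the "legacy" encoding are assumptions made only for
illustration; any two encodings that disagree on non-ASCII characters
would show the same effect.

```python
from urllib.parse import quote, unquote_to_bytes

name = "français"  # hypothetical path segment containing a c-cedilla

# The same characters yield different %xx sequences under different encodings.
as_latin1 = quote(name, encoding="latin-1")
as_utf8 = quote(name, encoding="utf-8")
print(as_latin1)  # fran%E7ais
print(as_utf8)    # fran%C3%A7ais

# A receiver sees only octets; without knowing the encoding it must guess.
octets = unquote_to_bytes(as_utf8)
print(octets.decode("utf-8"))    # français
print(octets.decode("latin-1"))  # franÃ§ais
```

The %xx form alone carries octets, not characters, so the mapping back to
text depends on information the URL does not contain.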

Furthermore, I think universal access should be given no more than its
proper weight.  Those who want to provide universal access to a particular
resource SHOULD use %-encoding, but those who do not need, or even want, to
provide that should not be forced to use it for no good reason at all.  But
I'm flexible on that; the main problem is nondeterminism when non-ASCII
characters are actually required.

>>+ To improve this, UTF-8 [RFC 2044] should be used to encode characters
>>+ represented by URLs wherever possible. UTF-8 is fully compatible with
>>+ US-ASCII, can encode all characters of the Universal Character Set,
>>+ and is in most cases easily distinguishable from legacy encodings
>>+ or random octet sequences.
>>+
>>+ Schemes and mechanisms and the underlying protocols are suggested
>>+ to start using UTF-8 directly (for new schemes, similar to URNs),
>>+ to make a gradual transition to UTF-8 (see draft-ietf-ftpext-intl-ftp-00.txt
>>+ for an example), or to define a mapping from their representation
>>+ of characters to UTF-8 if UTF-8 cannot be used directly
>>+ (see draft-duerst-dns-i18n-00.txt for an example).
>>
>>[Comment: the references can be removed from the final text.]
>>
>>+ Note: RFC 2044 specifies UTF-8 in terms of Unicode Version 1.1,
>>+ corresponding to ISO 10646 without amendments. It is widespread
>>+ consensus that this should indeed be Unicode Version 2.0,
>>+ corresponding to ISO 10646 including amendment 5.
>
>None of the above belongs in this document.  That is the purpose of
>the "defining new URL schemes" document, which was previously removed
>from the discussion of the generic syntax.

These paragraphs address not only new schemes but also existing ones, HTTP
in particular, that badly need internationalization.  New schemes do not
need "a gradual transition to UTF-8", they can jump right in. These
paragraphs are crucially needed.
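
The quoted claim that UTF-8 "is in most cases easily distinguishable from
legacy encodings" can be illustrated with a short Python sketch.  The
function name looks_like_utf8 is hypothetical, and the check relies on
Python's present-day strict UTF-8 decoder, which is an assumption relative
to the RFC 2044 definition discussed here.

```python
def looks_like_utf8(octets: bytes) -> bool:
    """Return True if the octets form a well-formed UTF-8 sequence."""
    try:
        octets.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

print(looks_like_utf8(b"fran\xc3\xa7ais"))  # True: c-cedilla encoded in UTF-8
print(looks_like_utf8(b"fran\xe7ais"))      # False: c-cedilla encoded in Latin-1
print(looks_like_utf8(b"plain-ascii"))      # True: US-ASCII is a subset of UTF-8
```

This is a heuristic, hence "in most cases": some legacy octet sequences
(for example the Latin-1 text "Ã©", octets C3 A9) also happen to be
well-formed UTF-8.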

>That doesn't make any sense -- it is done every day.  Francois had a
>personal URL with a c-cedilla, 

Well I didn't, but I could have :-)  And others do, so...

>and it makes sense to admonish
>implementers that such things do occur and should not result in a
>system crash if such is avoidable.

Right.  In fact, the system not only MUST NOT crash, it SHOULD behave
the same as if it had received the corresponding %XX.
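
One way a receiver could get that behaviour is to escape any raw non-ASCII
octet on input, so that both forms reach the rest of the system identically.
The sketch below is in Python; the helper name escape_non_ascii and the
sample path are hypothetical, and it deliberately leaves existing %XX
escapes and ASCII characters untouched.

```python
def escape_non_ascii(url: bytes) -> str:
    """Replace every octet above 0x7F with its %XX escape; keep ASCII as-is."""
    return "".join(
        chr(octet) if octet < 0x80 else "%{:02X}".format(octet)
        for octet in url
    )

raw = b"/fran\xe7ais"       # c-cedilla sent as a bare Latin-1 octet
escaped = b"/fran%E7ais"    # the same octet sent as %XX

print(escape_non_ascii(raw))      # /fran%E7ais
print(escape_non_ascii(escaped))  # /fran%E7ais
```

Note that this only makes the two octet forms equivalent; it does not say
which character the octet E7 stands for, which is the separate encoding
question raised earlier.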

Regards,


-- 
François Yergeau <yergeau@alis.com>
Alis Technologies Inc., Montréal
Tél : +1 (514) 747-2547
Fax : +1 (514) 747-2561
Received on Thursday, 20 February 1997 23:58:34 UTC