Re: URL internationalization!

First, I want to get some terminology straight.  The issue at hand is
not internationalization, since the only international character
set at the current time is US-ASCII (i.e., ISO 646).  No, I don't mean
that US-ASCII is capable of representing all languages -- it isn't.
What I mean is that it is the only character set that is displayable
and typeable on almost all, if not all, computers in use on the Internet.

It would help a great deal if advocates of localization did not use
the term internationalization; you are just creating unnecessary heat
instead of solving the problem at hand.

What Martin (and others) have suggested is that the existing requirements
on internationalization are too severe.  In essence, he wants to make it
legitimate for URLs to be localized (or lingua-centric), based on the
conjecture that it is more important for locals to be able to use the
most meaningful of names within a URL than it is that non-locals be able
to use those same URLs at all.

It is my opinion that URLs are, first and foremost, a uniform method of
describing resource addresses such that they are usable by anyone in
the world.  In my opinion, an address which has been localized at the
expense of international usage is not a URL, or at least should be
strongly discouraged.  This is, I think, one of the basic philosophies
behind the URI design, and what I tried to describe in the URL syntax
document.  It is one of the key reasons why URIs succeeded where all
other attempts at a uniform address syntax have failed.

It is therefore my opinion that any attempt to increase the scope of
the URL character set to include non-ASCII characters is a bad idea.
This does not in any way restrict the nature of resources that can
be addressed by a URL; it just means that the URL chosen should be an
ASCII mapping, either one chosen by the user or one chosen automatically
using the %xx encoding.  Yes, this is an inconvenience for non-English-
based filesystems and resources, but that is the price to pay for true
internationalization of resource access.
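
To make that %xx mapping concrete, here is a rough Python sketch of one
such automatic ASCII mapping.  UTF-8 is assumed here purely for
illustration; the generic syntax itself does not fix any particular
octet encoding.

    from urllib.parse import quote, unquote

    # A resource name that cannot be typed on most keyboards.
    name = "r\u00e9sum\u00e9.html"               # "résumé.html"

    # Map it to octets (UTF-8 chosen by the URL creator in this sketch),
    # then escape everything outside the unreserved set as %xx, yielding
    # a pure US-ASCII path component.
    ascii_component = quote(name.encode("utf-8"), safe="")
    print(ascii_component)                       # r%C3%A9sum%C3%A9.html

    # Anyone in the world can transcribe and dereference this form; only
    # the originating system needs to know how to undo the mapping.
    print(unquote(ascii_component))              # résumé.html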

Nevertheless, I am not one to believe in forcing, by way of standard,
a technological solution to a social problem.  If people want to create
locals-only URLs, I am not the kind of person to stand in their way.
However, I am the kind of person who would tell them they are being
shortsighted, and I believe that kind of guidance should remain in
the specification.

In regard to the changes proposed by Martin J. Duerst:

>>       2. Need a specific reference to the documents
>>          defining Content-Base and Content-Language.
>
>Can somebody explain the issue with Content-Language?

I think Larry meant to write Content-Location.  The reference is RFC 2068
on HTTP/1.1.

>> 1.3. URL Transcribability
>> 
>>    The URL syntax has been designed to promote transcribability as one
>>    of its main concerns. A URL is a sequence of characters, i.e., letters,
>>    digits, and special characters.  A URL may be represented in a
>
>change one sentence:
>
>A URL is a sequence of characters from a very limited set, i.e. the
>letters of the basic Latin alphabet, digits, and some special characters.
>
>[Justification: "character" is used in different circumstances and
>senses later. It is important to make things clear up front.]

That seems like a good idea.

>>    These design concerns are not always in alignment.  For example, it
>>    is often the case that the most meaningful name for a URL component
>>    would require characters which cannot be typed on most keyboards.
>>    The ability to transcribe the resource
>>    location from one medium to another was considered more
>>    important than having its URL consist of the most meaningful of
>>    components.
>
>Add:
>In local and regional contexts and with improving technology, users
>may greatly benefit from being able to use a wider range of characters.
>However, at the current point in time, such use is not guaranteed to
>work, and should therefore be avoided.

I would strike the word "greatly", but otherwise this is true.

>Should be 2.1 (and the rest of the section numbers in Chapter 2
>changed appropriately).

Yep.

>Add (this is CRUCIAL!):
>
>+ In current practice, all kinds of arbitrary and unspecified character
>+ encoding schemes are used to represent the characters of the world.
>+ This means that only the originator of the URL can determine which
>+ character is represented by which octets.

Replace "all kinds of arbitrary and" with "multiple" and it's okay.
There is nothing arbitrary about it.  However, the wording that existed
in earlier drafts was considerably better, since it didn't preclude an
application from showing what it did know about the character encoding.

>+ To improve this, UTF-8 [RFC 2044] should be used to encode characters
>+ represented by URLs wherever possible. UTF-8 is fully compatible with
>+ US-ASCII, can encode all characters of the Universal Character Set,
>+ and is in most cases easily distinguishable from legacy encodings
>+ or random octet sequences.
>+
>+ Schemes and mechanisms and the underlying protocols are suggested
>+ to start using UTF-8 directly (for new schemes, similar to URNs),
>+ to make a gradual transition to UTF-8 (see draft-ietf-ftpext-intl-ftp-00.txt
>+ for an example), or to define a mapping from their representation
>+ of characters to UTF-8 if UTF-8 cannot be used directly
>+ (see draft-duerst-dns-i18n-00.txt for an example).
>
>[Comment: the references can be removed from the final text.]
>
>+ Note: RFC 2044 specifies UTF-8 in terms of Unicode Version 1.1,
>+ corresponding to ISO 10646 without ammendments. It is widespread
>+ consensus that this should indeed be Unicode Version 2.0,
>+ corresponding to ISO 10646 including ammendment 5.

None of the above belongs in this document.  That is the purpose of
the "defining new URL schemes" document, which was previously removed
from the discussion of the generic syntax.
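
(For what it is worth, the quoted claim that well-formed UTF-8 is
usually distinguishable from legacy encodings can be illustrated with a
small Python sketch; strict UTF-8 decoding is assumed as the test.)

    # Bytes produced by a legacy single-byte encoding such as Latin-1
    # are usually rejected by a strict UTF-8 decoder.
    def looks_like_utf8(octets: bytes) -> bool:
        try:
            octets.decode("utf-8")               # strict by default
            return True
        except UnicodeDecodeError:
            return False

    print(looks_like_utf8("caf\u00e9".encode("utf-8")))    # True
    print(looks_like_utf8("caf\u00e9".encode("latin-1")))  # False (lone 0xE9)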

>> 2.3.3. Excluded Characters
>
>Change to "Characters and Octets"
>
>>    Although they are not used within the URL syntax, we include here a
>>    description of those US-ASCII characters which have been excluded
>>    and the reasons for their exclusion.
>
>Change "US-ASCII characters" to "US-ASCII characters and other octets"

I'll leave that to Larry's judgement, since the reemphasis of octets
over characters may have left some confusion in the document.

>>       excluded    = control | space | delims | unwise | national
>
>Change "national" to "others". There is nothing particularly
>national about octet values above 0x7F. There is also nothing
>particularly national about a character such as A-grave. It is
>used in many languages, by many nations.

Okay -- it was just a leftover from the old BNF.

>>    All characters corresponding to the control characters in the
>
>Change "characters" to "octets".

The first occurrence, yes.  Larry, please write these changes such
that they still make sense when the URL is pasted on a billboard sign
instead of in a protocol stream.

>Up to here, it's easier to speak about characters. But from here
>on, it's definitely easier and clearer to speak about octets.
>
>>    Finally, all other characters besides those mentioned in the above
>>    sections are excluded because they are often difficult or impossible
>>    to transcribe using traditional computer keyboards and software.
>
>Change to:
>
>Finally, octet values above 0x7F are excluded because with the
>current lack of a common convention for encoding the characters
>they represent, they can neither be transcribed nor transcoded
>reliably.

No, we are still talking about characters here -- octets are not
relevant to whether or not A-grave is excluded.  The existing paragraph
is better than the proposed change.

>>       national    = <Any character not in the reserved, unreserved,
>>                      control, space, delims, or unwise sets>
>
>Change to:
>
>	others	= <any octets with values above 0x7F>

No -- "others" is fine, but the BNF definition must remain as is in
order to correctly define URLs that have no representation in bytes.
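
A two-line Python illustration of why the definition stays at the
character level: the same excluded character corresponds to entirely
different octet sequences depending on which encoding the URL creator
happened to use.

    a_grave = "\u00c0"                    # LATIN CAPITAL LETTER A WITH GRAVE
    print(a_grave.encode("latin-1"))      # b'\xc0'      -- one octet
    print(a_grave.encode("utf-8"))        # b'\xc3\x80'  -- two octets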

>>    Data corresponding to excluded characters must be escaped in order
>>    to be properly represented within a URL.  However, there do exist
>>    some systems that allow characters from the "unwise" and "national"
>>    sets to be used in URL references (section 3); a robust
>>    implementation should be prepared to handle those characters when
>>    it is possible to do so.
>
>It is not "possible to do so", so the above does not make sense.

That doesn't make any sense -- it is done every day.  Francois had a
personal URL with a c-cedilla, and it makes sense to admonish
implementers that such things do occur and should not result in a
system crash if such is avoidable.
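
As a sketch of that admonition: a consumer handed a URL reference
containing such raw characters can simply re-escape them rather than
fail.  UTF-8 is assumed below only for illustration, and
normalize_reference is a hypothetical helper, not anything defined by
the draft.

    from urllib.parse import quote

    def normalize_reference(ref: str) -> str:
        # Re-escape any character outside the reserved/unreserved sets
        # (e.g. a raw c-cedilla) instead of crashing on it.
        reserved = ";/?:@&=+$,"
        extra    = "-_.!~*'()%#"
        return quote(ref, safe=reserved + extra, encoding="utf-8")

    print(normalize_reference("http://example.org/fran\u00e7ois"))
    # http://example.org/fran%C3%A7ois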

Hmmm, I used to have a section/paragraph on why clients can't convert
%xx encodings to characters for the purpose of display unless they
have some knowledge of the character set of the underlying URL-creation
process, as is the case for all filesystem URLs which are local to
the client.  It is unfortunate that it was deleted, since I was going
to suggest that if the scheme defines that only a single character
encoding can be used for creating the %xx encoding, then the client does
have sufficient knowledge to display that data in its natural form.
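
A short Python sketch of that suggestion: a client unescapes %xx data
for display only when the scheme pins down a single character encoding.
The "utf-8" value and the display_form helper below are placeholders
for illustration, not anything a current scheme actually defines.

    from urllib.parse import unquote

    def display_form(escaped, scheme_encoding):
        if scheme_encoding is None:
            # Encoding of the underlying data is unknown: show %xx as-is.
            return escaped
        return unquote(escaped, encoding=scheme_encoding, errors="strict")

    print(display_form("fran%C3%A7ois", "utf-8"))   # françois
    print(display_form("fran%C3%A7ois", None))      # fran%C3%A7ois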

 ...Roy T. Fielding
    Department of Information & Computer Science    (fielding@ics.uci.edu)
    University of California, Irvine, CA 92697-3425    fax:+1(714)824-4056
    http://www.ics.uci.edu/~fielding/

Received on Thursday, 20 February 1997 15:39:10 UTC