URL internationalization!

Hello everybody,

In the last few days, I have been discussing URLs and
internationalization, and have felt that it's really
time to clear this up.

Those people really concerned with the issues have been
thinking about it a long time, and think that UTF-8 is
the way to go, in the same way that ISO 10646 was the
right thing to specify as a document character set
for HTML. I have explained the details in a previous message.

Besides the documents for each scheme (which of course
are waiting for at least a hint to a solution on the
general level), it's the syntax draft and the process
draft that are affected. Here I am providing a proposal
for changes to the syntax draft, the changes to the process
draft are discussed in the url mailing list.

The syntax draft has made enormous progress; in particular,
version 03 contains some valuable improvements over the
02 draft. But as there are still a lot of open issues
(see the list at the beginning), we lose no time, nor
anything else, by getting internationalization on the right
track, too, in particular because a well-thought-out
solution is available now.

The wording of the changes as proposed below is based on
my experience from the work on HTML, URNs, FTP, and so on.
The main ideas are to clearly specify a direction while not
including things that won't work currently, and also not to
overload this document with details (which can and will go
into other documents). I know very well that many people in
the i18n community will say that these changes don't go
far enough, and I agree with them. But I think it is best
to proceed step-by-step, and to check the results after
each step.

> Network Working Group                                     T. Berners-Lee
> INTERNET-DRAFT                                                   MIT/LCS
> <draft-fielding-url-syntax-03>                               R. Fielding
> Expires six months after publication date.                   U.C. Irvine
>                                                              L. Masinter
>                                                        Xerox Corporation
>                                                         29 December 1996
>                     Uniform Resource Locators (URL)

>    Issues:
>       1. We need to define a mechanism for using IPv6 addresses in the
>          URL hostname which will not break existing systems too badly.
>       2. Need a specific reference to the documents
>          defining Content-Base and Content-Language.

Can somebody explain the issue with Content-Language?
It looks like it is related to internationalization.
However, I didn't find it anywhere in the draft.

>       3. Examples should include one with multiple parameters and
>          one with multiple queries.
>       4. Suggestion to include a 'normalization' algorithm. Should we?
>       5. Is there semantics to empty fragment identifiers?
>       6. clarify issue with http://4kids/blah, where non FQDN is used.
>       7. Add [MHTML] reference   
>       8. URN/URI/URL issue

9. Internationalization
	[This point can be removed after the proposed
	  changes are integrated. :-)]

> 1.3. URL Transcribability
>    The URL syntax has been designed to promote transcribability as one
>    of its main concerns. A URL is a sequence of characters, i.e., letters,
>    digits, and special characters.  A URL may be represented in a

change one sentence:

A URL is a sequence of characters from a very limited set, i.e. the
letters of the basic Latin alphabet, digits, and some special characters.

[Justification: "character" is used in different circumstances and
senses later. It is important to make things clear up front.]

>    variety of ways: e.g., ink on paper, pixels on a screen, or a
>    sequence of octets in a coded character set.  The interpretation of a
>    URL depends only on the characters used and not how those characters
>    are represented on the wire.

>    There are several design concerns revealed by the scenario:
>       o  A URL is a sequence of characters, which is not always
>          represented as a sequence of octets.
>       o  A URL may be transcribed from a non-network source, and thus
>          should consist of characters which are most likely to be able
>          to be typed into a computer, within the constraints imposed by
>          keyboards (and related input devices) across languages and
>          locales.
>       o  A URL often needs to be remembered by people, and it is easier
>          for people to remember a URL when it consists of meaningful
>          components.
>    These design concerns are not always in alignment.  For example, it
>    is often the case that the most meaningful name for a URL component
>    would require characters which cannot be typed on most keyboards.
>    The ability to transcribe the resource
>    location from one medium to another was considered more
>    important than having its URL consist of the most meaningful of
>    components.

In local and regional contexts, and with improving technology, users
may greatly benefit from being able to use a wider range of characters.
However, at the current point in time, such use is not guaranteed to
work, and should therefore be avoided.

>    In a few cases, exceptions were made for characters already in
>    widespread use within URLs: the "~", "$" and "#" characters might
>    have otherwise been excluded from URLs.

> 2. URL Characters and Character Escaping
>    All URLs consist of a restricted set of characters, primarily chosen
>    to aid transcribability and usability both in computer
>    systems and in non-computer communications. In addition, characters
>    used conventionally as delimiters around URLs were excluded.  The
>    restricted set of characters consists of digits, letters, and a few
>    graphic symbols corresponding to a subset of the graphic printable
>    characters of the US-ASCII coded character set [12]; they are
>    common to most of the character encodings and input facilities
>    available to Internet users.
>    Within a URL, characters are either used as delimiters, or to
>    represent strings of data (octets) within delimited portions.  When
>    used to represent data directly, the character denotes the octet
>    corresponding to the US-ASCII code for that character.  In
>    addition, an octet may be represented by an escaped encoding.
>    Thus, the set of "characters" allowed within URLs can be described in
>    three categories: reserved, unreserved, and escaped.
>       urlc        = reserved | unreserved | escaped
> 1.5. Characters, octets, and encodings

Should be 2.1 (and the rest of the section numbers in Chapter 2
changed appropriately).

>    URLs are sequences of characters. Parts of those sequences of
>    characters are then used to represent sequences of octets. In turn,
>    sequences of octets are (frequently) used (with a character
>    encoding scheme) to represent characters. This means that when
>    dealing with URLs it's necessary to work at three levels:
>                      represented characters
>                                 ^
>                                 |
>                                 v
>                               octets
>                                 ^
>                                 |
>                                 v
>                          URL characters
>    This looks more complicated than necessary if all one is dealing
>    with is file names in ASCII, but is necessary when dealing with the
>    wide variety of systems in use. URL characters may represent octets
>    directly or with escape sequences (Section 2.3). Octets may
>    sometimes represent characters in ASCII, in other character
>    encodings, or sometimes be used to represent data that does not
>    correspond to characters at all.
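The three-level model above can be sketched in a few lines of Python.
This is purely illustrative: it assumes UTF-8 as the character encoding
scheme (as proposed further below), and the helper names and the exact
unreserved set used here are my own, not part of the draft.

```python
# Sketch of the three levels:
#   represented characters <-> octets <-> URL characters
# Assumes UTF-8 as the character encoding scheme (see the proposal below).

def chars_to_octets(s):
    """Level 1 -> 2: represented characters to a sequence of octets (UTF-8)."""
    return s.encode("utf-8")

def octets_to_url_chars(octets):
    """Level 2 -> 3: octets to URL characters, %-escaping every octet
    outside an unreserved US-ASCII subset (illustrative set)."""
    unreserved = (b"ABCDEFGHIJKLMNOPQRSTUVWXYZ"
                  b"abcdefghijklmnopqrstuvwxyz"
                  b"0123456789-_.!~*'()")
    return "".join(chr(o) if o in unreserved else "%%%02X" % o
                   for o in octets)

# A path component containing a non-ASCII character:
print(octets_to_url_chars(chars_to_octets("Zürich")))  # Z%C3%BCrich
```

Note that the escaping step operates on octets, not characters; which
character a given octet sequence represents is only defined once the
encoding scheme (here UTF-8) is agreed upon.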

Add (this is CRUCIAL!):

+ In current practice, all kinds of arbitrary and unspecified character
+ encoding schemes are used to represent the characters of the world.
+ This means that only the originator of the URL can determine which
+ character is represented by which octets.

+ To improve this, UTF-8 [RFC 2044] should be used to encode characters
+ represented by URLs wherever possible. UTF-8 is fully compatible with
+ US-ASCII, can encode all characters of the Universal Character Set,
+ and is in most cases easily distinguishable from legacy encodings
+ or random octet sequences.
+ Schemes, mechanisms, and the underlying protocols are encouraged
+ to start using UTF-8 directly (for new schemes, similar to URNs),
+ to make a gradual transition to UTF-8 (see draft-ietf-ftpext-intl-ftp-00.txt
+ for an example), or to define a mapping from their representation
+ of characters to UTF-8 if UTF-8 cannot be used directly
+ (see draft-duerst-dns-i18n-00.txt for an example).

[Comment: the references can be removed from the final text.]

+ Note: RFC 2044 specifies UTF-8 in terms of Unicode Version 1.1,
+ corresponding to ISO 10646 without amendments. There is widespread
+ consensus that this should indeed be Unicode Version 2.0,
+ corresponding to ISO 10646 including Amendment 5.
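Two of the UTF-8 properties claimed above are easy to check in
practice. The sketch below is only an illustration, not a normative
test; it uses a strict UTF-8 decode as the "distinguishable from
legacy encodings" check.

```python
# UTF-8 is fully compatible with US-ASCII:
# ASCII text encodes to exactly the same octets under both.
assert "example".encode("utf-8") == "example".encode("ascii")

# Legacy single-byte encodings are usually NOT well-formed UTF-8,
# so mislabeled octet sequences can often be detected by attempting
# a strict decode.
latin1_octets = "Zürich".encode("latin-1")   # b'Z\xfcrich'
try:
    latin1_octets.decode("utf-8")
    detected_as_utf8 = True
except UnicodeDecodeError:
    detected_as_utf8 = False
print(detected_as_utf8)  # False: the lone 0xFC octet is ill-formed UTF-8
```

This is exactly the property that makes a gradual transition feasible:
a receiver can treat octet sequences that decode cleanly as UTF-8, and
fall back to legacy handling otherwise.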

> 2.3.3. Excluded Characters

Change to "Characters and Octets"

>    Although they are not used within the URL syntax, we include here a
>    description of those US-ASCII characters which have been excluded
>    and the reasons for their exclusion.

Change "US-ASCII characters" to "US-ASCII characters and other octets"

>       excluded    = control | space | delims | unwise | national

Change "national" to "others". There is nothing particularly
national about octet values above 0x7F. There is also nothing
particularly national about a character such as A-grave. It is
used in many languages, by many nations.

>    All characters corresponding to the control characters in the

Change "characters" to "octets".

>    US-ASCII coded character set are unsafe to use within a URL, both
>    because they are non-printable and because they are likely to be
>    misinterpreted by some control mechanisms.
>       control     = <US-ASCII coded characters 00-1F and 7F hexadecimal>
>    The space character is excluded because significant spaces may
>    disappear and insignificant spaces may be introduced when URLs are
>    transcribed or typeset or subjected to the treatment of
>    word-processing programs.  Whitespace is also used to delimit URLs in
>    many contexts.
>       space       = <US-ASCII coded character 20 hexadecimal>
>    The angle-bracket "<" and ">" and double-quote (`"') characters are
>    excluded because they are often used as the delimiters around URLs in
>    text documents and protocol fields.  The character "#" is excluded
>    because it is used to delimit a URL from a fragment identifier in URL
>    references (Section 3). The percent character "%" is excluded because it is used
>    for the encoding of escaped characters.
>       delims      = "<" | ">" | "#" | "%" | <">
>    Other characters are excluded because gateways and other transport
>    agents are known to sometimes modify such characters, or they are
>    used as delimiters.
>       unwise      = "{" | "}" | "|" | "\" | "^" | "[" | "]" | "`"

Up to here, it's easier to speak about characters. But from here
on, it's definitely easier and clearer to speak about octets.

>    Finally, all other characters besides those mentioned in the above
>    sections are excluded because they are often difficult or impossible
>    to transcribe using traditional computer keyboards and software.

Change to:

Finally, octet values above 0x7F are excluded because, with the
current lack of a common convention for encoding the characters
they represent, they can be neither transcribed nor transcoded
reliably.

>       national    = <Any character not in the reserved, unreserved,
>                      control, space, delims, or unwise sets>

Change to:

	others	= <any octets with values above 0x7F>

>    Data corresponding to excluded characters must be escaped in order
>    to be properly represented within a URL.  However, there do exist
>    some systems that allow characters from the "unwise" and "national"
>    sets to be used in URL references (section 3); a robust
>    implementation should be prepared to handle those characters when
>    it is possible to do so.

It is not "possible to do so", so the above does not make sense.
Change to:

There exist some systems that allow characters/octets from the
"unwise" and "others" sets to be used in URL references (section 3).
Until a uniform representation for characters within URLs is firmly
established, such practice does not work and should be avoided.
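Under the proposed wording, classifying an octet into the excluded
sets might look like the sketch below. The category names follow the
draft's grammar (with "national" renamed to "others" as proposed); the
function itself is only illustrative.

```python
# Illustrative classification per the proposed "excluded" grammar:
#   excluded = control | space | delims | unwise | others
DELIMS = set(b'<>#%"')
UNWISE = set(b"{}|\\^[]`")

def excluded_category(octet):
    """Return the excluded-set name for an octet, or None if not excluded."""
    if octet <= 0x1F or octet == 0x7F:
        return "control"
    if octet == 0x20:
        return "space"
    if octet in DELIMS:
        return "delims"
    if octet in UNWISE:
        return "unwise"
    if octet > 0x7F:
        return "others"   # proposed replacement for "national"
    return None

print(excluded_category(0x20))      # space
print(excluded_category(0xFC))      # others
print(excluded_category(ord("a")))  # None
```

Working on octets here (rather than characters) matches the change
proposed above: an octet value above 0x7F has no fixed character
interpretation until an encoding convention is established.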

> 8. References

Add [RFC2044]:    F. Yergeau, "UTF-8, A Transformation Format of Unicode
                  and ISO 10646", Alis Technologies, October 1996.

I am looking forward to your comments and questions.

Regards,	Martin.

Received on Tuesday, 18 February 1997 08:55:57 UTC