Date: Fri, 20 Dec 1996 23:33:28 +0100 (MET)
From: "Martin J. Duerst" <firstname.lastname@example.org>
To: email@example.com
Subject: URLs and internationalization
Message-Id: <Pine.SUN.3.95.961220222452.245T-100000@enoshima>

Hello everybody,

This is hopefully the last of the series of mails regarding the URL
syntax draft (of course discussion may follow).

This series of mails may have created, with some of you, the impression
that I think there is nothing good in the current draft. If that is the
case, I apologize from my heart. I think it is a very good draft, and I
would like it to become even better. And I think this can be
accomplished without unnecessary delays.

So let me get to the point of internationalization (i18n) of URLs.
Currently, URLs are not in a very good state with respect to i18n, and
many people doubt whether that can be improved. I think it can. If you
look at the discussion in ftp-wg, the URN syntax draft, the IAB charset
workshop report (draft-weider-iab-char-wrkshop-00.txt), my draft on
domain name internationalization (draft-duerst-dns-i18n-00.txt), and in
particular http://www.alis.com:8085/~yergeau/url-00.html, you will see
that there is one direction we should go, namely UTF-8. There are also
some people who think that URL i18n should never happen; I have
addressed some of their concerns in my mail about transcribability.

In very many places the draft currently does the right thing, if not to
further URL i18n, then at least not to make it more difficult in the
future, and not to create too many legacy cases that would make the
transition harder. Below, I mention both these cases and those parts
where I think change is needed to keep the doors open for the future.

As there seems to be strong interest in finishing the draft soon, it
would probably be too time-consuming to include a full i18n solution,
including transitory provisions, in it. I therefore propose to write
(myself) a separate document on URL i18n.
I hope the newly forming working group will adopt it as one of their
documents, and will integrate the relevant portions of it into the
"URL schemes requirements" document that is currently the main focus
of the new group. I also volunteer to participate as author/editor of
that document, to take care of i18n and related issues.

After these preliminaries, let's have a look at the current syntax
draft:

> 1.4. Syntax Notation and Common Elements
>
> Unlike many specifications which use a BNF-like grammar to define the
> bytes (octets) allowed by a protocol, the URL grammar is defined in
> terms of characters. Each literal in the grammar corresponds to the
> character it represents, rather than to the octet encoding of that
> character in any particular coded character set. How a URL is
> represented in terms of bits and bytes on the wire is dependent upon
> the character encoding of the protocol used to transport it, or the
> charset of the document which contains it.

Good! If URLs are ever extended beyond their canonical form, and
decently internationalized, this will not have to be changed at all.

> 2.3.1. Escaped Encoding
>
> An escaped character is encoded as a character triplet, consisting of
> the percent character "%" followed by the two hexadecimal digits
> representing the character's octet code in an 8-bit coded character
> set. For example, "%20" is the escaped encoding for the space
> character.
>
>    escaped = "%" hex hex
>    hex     = digit | "A" | "B" | "C" | "D" | "E" | "F" |
>              "a" | "b" | "c" | "d" | "e" | "f"
>
> The 8-bit coded character set of the octet must be a superset of the
> US-ASCII coded character set, such that the US-ASCII characters have
> the same escaped encoding regardless of the larger octet character
> set.

I commented on this in terms of protocol autonomy; there are some
important concerns for i18n as well. It is nice to see that people
think URLs make more sense if the characters they represent can be
identified.
But it is extremely presumptuous and unfair to guarantee this without
exception for ASCII, while offering no guarantee whatsoever to the
rest of the world. I therefore propose that the paragraph:

> The 8-bit coded character set of the octet must be a superset of the
> US-ASCII coded character set, such that the US-ASCII characters have
> the same escaped encoding regardless of the larger octet character
> set.

be removed. I also strongly suggest that the draft be reverted to the
"octet" -> "character" model of the previous RFC, and that the
language from that RFC be taken:

> The coded character set chosen must correspond to the character
> set of the mechanism that will interpret the URL component in which
> the escaped character is used. A sequence of escape triplets are
> used if the character is coded as a sequence of octets.

It makes ample sense that the mapping from a URL to the octets used by
the mechanism is deterministic and well specified, without any
external information. But there is no need for the %HH-decoded octets
to correspond exactly to what is used by the mechanism. For a good
example of why this is so, please see my draft-duerst-dns-i18n-00.txt.
I therefore propose that the above paragraph be removed and replaced
by:

+ The definition of individual URL schemes must assure that the
+ mappings from the resource identification to a URL, and from
+ the URL to the mechanisms and protocols required to access
+ the resource, are defined unambiguously.

> Any character, from any character set, can be included in a URL via
> the escaped encoding, provided that the mechanism which will
> interpret the URL has an octet encoding for that character. However,
> only that mechanism (the originator of the URL) can determine which
> character is represented by the octet. A client without knowledge of
> the origination mechanism cannot unescape the character for display.

This is the current, deplorable state. It is not satisfying at all.
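To make the ambiguity concrete, here is a small sketch (Python is used
purely for illustration; the strings and charsets are hypothetical
examples, not taken from the draft). Percent-decoding is fully
deterministic at the octet level, but the step from octets to
characters depends entirely on which coded character set the
originating mechanism happened to use:

```python
import urllib.parse

# Percent-decoding is deterministic at the octet level:
# "%E9" always yields the single octet 0xE9.
octets = urllib.parse.unquote_to_bytes("caf%E9")
assert octets == b"caf\xe9"

# But mapping those octets to characters needs out-of-band knowledge;
# the same octets spell a different word under each charset:
assert octets.decode("latin-1") == "café"     # LATIN SMALL LETTER E WITH ACUTE
assert octets.decode("iso8859-7") == "cafι"   # GREEK SMALL LETTER IOTA

# Had the originator encoded the same character in UTF-8, it would
# have been escaped, unambiguously, as a sequence of two triplets:
assert urllib.parse.quote("café".encode("utf-8")) == "caf%C3%A9"
```

A client that sees only "caf%E9" has no way to choose between these
readings; only the originator knows.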
It can be changed, not overnight, but step by step. As a preparation,
I propose replacement by the following (assuming that, generally
speaking, the draft is changed to the "octet" -> "character" model):

+ The octets encoded in the URL will in many cases in turn encode
+ characters. In current practice, various encodings are used,
+ which means that only the originator of the URL can determine
+ which character is represented by which octets.
+
+ It can be expected that in the future, UTF-8 [RFC 2044], which
+ is fully compatible with US-ASCII, will be the encoding of
+ choice for URL components. It is suggested that schemes,
+ mechanisms, and the underlying protocols start using UTF-8
+ directly (for new schemes, similar to [URN]), make a gradual
+ transition to UTF-8 (see draft-ietf-ftpext-intl-ftp-0?.txt
+ for an example), or define a mapping from their representation
+ of characters to UTF-8 if UTF-8 cannot be used directly
+ (see draft-duerst-dns-i18n-0?.txt for an example).

This proposal may seem quite daring to many of you. But it is in nice
accordance with a well-known precedent: the specification, in RFC
1866, of ISO 10646 as the "future document character set" for HTML.
And it is less strict: no interpretation of octets in terms of UTF-8
is required, and no encoding of represented characters in terms of
UTF-8 is required (whereas RFC 1866 requires interpretation of numeric
character references in terms of ISO 10646). A big advantage of this
proposal is also that the many readers of this document will be
alerted to the issue and will be able to judge for themselves.

> 2.3.3. Excluded Characters
>
> Excluded characters must be escaped in order to be properly
> represented within a URL. However, there do exist some systems that
> allow characters from the "unwise" and "national" sets to be used in
> URL references; a robust implementation should be prepared to handle
> those characters when it is possible to do so.

This is very dangerous.
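A minimal sketch of the danger (again in Python, with hypothetical
document content and charsets chosen only for illustration): a
document that carries an unescaped non-ASCII octet inside a URL does
not survive transcoding, whereas the escaped form, being pure ASCII,
passes through unchanged:

```python
# A document containing a URL with an unescaped non-ASCII character,
# stored in Latin-1 (document text and charsets are illustrative):
doc_latin1 = 'see <a href="http://example.org/café">'.encode("latin-1")

# Transcoding the document to UTF-8 rewrites the URL's octets:
doc_utf8 = doc_latin1.decode("latin-1").encode("utf-8")
assert b"caf\xe9" in doc_latin1      # URL octets before transcoding
assert b"caf\xc3\xa9" in doc_utf8    # different octets afterwards

# The escaped form is pure ASCII and survives the same transcoding:
esc = 'see <a href="http://example.org/caf%E9">'.encode("latin-1")
assert esc == esc.decode("latin-1").encode("utf-8")
```

The server receiving the unescaped URL from the transcoded document
would see octets the originator never published.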
It sounds as if some systems could deal with such cases, but with
"charset" labeling of document content and the increasing use of
transcoding, this will work less and less! The paragraph should
therefore be changed as follows:

+ Excluded octets must be escaped in all cases in order to be
+ properly represented, transmitted, and transcoded within a URL.
+ There exist some systems that allow the unescaped use of such
+ octets (and the characters they represent). As long as, and
+ for those components where, there is no uniform solution
+ (see [the last proposed text]), the consistency of such
+ URLs across various transports and transcodings cannot be
+ guaranteed in any way.

Enjoy the holidays,    Martin.