Re: I18N Concensus - Generic Syntax Document

Martin J. Duerst (mduerst@ifi.unizh.ch)
Thu, 6 Mar 1997 20:40:08 +0100 (MET)


Date: Thu, 6 Mar 1997 20:40:08 +0100 (MET)
From: "Martin J. Duerst" <mduerst@ifi.unizh.ch>
To: Rich Petke <R.PETKE@csi.compuserve.com>
Cc: URI List <uri@bunyip.com>
Subject: Re: I18N Concensus - Generic Syntax Document
In-Reply-To: <CSI_6317-123060@CompuServe.COM>
Message-Id: <Pine.SUN.3.95q.970306203216.245a-100000@enoshima>

The distinction between the two lists, the URI list for the syntax
draft and the URL list for the process draft, seems confusing,
in particular for issues such as i18n of URLs, where both
drafts are affected.

On 6 Mar 1997, Rich Petke wrote:

> Comments on I18N issues in the generic URL syntax document seem to have ended.
> Martin has edited his comments to reflect input from these discussions and
> posted them at:
> 
> http://www.ifi.unizh.ch/groups/mml/mduerst/urli18n.html
> 
> Is there any more debate on the subject?  Does anyone else have comments,
> contributions?

The above has been done, but with respect to the process draft.


> We really need to move forward on the Generic URL Syntax document!

I fully agree with this! To help, I have integrated references
to the syntax draft discussion into the document at the above URL,
and I am sending an updated proposal for changes to the syntax draft
below. It is mainly based on the comments made by Roy Fielding.

I do not repeat the parts that Roy has agreed to in
http://www.acl.lanl.gov/URI/archive/uri-96.messages/0241.html.



> 2. URL Characters and Character Escaping

>    URLs are sequences of characters. Parts of those sequences of
>    characters are then used to represent sequences of octets. In turn,
>    sequences of octets are (frequently) used (with a character
>    encoding scheme) to represent characters. This means that when
>    dealing with URLs it's necessary to work at three levels:
> 
>                      represented characters
>                                 ^
>                                 |
>                                 v
>                               octets
>                                 ^
>                                 |
>                                 v
>                          URL characters
> 
>    This looks more complicated than necessary if all one is dealing
>    with is file names in ASCII, but is necessary when dealing with the
>    wide variety of systems in use. URL characters may represent octets
>    directly or with escape sequences (Section 2.3). Octets may
>    sometimes represent characters in ASCII, in other character
>    encodings, or sometimes be used to represent data that does not
>    correspond to characters at all.

Add (this is CRUCIAL!):

+ In current practice, multiple and unspecified character
+ encoding schemes are used to represent the characters of the world.
+ This means that only the originator of the URL can determine which
+ character is represented by which octets.

[Roy agreed to this, with a slight change I have included.]
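
As an illustration of the point above, here is a minimal sketch in
Python (my choice of language for illustration, not something the draft
prescribes). The escape sequence "%E9" and the two encodings are
arbitrary examples. The same escaped octet yields different characters
depending on which encoding the reader assumes:

```python
from urllib.parse import unquote_to_bytes

# One escaped octet as it might appear in a URL (example value).
escaped = "%E9"

# URL characters -> octets: this step is unambiguous.
octets = unquote_to_bytes(escaped)

# Octets -> characters: this step depends on an encoding that the URL
# itself does not carry. Two plausible guesses give two answers.
as_latin1 = octets.decode("iso-8859-1")  # LATIN SMALL LETTER E WITH ACUTE
as_greek = octets.decode("iso-8859-7")   # GREEK SMALL LETTER IOTA

print(repr(as_latin1), repr(as_greek))
```

Without out-of-band knowledge from the originator, a recipient cannot
tell which of the two readings was intended.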


+ It is recommended that UTF-8 [RFC 2044] be used to represent characters
+ with octets in URLs, wherever possible.
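
A minimal sketch of what this recommendation amounts to, using Python's
standard urllib.parse module; the name "Dürst" is only an example value.
It walks the three levels from the draft: characters to octets (via
UTF-8), then octets to URL characters (via escaping), and back again:

```python
from urllib.parse import quote, unquote

# Represented characters (example value, "Dürst" with u-umlaut).
name = "D\u00fcrst"

# Characters -> octets, using UTF-8 as recommended.
octets = name.encode("utf-8")

# Octets -> URL characters: octets outside the allowed set are escaped.
url_chars = quote(octets)
print(url_chars)  # D%C3%BCrst

# The reverse direction works because both sides agree on UTF-8.
assert unquote(url_chars, encoding="utf-8") == name
```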

+ For schemes where no single character->octet encoding is specified,
+ a gradual transition to UTF-8 can be made by servers making resources
+ available with UTF-8 names on their own, on a per-server or a
+ per-resource basis. Schemes and mechanisms that use a well-defined
+ character->octet encoding other than UTF-8 should
+ define the mapping between this encoding and UTF-8, because generic
+ URL software is unlikely to be aware of, or able to handle,
+ such specific conventions.

[Comment: I have removed specific examples from my previous text and
have made it shorter. But the essence remains the same.]
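
To make the idea of a mapping between a scheme-specific encoding and
UTF-8 concrete, here is a rough sketch. The function name
reescape_as_utf8 is hypothetical, and it assumes the source encoding is
known in advance, which is exactly what generic URL software usually
cannot assume:

```python
from urllib.parse import quote, unquote_to_bytes


def reescape_as_utf8(escaped: str, source_encoding: str) -> str:
    """Re-express an escaped URL part, known to use source_encoding, in UTF-8."""
    # URL characters -> octets.
    octets = unquote_to_bytes(escaped)
    # Octets -> characters, using the encoding the scheme defines.
    text = octets.decode(source_encoding)
    # Characters -> UTF-8 octets -> URL characters.
    return quote(text.encode("utf-8"))


print(reescape_as_utf8("caf%E9", "iso-8859-1"))  # caf%C3%A9
```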

+ Note: RFC 2044 specifies UTF-8 in terms of Unicode Version 1.1,
+ corresponding to ISO 10646 without amendments. There is widespread
+ consensus that this should in fact be Unicode Version 2.0,
+ corresponding to ISO 10646 including amendment 5.

[This has to stay in as long as a new version of RFC 2044 is not available.]


> 2.3.3. Excluded Characters

[There was some discussion about where in this section "character"
should be used, and where "octet". My reading of this section is that
it deals with the octet->URL character step, and that it therefore
has to tell which *octets* are excluded. In cases where these
correspond to ASCII characters, "character" may be used, but
in cases where it is entirely unclear which characters the
octets correspond to, speaking about characters is not feasible.

To spare the list the details, I propose that Larry, Roy, and I
work on them privately. Maybe Larry can send the other two of
us a draft in which he has integrated Roy's and my comments on this
"character" vs. "octet" matter.]


>    Data corresponding to excluded characters must be escaped in order
>    to be properly represented within a URL.  However, there do exist
>    some systems that allow characters from the "unwise" and "national"
>    sets to be used in URL references (section 3); a robust
>    implementation should be prepared to handle those characters when
>    it is possible to do so.

Change to:

There exist some systems that allow characters/octets from the
"unwise" and "others" sets to be used in URL references (section 3).
Until a uniform representation for characters within URLs is firmly
established, such practice is not stable with respect to transcoding
and therefore should be avoided.
However, robust implementations should be prepared to handle those
octet values when it is possible to do so.

[This is an integration of my earlier proposal and Roy's comments.]
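
One way a robust implementation might handle such octets is to escape
them on input rather than guess at their meaning. The sketch below is
only an illustration: the function name escape_high_octets is
hypothetical, and it assumes that the octets of concern are those above
0x7F, leaving the "unwise" ASCII characters aside:

```python
def escape_high_octets(raw: bytes) -> str:
    """Escape every octet above 0x7F; pass ASCII octets through unchanged."""
    parts = []
    for b in raw:
        if b < 0x80:
            parts.append(chr(b))             # ASCII octet, kept as is
        else:
            parts.append("%{:02X}".format(b))  # escaped as %XX
    return "".join(parts)


print(escape_high_octets(b"/caf\xe9"))  # /caf%E9
```

Note that this preserves the octet values without deciding which
characters they stand for, so it does not depend on knowing the
encoding.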



I am looking forward to further comments.

Regards,	Martin.