Re: proposed rewrite of 2.1 of draft-fielding-uri-syntax ...

Larry Masinter (masinter@parc.xerox.com)
Wed, 7 Jan 1998 02:40:53 PST


Message-ID: <34B35BB5.106273BE@parc.xerox.com>
Date: Wed, 7 Jan 1998 02:40:53 PST
From: Larry Masinter <masinter@parc.xerox.com>
To: "Patrik =?iso-8859-1?Q?F=E4ltstr=F6m?=" <paf@swip.net>
CC: uri-i18n@unicode.org, fielding@ics.uci.edu, uri@bunyip.com,
Subject: Re: proposed rewrite of 2.1 of draft-fielding-uri-syntax ...

Patrik:

The document intended for Draft Standard is mainly saying that it is
*NOT SOLVING* the URI internationalization problem. We're not doing that
here. If you want to work on solving that problem, I suggest you
start with 'draft-masinter-url-i18n-00.txt', which Martin Duerst is
now editing.

> This is a good start, but I definitely think that the part talking about
> UTF-8 have to talk more about multibyte character sets, which will give the
> best example of what the difference is between a "URI character sequence"
> and "original character sequence".

The section is already (in my judgement) too long. We should not
spend more space explaining at length a problem that the document
doesn't solve. The purpose is to point out that it isn't solved here,
and nothing more than that.

> Ultimately I would want to have different words for the character in
> US-ASCII which the octet in the URI represents and the character in the URI
> (which can be represented by more than one character in US-ASCII).

See draft-masinter-url-i18n, not draft-fielding-uri-syntax.

> An example here would help. For example (I think this is what you are
> saying?):
> 
>    http://foo.com/%31.html -> http://foo.com/A.html
> 
>    URI character sequence     Original characters
> 
> (I might calculate by hex value for A wrong...)


I suppose I really *do* need to put in an example. The mapping
is:

   http://foo.com/%41.html  -> scheme="http", site="foo.com", path=["%41.html"]
     URI character sequence           parsed URI components
    "%41.html" ->  nnn nnn nnn nnn nnn nnn -> "A.html"
    URI characters      octets               original characters

 
> To conclude, we have a three level mapping, which is as follows:
> 
>   Original characters -> Translitterated string -> URI sequence

There are three steps to decoding:

 a) parse the URI characters
 b) (maybe) translate component URI character sequences to octet sequences
 c) (maybe) translate the octet sequences to original characters
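
As a rough illustration of those three steps, here is a minimal Python
sketch. It is not taken from either draft: the function name decode_uri
and the default charset are my own assumptions, and it uses %41 (the
escape for US-ASCII "A") as the example input. Step (c) only works if
the charset is known by some other means, which is exactly the part the
syntax draft leaves unsolved.

```python
from urllib.parse import urlsplit, unquote_to_bytes


def decode_uri(uri, charset="us-ascii"):
    """Hypothetical helper showing the three decoding steps."""
    # (a) Parse the URI characters into components.
    parts = urlsplit(uri)
    segments = parts.path.split("/")[1:]  # drop the empty piece before the leading "/"

    # (b) Translate each component's URI characters to an octet sequence.
    octets = [unquote_to_bytes(segment) for segment in segments]

    # (c) Translate the octets to original characters. The charset is an
    #     assumption supplied by the caller; the URI itself does not carry it.
    characters = [o.decode(charset) for o in octets]

    return parts.scheme, parts.netloc, segments, octets, characters


if __name__ == "__main__":
    print(decode_uri("http://foo.com/%41.html"))
```

Steps (b) and (c) are kept separate on purpose: the octets produced by
(b) are well defined for any URI, while (c) depends on a charset choice
made outside the URI.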

> What the URI scheme papers should talk about are the "Original characters"

The "original characters" are not printed on paper. The only things that
should be printed on paper are the URI characters.

> and how the mappings to the translitterated strings should be done (i.e.
> from what is printed on paper, what is equality between two such
> strings...),

These topics are covered in the url-i18n draft.