Message-ID: <34B35BB5.106273BE@parc.xerox.com>
Date: Wed, 7 Jan 1998 02:40:53 PST
From: Larry Masinter <firstname.lastname@example.org>
To: "Patrik Fältström" <email@example.com>
CC: firstname.lastname@example.org, email@example.com, firstname.lastname@example.org,
Subject: Re: proposed rewrite of 2.1 of draft-fielding-uri-syntax ...

Patrik:

The document intended for Draft Standard is mainly saying that it is
*NOT SOLVING* the URI internationalization problem. We're not doing
that here. If you want to work on solving that problem, I suggest you
start with 'draft-masinter-url-i18n-00.txt', which Martin Duerst is
now editing.

> This is a good start, but I definitely think that the part talking about
> UTF-8 have to talk more about multibyte character sets, which will give the
> best example of what the difference is between a "URI character sequence"
> and "original character sequence".

The section is already (in my judgement) too long. We should not spend
more space to explain at length a problem which the document doesn't
solve. The purpose is to point out that it isn't solved here, but not
to do more than that.

> Ultimately I would want to have different words for the character in
> US-ASCII which the octet in the URI represents and the character in the URI
> (which can be represented by more than one character in US-ASCII).

See draft-masinter-url-i18n, not draft-fielding-uri-syntax.

> An example here would help. For example (I think this is what you are
> saying?):
>
>    http://foo.com/%31.html   ->   http://foo.com/A.html
>
>    URI character sequence         Original characters
>
> (I might calculate by hex value for A wrong...)

I suppose I really *do* need to put in an example.
The mapping is:

   http://foo.com/%31.html  ->  scheme="http", site="foo.com", path=["%31.html"]
   URI character sequence       parsed URI components

   "%31.html"      ->  nnn nnn nnn nnn nnn nnn  ->  "A.html"
   URI characters      octets                       original characters

> To conclude, we have a three level mapping, which is as follows:
>
> Original characters -> Translitterated string -> URI sequence

There are three steps to decoding:

   a) parse the URI characters
   b) (maybe) translate component URI character sequences to octet
      sequences
   c) (maybe) translate the octet sequences to original characters

> What the URI scheme papers should talk about are the "Original characters"

The "original characters" are not printed on paper. The only things
that should be printed on paper are the URI characters.

> and how the mappings to the translitterated strings should be done (i.e.
> from what is printed on paper, what is equality between two such
> strings...),

These topics are covered in the url-i18n draft.
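
As an illustration of the three decoding steps a) to c) listed above,
here is a minimal Python sketch. It is not taken from either draft.
The function name decode_uri and its charset parameter are hypothetical,
and it leans on Python's standard urllib.parse module rather than on
any parser defined in draft-fielding-uri-syntax. Step c) is only
attempted when the caller supplies a charset, since the URI itself does
not say which one applies.

```python
from urllib.parse import urlsplit, unquote_to_bytes


def decode_uri(uri, charset=None):
    """Hypothetical helper illustrating decoding steps a), b) and c)."""
    # Step a) parse the URI characters into components.
    parts = urlsplit(uri)
    segments = parts.path.split("/")[1:]  # drop the empty piece before the first "/"

    # Step b) translate each component's URI characters to an octet sequence.
    octets = [unquote_to_bytes(segment) for segment in segments]

    # Step c) (maybe) translate octets to original characters. This is only
    # possible if the charset is known from somewhere outside the URI.
    if charset is None:
        characters = None
    else:
        characters = [o.decode(charset) for o in octets]

    return {
        "scheme": parts.scheme,
        "site": parts.netloc,
        "path": segments,
        "octets": octets,
        "characters": characters,
    }


if __name__ == "__main__":
    # %31 is the octet 0x31, which US-ASCII reads as "1"; "A" would be %41.
    print(decode_uri("http://foo.com/%31.html", charset="ascii"))
    print(decode_uri("http://foo.com/%41.html", charset="ascii"))

    # The same two octets give different "original characters" depending
    # on the charset assumed in step c).
    print(decode_uri("http://foo.com/%C3%A4", charset="utf-8")["characters"])
    print(decode_uri("http://foo.com/%C3%A4", charset="iso-8859-1")["characters"])
```

Note that, run on the example URI quoted above, step b) yields the
octets of "1.html"; getting "A.html" would need %41. The last two calls
show why step c) is marked "(maybe)": the octets C3 A4 are one character
under UTF-8 and two under ISO-8859-1, and nothing in the URI says which
reading is intended.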