Re: [URN] Re: URI documents

Patrik Fältström (paf@swip.net)
Wed, 07 Jan 1998 16:12:14 +0100


Message-Id: <3.0.3.32.19980107161214.006b611c@nix.swip.net>
Date: Wed, 07 Jan 1998 16:12:14 +0100
To: Larry Masinter <masinter@parc.xerox.com>
From: Patrik Fältström <paf@swip.net>
Subject: Re: [URN] Re: URI documents
Cc: Harald Tveit Alvestrand <Harald.Alvestrand@maxware.no>,
In-Reply-To: <34B336E2.56E4F403@parc.xerox.com>

At 00:03 1998-01-07 PST, Larry Masinter wrote:
>I should point out that the syntax (and any scheme-specific semantics)
>are assigned to the character sequence, not to any octet sequence.
>In fact, the mapping of character sequences to octet sequences is
>part of the semantics that a scheme specifies. That's the reason
>why some schemes might employ different encoding mechanisms than
>%XX.

I don't agree with this, but that might be because of the overloaded use of
the word "character".

The way I interpret what you are saying is that a URI parser (yes, a URI
parser) should operate on the _characters_ in the URI string and not the
octets?

That means that I should be able to use percent-encoding of the fragment
identifier and still have the fragment delimiter, which in turn means that
the encoding does not have any meaning at all.
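
To illustrate the concern, here is a minimal Python sketch. It is not part
of the original message; the function names and the example URI are
hypothetical. It compares a parser that looks for '#' in the raw string
before percent-decoding with one that decodes first and then looks for '#'
among the resulting characters.

```python
from urllib.parse import unquote


def split_then_decode(uri):
    """Find the fragment delimiter in the raw string, then percent-decode."""
    base, _, fragment = uri.partition("#")
    return unquote(base), unquote(fragment)


def decode_then_split(uri):
    """Percent-decode first, then look for '#' among the decoded characters."""
    base, _, fragment = unquote(uri).partition("#")
    return base, fragment


# Hypothetical URI: "%23" is a percent-encoded '#', followed by a real '#'.
uri = "http://example.com/a%23b#frag"

print(split_then_decode(uri))  # ('http://example.com/a#b', 'frag')
print(decode_then_split(uri))  # ('http://example.com/a', 'b#frag')
```

In the second function the encoded "%23" is treated exactly like a literal
'#', so encoding it made no difference.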

I.e. what I am talking about, and what I think we agree on, is that we have
to define "characters", and we also have to agree on what octets are valid at
the various levels in the chain of parsing URIs. I see that we have four layers:

Client
    [BIG5]
  Maps from the native charset to some known charset,
  which is specified in the scheme definition.
    [UNICODE]
URI string
    [UNICODE]
  This is mapped into whatever the transliteration
  string is defined to be according to the
  _URI_SYNTAX_ document.
    [UTF-8 encoded UNICODE]
Transliterated string
    [UTF-8 encoded UNICODE]
  Here we can do some %-encoding if needed.
    [String in "US-ASCII"]
URI sequence of bytes
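
The chain above could be sketched in Python roughly as follows. This is
only an illustration and not part of the original message: the function
names are hypothetical, the transliteration step is assumed to be plain
UTF-8 encoding (the real rule would come from the URI syntax document),
and every octet outside the unreserved ASCII set is %-encoded.

```python
from urllib.parse import quote


def native_to_uri_string(raw: bytes, native_charset: str = "big5") -> str:
    """Client -> URI string: decode the native charset into Unicode."""
    return raw.decode(native_charset)


def transliterate(uri_string: str) -> bytes:
    """URI string -> transliterated string.

    Assumption: the transliteration is simply UTF-8 encoding.
    """
    return uri_string.encode("utf-8")


def to_uri_octets(transliterated: bytes) -> bytes:
    """Transliterated string -> URI bytes: %-encode into US-ASCII."""
    return quote(transliterated, safe="").encode("ascii")


# Two CJK characters (U+4E2D, U+6587) as they would arrive from a BIG5 client.
native = "\u4e2d\u6587".encode("big5")

uri_string = native_to_uri_string(native)
octets = to_uri_octets(transliterate(uri_string))
print(octets)  # b'%E4%B8%AD%E6%96%87'
```

Only the first function depends on the client's charset; the last two
operate on the same data regardless of where the URI string came from.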


The processes above are described in various documents, and I want
everything from the transliterated string and downwards to be described in
a URI syntax document, while what is above the transliterated string
should go in a URL/URN syntax document and various scheme definition
documents.

When _I_ talk about characters, I talk about characters in the URI string,
while the URI syntax document, when talking about the fragment delimiter '#'
as being forbidden in a URI, talks about the "Transliterated string". I.e.
semantics for schemes are on the URI string, while syntax and semantics for
URIs are on the transliterated string.

    Patrik


Email: paf@swip.net            URL: http://www.tele2.se
PGP: 4D38 91A4 27D9 C8B2 6975  D6BB 21D0 4C57 BD23 6602