Re: [URN] Re: URI documents

Larry Masinter (masinter@parc.xerox.com)
Wed, 7 Jan 1998 10:02:44 PST


Message-ID: <34B3C344.D2B80DAA@parc.xerox.com>
Date: Wed, 7 Jan 1998 10:02:44 PST
From: Larry Masinter <masinter@parc.xerox.com>
To: "Patrik =?iso-8859-1?Q?F=E4ltstr=F6m?=" <paf@swip.net>
CC: Harald Tveit Alvestrand <Harald.Alvestrand@maxware.no>,
Subject: Re: [URN] Re: URI documents

This point is really hard to make, apparently, and the
current text fails to make it. I'd appreciate any suggestions
for how to word this to make it clearer.

I said:
> >I should point out that the syntax (and any scheme-specific semantics)
> >are assigned to the character sequence, not to any octet sequence.
> >In fact, the mapping of character sequences to octet sequences is
> >part of the semantics that a scheme specifies. That's the reason
> >why some schemes might employ different encoding mechanisms than
> >%XX.
> 
And Patrik replied:

> The way I interpret what you are saying is that a URI parser (yes, a URI
> parser) should operate on the _characters_ in the URI string and not the
> octets?
> 
> That means, that I should be able to use percent encoding of the fragment
> identifier, and still have the fragment delimiter, which in turn means that
> the encoding does not have any meaning at all.

No. The URI  b://a/%2Fc

contains the CHARACTERS "b", ":", "/", "/", "a", "/", "%", "2", "F", "c".

At this level, the "%", "2", and "F" are just characters. They should NOT
be decoded, scanned, parsed, or treated in any special way prior to parsing.
The mechanism by which the sequence "%", "2", "F" is turned into a single
octet MUST NOT be applied until AFTER the URI has been scanned.

If you have "b://a/%2Fc" in EBCDIC, or in UTF-16 (which uses double bytes
for representing sequences of characters), you should parse the URI
in the native encoding for the delimiters "/", "%", ":", etc., and then
take the remaining character sequences scheme=["b"], site=["a"],
path=["%2Fc"], and, based on the scheme, turn the remaining components
into octet sequences.
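
A minimal sketch of this ordering, in Python. The helper names
(split_uri, path_segments_as_octets) are made up for illustration, the
"scheme://site/path" shape is a simplification, and the sketch assumes
that scheme "b" uses the %XX mechanism for its path; cp037 stands in
for EBCDIC. None of this comes from the URI syntax draft itself.

```python
from urllib.parse import unquote, unquote_to_bytes


def split_uri(uri: str):
    """Split 'scheme://site/path' by looking only at delimiter CHARACTERS.

    Hypothetical helper: '%', '2', 'F' are ordinary characters at this level.
    """
    scheme, sep, rest = uri.partition("://")
    if not sep:
        raise ValueError("expected scheme://site/path")
    site, _, path = rest.partition("/")
    return scheme, site, path


def path_segments_as_octets(path: str):
    """Split the path on '/' first, THEN apply %XX decoding per segment.

    Assumption: the scheme in question uses %XX for its path component.
    """
    return [unquote_to_bytes(segment) for segment in path.split("/")]


# The same character sequence stored in three different charsets.
# The octets on disk differ, but the characters (and the parse) do not.
raw_forms = {
    "ascii": "b://a/%2Fc".encode("ascii"),
    "cp037": "b://a/%2Fc".encode("cp037"),    # an EBCDIC code page
    "utf-16": "b://a/%2Fc".encode("utf-16"),
}

results = {}
for charset, raw in raw_forms.items():
    uri = raw.decode(charset)                 # octets -> characters
    scheme, site, path = split_uri(uri)       # parse on characters
    results[charset] = (scheme, site, path, path_segments_as_octets(path))

# Decoding before parsing invents a delimiter that was never in the URI.
_, _, early_path = split_uri(unquote("b://a/%2Fc"))
decoded_too_early = path_segments_as_octets(early_path)
```

All three entries in results are ("b", "a", "%2Fc", [b"/c"]): one path
segment whose octets happen to include 0x2F. decoded_too_early is
[b"", b"c"], two segments, because the "%2F" was turned into a "/"
before the scan.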

> I.e. what I am talking about, and I think we agree on, 

apparently not

>                                                    is that we have to
> define "characters", and we also have to agree on what octets are valid on
> various levels in the chain of parsing URIs.

Some of the levels don't operate on "octets", so that doesn't make
sense.

> I see that we have four layers:
> 
> Client
>     [BIG5]
>   Maps between native charset to some known
>   which is specified in the schema definition.
>     [UNICODE]
> URI string
>     [UNICODE]
>   This is mapped into whatever the translitteration
>   string is defined to be according to the
>   _URI_SYNTAX_ document.
>     [UTF-8 encoded UNICODE]
> Translitterated string
>     [UTF-8 encoded UNICODE]
>   Here we can do some %-encoding if needed.
>     [String in "US-ASCII"]
> URI sequence of bytes

I don't understand this layering, and don't think that "UNICODE"
is appropriate at these levels.

> The processes above are described in various documents,

Then you should give references, since the processes you've described
aren't familiar to me.

>                                          and I want
> everything from the translitterated string and downwards to be described in
> a URI syntax document,

You get what you see, which is a description of the mapping at the
layer of the URI syntax, and a description of a frequent and useful
encoding of octets by sequences of characters which is common to
many URI schemes.

>                 while what is above the translitterated string
> should go in a URL/URN syntax document and various schema definition
> documents.

Not all schemes will use the same encoding.
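
One way to picture this, as a hedged sketch in Python: the mapping from
a component's characters to octets is looked up per scheme. The scheme
"x-hex" and its hex-pair encoding are entirely hypothetical, invented
here only to contrast with %XX; the function and table names are
likewise made up for illustration.

```python
from urllib.parse import unquote_to_bytes


def percent_octets(component: str) -> bytes:
    """%XX mechanism: '%2Fc' -> the two octets 0x2F 0x63."""
    return unquote_to_bytes(component)


def hex_pair_octets(component: str) -> bytes:
    """Hypothetical alternative: every octet written as two hex digits."""
    return bytes.fromhex(component)


# Hypothetical table: which character-to-octet mapping each scheme uses.
OCTET_MAPPING = {
    "b": percent_octets,        # assumed to use %XX
    "x-hex": hex_pair_octets,   # invented scheme, for contrast only
}


def component_octets(scheme: str, component: str) -> bytes:
    """Turn an already-parsed component into octets, as the scheme dictates."""
    return OCTET_MAPPING[scheme](component)


same_octets = (
    component_octets("b", "%2Fc"),
    component_octets("x-hex", "2f63"),
)
```

Both entries of same_octets are b"/c": two different character
sequences, under two different schemes, name the same octet sequence.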

> When _I_ talk about characters, I talk about characters in the URI string,
> while the URI syntax document when talking about the fragment delimiter '#'
> as being forbidden in a URI, talks about the "Translitterated string". I.e.
> semantics for schemes are on the URI string, while syntax and semantics for
> URIs are on the tranlitterated string.

Patrik: a "character" is an abstract concept, as is an "octet". You're
free to talk about characters in the URI string, but we have to talk
about characters in multiple contexts. Given how difficult it has been
to arrive at the current terminology and framework, I don't want to
upset the rough consensus of the expert community in order to fit into
your way of conceptualizing this relationship. So: I don't accept your
proposal that this section be reworded to match your conceptualization.
If what's there isn't CLEAR, then I can try to improve it; if there's
some incompatibility with some other documents, we will have to resolve
that incompatibility, but if it's just that YOU think about it in a
different way, I hope you can find a way to see the world from a different
perspective.

Regards,

Larry
-- 
http://www.parc.xerox.com/masinter