Re: draft-fielding-uri-rfc2396bis-03 (section 2) from Martin Duerst on 2003-06-27 (uri@w3.org from June 2003)

From: Martin Duerst <duerst@w3.org>
Date: Fri, 27 Jun 2003 17:24:51 -0400
To: "Roy T. Fielding" <fielding@apache.org>, uri@w3.org
Message-Id: <4.2.0.58.J.20030620142608.025a4b48@localhost>
Roy's mail is up again, so here I'm sending the last part of
my comments, on section 2. I thought it might be easier
to do these comments directly on the text.


second paragraph:
At 15:34 03/06/27 -0400, you wrote:
>2. Characters
>
>    A URI consists of a restricted set of characters, primarily chosen
>    to aid transcription and usability both in computer systems and in
>    non-computer communications.  Characters used conventionally as
>    delimiters around a URI are excluded.  The set of URI characters
>    consists of digits, letters, and a few graphic symbols chosen from
>    those common to most of the character encodings and input facilities
>    available to Internet users.
>
>       uric        = reserved / unreserved / escaped

This syntax rule and the term 'URI characters' do not really
resolve the ambiguity whether 'URI characters' refers to the
characters used for the surface representation of URIs
(i.e. ALPHA, DIGITS, marks, reserved, and '%'), or to the
octets (sometimes inappropriately called characters) that
the URI encodes. Also, the 'uric' production is not used a
single time in the whole draft. Is it needed?


>    Within a URI, reserved characters are used to delimit syntax
>    components, unreserved characters are used to describe registered
>    names, and unreserved, non-delimiting reserved,

"non-delimiting reserved" is a totally new term, not explained
anywhere. Please remove or explain.


>and escaped
>    characters

'escaped characters': What's that? Is '%9B' an 'escaped character'?
Or is the octet (not character!) that's behind '%9B' an 'escaped character'?
Where I have no problem with the 'escaped' production, I don't
think the term 'escaped character' is appropriate either way.
'%9B' is three characters, and the thing behind it is not a
character at all, it is an octet.


>are used to represent strings of data (1*OCTET) within the
>    components.

I would suggest that we decide on a clear and correct terminology
here, and then apply it through out the section/document.



>2.1 Encoding of Characters
>
>    As described above (Section 1.3), the URI syntax is defined in terms
>    of characters by reference to the US-ASCII encoding of characters to
>    octets.

See my comments to section 1. Also, it should make clear that this
is the US-ASCII encoding as used in the ABNF, e.g. the fact that
"%x30-34" is standing for the characters with US-ASCII hexadecimal
30-34, i.e. characters '0'-'4', which can be encoded in UTF-16
(Java), EBCDIC, paper, braille dots, or whatever.


>This specification does not mandate the use of any
>    particular mapping between its character set

Please replace all occurrences of 'character set'
with 'set of characters' or 'repertoire of characters'. Many people,
up to this day, mistakenly associate encoding with the term
'character set'.  For details, see
http://www.w3.org/MarkUp/html-spec/charset-harmful.


>and the octets used to
>    store or transmit those characters.

Please use an example here (see above), and mention the fact that
that you don't need octets to store or transmit the characters.


>    URI characters representing strings of data

'strings of characters' is often used, but 'strings of data'
sounds a bit strange. Maybe just 'data'?


>within a component may,
>    if allowed by the component production, represent an arbitrary
>    sequence of octets.  For example, portions of a given URI might
>    correspond to a filename on a non-ASCII file system, a query on
>    non-ASCII data, numeric coordinates on a map, etc.  Some URI schemes
>    define a specific encoding of raw data to US-ASCII characters as part
>    of their scheme-specific requirements.

I would remove the last sentence, because it implies that schemes
can change the correspondence of octets and URI syntax, which I
think is a bad idea to describe things.


>Most URI schemes represent
>    data octets by the US-ASCII character corresponding to that octet,
>    either directly in the form of the character's glyph or by use of an
>    escape triplet (Section 2.4).

- 'Most URI schemes': Are there any than don't?
- What US-ASCII character corresponds to octet 0xBC?
- Why do we need to say 'the character's glyph' here? It doesn't have
   to appear visually, it just has to be the specific character.

I would propose the following text:

URIs represent data octets by the corresponding character in
US-ASCII if such a character is available and is not reserved
for syntactic purposes, and with the corresponding escape triplet
otherwise. For example, the octet <41> (hexadecimal) is represented
by the character 'A', the octet <BC> (hexadecimal) is represented
by the escape triplet '%BC'.

URI schemes may define additional steps (e.g. the use of Base64)
before the step described above for the encoding of data into URIs.



>    When a URI scheme defines a component that represents textual data
>    consisting of characters from the Unicode (ISO 10646) character set,
>    we recommend that the data be encoded first as octets according to
>    the UTF-8 [UTF-8] character encoding, and then escaping only those
>    octets that are not in the unreserved character set.

This is very well intended, and good, but can be improved. First,
'When' seems to imply that this is sometimes the case. But actually,
it's the norm, with very few exceptions. Second, 'from the Unicode
(ISO 10646) character set' may give the impression that this only
applies when Unicode/ISO 10646 is actually used. So I suggest:

Most components in URI schemes represent textual data. For URI schemes
defining a component that represents textual data, we recommend that
the data be encoded first as a sequence of octets according to
the UTF-8 [UTF-8] character encoding, and then escaping according
to the rules above. So for example the character 'A' would be represented
as 'A', the character LATIN CAPITAL LETTER A WITH GRAVE would be
represented as %C3%80, and the character KATAKANA LETTER A would
be represented as %E3%82%A2.

Some URI schemes do not specify which character encoding should
be used for exposing textual data in URIs. In this case, we also
recommend that data be exposed first as a sequence of octets according
to the UTF-8 [UTF-8] character encoding, and then escaping according
to the rules above.

[The first paragraph above covers schemes such as imap: and pop:
and urn:s, the second paragraph covers e.g. http.]


Section 2.1 in RFC 2396 contained more material, in particular also
the following small diagram:

    URI character sequence->octet sequence->original character sequence

In discussion on the W3C TAG list, I proposed to clarify this area
of the specification with a more extended diagram. Why have neither
of these diagrams been used?


>2.2 Reserved Characters
>
>    URIs include components and sub-components that are delimited by
>    certain special characters.  These characters are called "reserved",
>    since their usage within a URI component is limited to their reserved
>    purpose within that component.  If data for a URI component would
>    conflict with the reserved purpose, then the conflicting data must be
>    escaped (Section 2.4) before forming the URI.
>
>       reserved    = "/" / "?" / "#" / "[" / "]" / ";" /
>                     ":" / "@" / "&" / "=" / "+" / "$" / ","
>
>    Reserved characters are used as delimiters of the generic URI
>    components described in Section 3, as well as within those components
>    for delimiting sub-components.  A component's ABNF syntax rule will
>    not use the "reserved" production directly; instead, each rule lists
>    those reserved characters that are allowed within that component.
>    Allowed reserved characters that are not assigned a sub-component
>    delimiter role by this specification should be considered reserved
>    for special use by whatever software generates the URI (i.e., they
>    may be used to delimit or indicate information that is significant to
>    interpretation of the identifier, but that significance is outside
>    the scope of this specification).

This sentence is way unclear. An example may help.
Does it mean that in a 'path' component, e.g. the '@' character
can be used, but not as a simple part of the path component, but
only in an (undefined) special delimiter role?


>Outside of the URI's origin, a
>    reserved character cannot be escaped without fear of changing how it
>    will be interpreted; likewise, an escaped octet that corresponds to a
>    reserved character cannot be unescaped outside the software that is
>    responsible for interpreting it during URI resolution.

URIs are often escaped overall when being put into another URI as
a parameter. Does the text above mean that this is forbidden?
Again, examples might help.


>    The slash ("/"), question-mark ("?"), and number-sign ("#")
>    characters are reserved in all URIs for the purpose of delimiting
>    components that are significant to the generic parser's hierarchical
>    interpretation of an identifier.  The hierarchical prefix

There are no slashes in the scheme part. So it would be better to
use the term 'hierarchical part' (hier-part) or path.


>of a URI,
>    wherein the slash ("/") character signifies a hierarchy delimiter,
>    extends from the scheme (Section 3.1)

extends from the scheme -> starts after the scheme (if present)



>through to the first
>    question-mark ("?"), number-sign ("#"), or the end of the URI string.
>    In other words, the slash ("/") character is not treated as a
>    hierarchical separator within the query (Section 3.4) and fragment
>    (Section 3.5) components of a URI, but is still considered reserved
>    within those components for purposes outside the scope of this
>    specification.

Again, I don't get the idea of "reserved ... for purposes outside the
scope of this specification.". What does that mean? The slash is
legal in a fragment

     fragment      = *( pchar / "/" / "?" )

but you seem to say that without some special additional permission,
I'm not allowed to use it.


>2.3 Unreserved Characters
>
>    Characters that are allowed in a URI but do not have a reserved
>    purpose are called unreserved.  These include uppercase and lowercase
>    letters, decimal digits, and a limited set of punctuation marks and
>    symbols.
>
>       unreserved  = ALPHA / DIGIT / mark
>
>       mark        = "-" / "_" / "." / "!" / "~" / "*" / "'" / "(" / ")"
>
>    Escaping unreserved characters in a URI does not change what resource
>    is identified by that URI.  However, it may change the result of a
>    URI comparison (Section 6), potentially leading to less efficient
>    actions by an application.  Therefore, unreserved characters should
>    not be escaped unless the URI is being used in a context that does
>    not allow the unescaped character to appear. URI normalization
>    processes may unescape sequences in the ranges of ALPHA (%41-%5A and
>    %61-%7A), DIGIT (%30-%39), hyphen (%2D), underscore (%5F), or tilde
>    (%7E) without fear of creating a conflict, but unescaping the other
>    mark characters is usually counterproductive.

See my comment to Tim's mail.


>2.4 Escaped Characters
>
>    Data must be escaped if it does not have a representation using an
>    unreserved character; this includes data that does not correspond to
>    a printable character of the US-ASCII coded character set or
>    corresponds to a US-ASCII character that delimits the component from
>    others, is reserved in that component for delimiting sub-components,
>    or is excluded from any use within a URI (Section 2.5).
>
>2.4.1 Escaped Encoding
>
>    An escaped octet is encoded

This mixes before and after. Either it's already escaped, and therefore
encoded, or it's not yet escaped. Change to: To escape an octet, the
octet is encoded...


>as a character triplet, consisting of
>    the percent character "%" followed by the two hexadecimal digits
>    representing that octet's numeric value.  For example, "%20" is the
>    escaped encoding for the binary octet "00100000" (ABNF: %x20)

Don't use ABNF here, because ABNF %x20 is a 'space', independent
of it's encoding.


>, which
>    corresponds to the US-ASCII space character (SP).

There is no special US-ASCIIness in the space character, nor is there
a space character in US-ASCII that's different from space characters
in different encodings. So 'corresponds to the US-ASCII space
character (SP)' -> 'in US-ASCII corresponds to the space character (SP)'.



>This is sometimes
>    referred to as "percent-encoding" the octet.

The term "%-escaping" is probably used more than "percent-encoding".


>       escaped     = "%" HEXDIG HEXDIG
>
>    The uppercase hexadecimal digits 'A' through 'F' are equivalent to
>    the lowercase digits 'a' through 'f', respectively.  Two URIs that
>    differ only in the case of hexadecimal digits used in escaped octets
>    are equivalent.  For consistency, we recommend that uppercase digits
>    be used by URI generators and normalizers.
>
>2.4.2 When to Escape and Unescape

It seems that quite a bit of material that's now in 2.1 should go here,
or the other way round.


>    Under normal circumstances, the only time that characters within a
>    URI string are escaped is during the process of generating the URI
>    from its component parts.  Each component may have its own set of
>    characters that are reserved, so only the mechanism responsible for
>    generating or interpreting that component can determine whether or
>    not escaping a character will change its semantics.  The exception is
>    when a URI is being used within a context where the unreserved "mark"
>    characters might need to be escaped, such as when used for a
>    command-line argument or within a single-quoted attribute.

While it may not always be avoidable, it is in general a bad idea
to use URI escaping for such a purpose. We should clearly state that
whenever escaping is available in the context the URI appears, the
escaping mechanism of the context rather than URI escaping should be
used. As an example, the URI http://example.org/O'hare, when put
into a single-quoted attribute, should be
<foo bar='http://example.org/O&#x27;hare'> or
<foo bar='http://example.org/O&apos;hare'>, but not
<foo bar='http://example.org/O%27hare'>.


>    Once generated, a URI is always in an escaped form.  When a URI is
>    resolved, the components significant to that

'that' -> 'the'


>scheme-specific
>    resolution process (if any) must be parsed and separated before the
>    escaped characters within those components can be safely unescaped.
>
>    In some cases, data that could be represented by an unreserved
>    character may appear escaped; for example, some of the unreserved
>    "mark" characters are automatically escaped by some systems.  A URI
>    normalizer

What is an URI normalizer? When does it get used?



>may unescape escaped octets that are represented by
>    characters in the unreserved set.  For example, "%7E" is sometimes
>    used instead of tilde ("~") in an "http" URI path and can be
>    converted to "~" without changing the interpretation of the URI.
>
>    In all cases, a URI character is equivalent to its corresponding
>    ASCII-encoded octet, even when that octet is represented as a
>    percent-escape. URI characters are provided as an external ASCII
>    interface for identification between systems.

I would remove the 'ASCII' from this sentence, and say:

"URIs are provided as an external interface for identification between 
systems."



>A system that
>    internally provides identifiers in the form of a different character
>    encoding, such as EBCDIC, will generally perform character
>    translation of textual identifiers to UTF-8 at some internal
>    interface, thus providing meaningful identifiers in ASCII even though
>    the back-end identifiers are in a different encoding.  Escaped octets
>    must be unescaped before such a transcoding is applied.

This is well intended, but a mess. The translation in the sentence
above is from EBCDIC to UTF-8, and unescaping before that makes no
sense. Don't mix directions.

There should be another sentence saying something like:

A system that uses an ASCII-based encoding different from UTF-8
should also ...



>Although
>    this specification does not define the character encoding of escaped
>    octets outside the ASCII range, the general principle of unescaping
>    before transcoding should be applied for all character encodings.
>
>    Because the percent ("%") character serves as the escape indicator,
>    it must be escaped as "%25" in order for that octet to be used as
>    data within a URI.  Implementers should be careful not to escape or
>    unescape the same string more than once, since unescaping an already
>    unescaped string might lead to misinterpreting a percent data
>    character as another escaped character, or vice versa in the case of
>    escaping an already escaped string.
>
>2.5 Excluded Characters
>
>    Although they are disallowed within the URI syntax, we include here
>    a description of those characters that have been excluded and the
>    reasons for their exclusion.
>
>       excluded    = invisible / delims / unwise
>
>    The control characters (CTL) in the US-ASCII coded character set are
>    not used within a URI, both because they are non-printable and
>    because they are likely to be misinterpreted by some control
>    mechanisms. The space character (SP) is excluded because significant
>    spaces may disappear and insignificant spaces may be introduced when
>    a URI is transcribed, typeset, or subjected to the treatment of
>    word-processing programs.  Whitespace is also used to delimit a URI
>    in many contexts. Characters outside the US-ASCII set are excluded as
>    well.

To use the term 'characters' here is okay, because any other characters
are indeed excluded. But together with %x80-FF in the rule below,
this suggests that there are some characters corresponding to these
octets. As the correspondence is established (by ABNF) using US-ASCII,
there obviously can't be any characters there. So I suggest replacing
%x80-FF with <characters outside US-ASCII> or some such.



>       invisible   = CTL / SP / %x80-FF
>
>    The angle-bracket ("<" and ">") and double-quote (") characters are
>    excluded because they are often used as the delimiters around a URI
>    in text documents and protocol fields.  The percent character ("%")
>    is excluded because it is used for the encoding of escaped (Section
>    2.4) characters.
>
>       delims      = "<" / ">" / "%" / DQUOTE
>
>    Other characters are excluded because gateways and other transport
>    agents are known to sometimes modify such characters.

'are known' -> 'have been known'. There may still be such things
around, but they are more and more a thing of the past.



>       unwise      = "{" / "}" / "|" / "\" / "^" / "`"
>
>    Data octets corresponding to excluded characters must be escaped in
>    order to be represented within a URI.



This concludes my review.



Regards,    Martin.



At 19:07 03/06/06 -0700, Roy T. Fielding wrote:

>I have just submitted draft 03, which can also be obtained via
>the issues list at
>
>    http://www.apache.org/~fielding/uri/rev-2002/issues.html
>
>Please note that all issues have been fixed or closed.  If you'd
>like to raise a new issue or reopen an old one, please do so
>within the next two weeks.
>
>
>Cheers,
>
>Roy T. Fielding, Chief Scientist, Day Software
>                  2 Corporate Plaza, Suite 150
>                  Newport Beach, CA 92660-7929   fax:+1.949.644.5064
>                  (roy.fielding@day.com) <http://www.day.com/>
>
>                  Co-founder, The Apache Software Foundation
>                  (fielding@apache.org)  <http://www.apache.org/>
Received on Friday, 27 June 2003 17:24:57 UTC