URIEquivalence-15: characters in RFC 2396 (was: Re: [Minutes] 27 Jan 2003 TAG teleconf (..., IRIEverywhere-27, ...)) from Martin Duerst on 2003-02-03 (www-international@w3.org from January to March 2003)

From: Martin Duerst <duerst@w3.org>
Date: Mon, 03 Feb 2003 14:13:21 -0500
To: "Ian B. Jacobs" <ij@w3.org>, www-tag@w3.org
Cc: www-international@w3.org, Michel Suignard <michelsu@microsoft.com>
Message-Id: <4.2.0.58.J.20030128143012.06f3d628@localhost>
Dear TAG, others,

Please allow me a few comments (and hopefully clarifications)
on one of the issues below. This mail will mostly concentrate
on one specific issue, namely what RFC 2396 says about
'characters'. Other material in separate mails.

At 20:20 03/01/27 -0500, Ian B. Jacobs wrote:

>Hello,
>
>Minutes of the 27 Jan 2003 TAG teleconf available as
>HTML [1] and as text below.

>   2.3 IRIEverywhere-27

While IRIEverywhere-27 is related to URIEquivalence-15,
and URIEquivalence-15 is of high importance to IRIs,
most of the discussion below is actually about URIEquivalence-15.


>     1. [25]IRIEverywhere-27
>          1. Action MD and CL 2002/11/18: Write up text about
>             IRIEverywhere-27 for spec writers to include in their spec.

I sent some text to Chris quite a while ago. For the benefit of everybody,
I'll send this as a separate mail.


>          2. Action CL 2002/11/18: Write up finding for IRIEverywhere-27
>             (from TB and TBL, a/b/c), to include MD's text. Process with
>             IJ; awaiting comments from MD
>
>      [25] http://www.w3.org/2001/tag/ilist#IRIEverywhere-27
>
>    [Ian]
>
>           CL: IJ and I discussed this last week. We drew up some text.:
>           Text based on input from Martin Duerst.
>           DC: Value of treating e and E as equivalent is a huge cost.
>           What actual value do you get?
>           CL: Actual value is that both are equivalent to the same
>           character.
>           TB: Lots of software is already treating %7e and %7E as the
>           same.
>
>    [Chris]
>           that they are equivalent to the *actual character* represented
>           uri spec is way fuzzy on this
>           actual practice is that they are the same
>
>    [Ian]
>           TB: Per 2396, I think Web robots are in their rights; lots of
>           Web robots do this.
>           RF: Yes, that's my understanding.
>           PC: Chris, could you explain impact?
>
>    [DanCon]
>           I think it's straightforward to read the URI spec

I think in general, there is too much 'trying to read the
coffee grounds (aka spec)' in what follows. While this is
certainly very important (and I'll do some more of it below),
actual practice should also be considered carefully.




>as saying
>           that [26]http://a/%7E and [27]http://a/%7e are distinct URIs,
>           and may or may not refer to the same resource.
>           roy, you think otherwise?
>
>      [26] http://a/%7E
>      [27] http://a/%7e
>
>    [Roy]
>           I think otherwise
>
>    [Stuart]
>           I read it the same as Dan :-(
>
>    [Ian]
>           CL: There is a bigger effect on IRI spec and suggestions for
>           RFC2396.
>
>    [DanCon]
>           sigh; each of those URIs is a sequence of 12 characters. they
>           differ in their 12th character. hence they're different URIs.
>           RFC2396 says otherwise?
>
>    [Chris]
>           this has more effect on IRI comparison (which is done by
>           transformation to URI and then comparing)
>
>    [Ian]
>           Action CL: Please propose text IJ and CL worked on to www-tag
>           (flipping the ACL).
>
>    [Roy]
>           %7e is one character -- three octets
>
>    [Chris]
>           it means that the *actual kanji* and the sequence of hexifyied
>           octets compare to the same
>           which helps in roundtripping a very great deal
>
>    [Roy]
>           oops one octet -- char[$1\47]
>
>    [DanCon]
>           "%7e" is *one* character???
>
>    [Roy]
>           "character" is defined in spec
>
>    [TBray]
>           was ignoring IRC... yes, lots of software will decide those two
>           URIs are the same in their cache
>
>    [Chris]
>           no, %7e is one octet

Okay, Dan says '%7e' is three characters, Roy says it's one character
(or rather, each says that RFC 2396 says so). Let's see:

In '1.1 Overview of URI'
   "Identifier
          An identifier is an object that can act as a reference to
          something that has identity.  In the case of URI, the object is
          a sequence of characters with a restricted syntax."

   That text seems to leave possibilities wide open.

In '1.5. URI Transcribability', we have:
   "A URI is a sequence of characters from a very limited set, i.e. the letters
   of the basic Latin alphabet, digits, and a few special characters. A URI may
   be represented in a variety of ways: e.g., ink on paper, pixels on a screen,
   or a sequence of octets in a coded character set. The interpretation of a
   URI depends only on the characters used and not how those characters are
   represented in a network protocol."

   Although this is not totally conclusive, Dan's reading seems to match
   here quite a bit better than Roy's.

From the same section, a bit later:
   "*  A URI may be transcribed from a non-network source, and thus should
       consist of characters that are most likely to be able to be typed into a
       computer, within the constraints imposed by keyboards (and related input
       devices) across languages and locales."

   Again, although not completely conclusive, this text seems to work much
   better if we can say that the user types '%7e' as three characters,
   rather than as one. (not that keyboards can't be configured to produce
   '%7e' with a single keystroke, but that's very rarely done :-)

Again from the same section, a bit later:
   "For example, it is often the case that the most meaningful name for a URI
    component would require characters that cannot be typed into some systems."

   This refers to characters that might be written as e.g. %FC or
   %C3%BC, but by saying 'would require', it seems to imply that such
   characters are not the 'characters' that RFC 2396 talks about.

From '1.6. Syntax Notation and Common Elements':
   "Unlike many specifications that use a BNF-like grammar to define the
   bytes (octets) allowed by a protocol, the URI grammar is defined in
   terms of characters.  Each literal in the grammar corresponds to the
   character it represents, rather than to the octet encoding of that
   character in any particular coded character set."

   This, together with text from '2.4.1. Escaped Encoding':

       An escaped octet is encoded as a character triplet, consisting of the
       percent character "%" followed by the two hexadecimal digits
       representing the octet code. For example, "%20" is the escaped
       encoding for the US-ASCII space character.

          escaped     = "%" hex hex
          hex         = digit | "A" | "B" | "C" | "D" | "E" | "F" |
                                "a" | "b" | "c" | "d" | "e" | "f"

   clearly makes it valid to talk about "%" (as well as "7" and "e")
   as characters.
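
   A minimal sketch of this reading, assuming Python and its standard 're'
   module: the 'escaped' rule is written as a regular expression, and the
   match for '%7e' is an ordinary string of three characters. The names
   ESCAPED and escaped_triplets are made up for this illustration and do
   not come from RFC 2396.

```python
import re

# The 'escaped' rule from RFC 2396, section 2.4.1: "%" hex hex.
# Both upper-case and lower-case hex digits are allowed by the grammar.
ESCAPED = re.compile(r"%[0-9A-Fa-f]{2}")


def escaped_triplets(uri):
    """Return every escaped triplet found in the given URI string."""
    return ESCAPED.findall(uri)


triplet = escaped_triplets("http://a/%7e")[0]
print(triplet)        # %7e
print(len(triplet))   # 3
print(list(triplet))  # ['%', '7', 'e']
```

   Seen this way, each instance of 'escaped' is built from three literals
   of the grammar, and each of those literals is a character.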

From "2. URI Characters and Escape Sequences":

     "URI consist of a restricted set of characters, primarily chosen to
      aid transcribability and usability both in computer systems and in
      non-computer communications. Characters used conventionally as
      delimiters around URI were excluded.  The restricted set of
      characters consists of digits, letters, and a few graphic symbols
      were chosen from those common to most of the character encodings and
      input facilities available to Internet users.

         uric          = reserved | unreserved | escaped"

   The heading lets us expect a definition of URI characters. If we
   interpret 'uric' to stand for 'URI Character', then we have a
   very clear definition that agrees with Roy. But the text never
   explicitly says that 'uric' is 'URI Character'.

Just immediately following:

    "Within a URI, characters are either used as delimiters, or to
     represent strings of data (octets) within the delimited portions.
     Octets are either represented directly by a character (using the US-
     ASCII character for that octet [ASCII]) or by an escape encoding."

   This explains that all characters in a URI are used to represent
   octets. 'a' represents the octet <61>, '~' represents the octet
   <7e>, '%88' represents the octet <88>, and so on.

   In this sense, one could even say that Roy got it backwards:
   it would be possible to say
      "%7e is one octet -- three characters"
   or at least
      "%7e represents one octet with three characters"
   rather than
      "%7e is one character -- three octets"
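
   The "characters represent octets" reading can be sketched as follows,
   again assuming Python. The function name represented_octets is
   hypothetical, and treating a '%' that is not followed by two hex digits
   as a literal character is a simplification of this sketch, not something
   RFC 2396 specifies.

```python
import re

# An escaped triplet: "%" followed by two hexadecimal digits.
ESCAPED = re.compile(r"%([0-9A-Fa-f]{2})")


def represented_octets(uri):
    """Map the characters of a URI string to the octets they represent."""
    out = bytearray()
    i = 0
    while i < len(uri):
        m = ESCAPED.match(uri, i)
        if m:
            # Three characters ("%" hex hex) represent a single octet.
            out.append(int(m.group(1), 16))
            i = m.end()
        else:
            # One character represents its US-ASCII octet directly.
            out += uri[i].encode("ascii")
            i += 1
    return bytes(out)


a, b = "http://a/%7E", "http://a/%7e"
print(len("%7e"), len(represented_octets("%7e")))     # 3 1
print(len(a), a == b)                                 # 12 False
print(represented_octets(a) == represented_octets(b)) # True
```

   As character sequences the two URIs from the minutes differ (in their
   12th character), while the octets they represent are identical. Which of
   the two comparisons defines URI equivalence is exactly the open question.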


So overall, my conclusion on the question of whether RFC 2396
would talk about '%7e' as one character (Roy) or three (Dan)
is that for the most part, three is much more plausible.
RFC 2396 defines '%7e' to be one instance of the syntax rule
'uric', but it doesn't explicitly say that 'uric' is a character.


Regards,    Martin.
Received on Monday, 3 February 2003 14:13:52 UTC