URIEquivalence-15: equivalence of %7e/%7E/~ (was:Re: [Minutes] 27 Jan 2003 TAG teleconf (... IRIEverywhere-27,...)) from Martin Duerst on 2003-02-03 (www-tag@w3.org from February 2003)

From: Martin Duerst <duerst@w3.org>
Date: Mon, 03 Feb 2003 14:08:16 -0500
To: "Ian B. Jacobs" <ij@w3.org>, www-tag@w3.org
Cc: Michel Suignard <michelsu@microsoft.com>, www-international@w3.org
Message-Id: <4.2.0.58.J.20030203133714.03ff2c60@localhost>
At 20:20 03/01/27 -0500, Ian B. Jacobs wrote:

>Hello,
>
>Minutes of the 27 Jan 2003 TAG teleconf available as
>HTML [1] and as text below.

>   2.3 IRIEverywhere-27

>      [25] http://www.w3.org/2001/tag/ilist#IRIEverywhere-27
>
>    [Ian]

>           TB: Lots of software is already treating %7e and %7E as the
>           same.
>
>    [Chris]
>           that they are equivalent to the *actual character* represented
>           uri spec is way fuzzy on this
>           actual practice is that they are the same
>
>    [Ian]
>           TB: Per 2396, I think Web robots are in their rights; lots of
>           Web robots do this.
>           RF: Yes, that's my understanding.
>           PC: Chris, could you explain impact?
>
>    [DanCon]
>           I think it's straightforward to read the URI spec as saying
>           that [26]http://a/%7E and [27]http://a/%7e are distinct URIs,
>           and may or may not refer to the same resource.
>           roy, you think otherwise?
>
>      [26] http://a/%7E
>      [27] http://a/%7e
>
>    [Roy]
>           I think otherwise
>
>    [Stuart]
>           I read it the same as Dan :-(

>    [DanCon]
>           sigh; each of those URIs is a sequence of 12 characters. they
>           differ in their 12th character. hence they're different URIs.
>           RFC2396 says otherwise?

>    [Chris]
>           it means that the *actual kanji* and the sequence of hexifyied
>           octets compare to the same
>           which helps in roundtripping a very great deal

>    [TBray]
>           was ignoring IRC... yes, lots of software will decide those two
>           URIs are the same in their cache
>
>    [Chris]
>           no, %7e is one octet

>    [Zakim]
>    DanCon, you wanted to suggest the value of having %7E specified to be
>           equivalent to %7e is purely aesthetic, and not *nearly* worth
>           the cost.



So let's look at RFC 2396 on this:

First, we find in '1.1 Overview of URI'

      URI are characterized by the following definitions:

       Uniform
          Uniformity provides several benefits: it allows different types
          of resource identifiers to be used in the same context, even
          when the mechanisms used to access those resources may differ;
          it allows uniform semantic interpretation of common syntactic
          conventions across different types of resource identifiers;

   The %-escape convention can definitely be seen as a 'common syntactic
   convention'.


Next, "2.1 URI and non-ASCII characters" says:

    A URI scheme may define a mapping from URI characters to octets;
    whether this is done depends on the scheme. Commonly, within a
    delimited component of a URI, a sequence of characters may be used to
    represent a sequence of octets. For example, the character "a"
    represents the octet 97 (decimal), while the character sequence "%",
    "0", "a" represents the octet 10 (decimal).

  This is rather confusing. It says 'a scheme may define', but then
  goes on to state flatly that 'the character "a" represents the octet
  97 (decimal)'. There are several ways to read this. One is that a scheme
  can define additional (or alternative??) ways to represent octets; base64
  as used in the data: URI scheme would then be an example. I have requested
  that an issue be opened for rfc2396bis for this
  (see http://lists.w3.org/Archives/Public/uri/2003Jan/0025.html).
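
  To make the quoted example concrete, here is a minimal Python sketch of
  the character-to-octet mapping under the reading where the mapping is
  generic rather than scheme-defined. That reading is an assumption, as
  discussed above, and the function name uri_chars_to_octets is
  hypothetical, not something taken from RFC 2396.

```python
import string

HEX_DIGITS = set(string.hexdigits)


def uri_chars_to_octets(component: str) -> bytes:
    """Map URI characters to octets (assumed generic convention)."""
    out = bytearray()
    i = 0
    while i < len(component):
        if component[i] == "%":
            # "%" followed by two hex digits stands for one octet.
            digits = component[i + 1:i + 3]
            if len(digits) != 2 or not set(digits) <= HEX_DIGITS:
                raise ValueError(f"malformed escape at position {i}")
            out.append(int(digits, 16))
            i += 3
        else:
            # A literal character stands for its US-ASCII octet.
            out += component[i].encode("ascii")
            i += 1
    return bytes(out)


print(list(uri_chars_to_octets("a")))    # [97]
print(list(uri_chars_to_octets("%0a")))  # [10]
```

  Under this mapping "%7e", "%7E" and "~" all yield the same single octet
  (126 decimal); whether that makes the URIs equivalent is exactly the
  open question.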

More from this section:

    In the simplest case, the original character sequence contains only
    characters that are defined in US-ASCII, and the two levels of
    mapping are simple and easily invertible: each 'original character'
    is represented as the octet for the US-ASCII code for it, which is,
    in turn, represented as either the US-ASCII character, or else the
    "%" escape sequence for that octet.

  It is unclear whether the 'either/or' in the last sentence means
  'you can always choose either one', or whether it means 'you use the
  US-ASCII character if that's available, and otherwise escape'.
  'You can always choose either one' would definitely apply only to
  unreserved characters. For reserved characters,
  the escaped and unescaped forms have different meanings.
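
  As a rough illustration of that restriction, the following Python
  sketch unescapes only those escapes that decode to an unreserved
  character and leaves every other escape untouched. The unreserved set
  follows my reading of RFC 2396 (alphanumerics plus the "mark"
  characters); the function name unescape_unreserved is hypothetical.

```python
import re
import string

# RFC 2396 "mark" characters, as I read section 2.3.
MARK = "-_.!~*'()"
UNRESERVED = set(string.ascii_letters + string.digits + MARK)

ESCAPE = re.compile(r"%([0-9A-Fa-f]{2})")


def unescape_unreserved(component: str) -> str:
    """Decode %HH only when it stands for an unreserved character."""
    def replace(match: re.Match) -> str:
        ch = chr(int(match.group(1), 16))
        # Reserved and other characters keep their escaped form.
        return ch if ch in UNRESERVED else match.group(0)

    return ESCAPE.sub(replace, component)


print(unescape_unreserved("%7Euser"))  # ~user
print(unescape_unreserved("a%2Fb"))    # a%2Fb
```

  The second call leaves "%2F" alone, because "/" is reserved and the
  escaped and unescaped forms would mean different things.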

From "2.4.2. When to Escape and Unescape" we have:

    A URI is always in an "escaped" form, since escaping or unescaping a
    completed URI might change its semantics.  Normally, the only time
    escape encodings can safely be made is when the URI is being created
    from its component parts; each component may have its own set of
    characters that are reserved, so only the mechanism responsible for
    generating or interpreting that component can determine whether or
    not escaping a character will change its semantics. Likewise, a URI
    must be separated into its components before the escaped characters
    within those components can be safely decoded.

   This seems to say that escaping or unescaping can only be done
   with detailed knowledge of the scheme. However, this could be
   read in two ways: 1) it is unclear exactly which characters from
   the 'reserved' category will actually be reserved somewhere, so
   these considerations apply only to those characters, and not
   to the unreserved ones; or 2) this applies to all characters,
   because despite the existence of a 'reserved' character category,
   *any* character could potentially be reserved in a new scheme.
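
   The last sentence of the quoted RFC text (separate into components
   before decoding) can be illustrated with a small Python sketch. The
   path "a%2Fb/c" is a made-up example, and treating "/" as the only
   delimiter is a simplifying assumption.

```python
from urllib.parse import unquote

# Hypothetical path with an escaped "/" inside the first segment.
path = "a%2Fb/c"

# Split into segments first, then decode each segment.
split_then_decode = [unquote(seg) for seg in path.split("/")]

# Decode the whole path first, then split.
decode_then_split = unquote(path).split("/")

print(split_then_decode)  # ['a/b', 'c']
print(decode_then_split)  # ['a', 'b', 'c']
```

   Decoding first turns the escaped "/" into a delimiter and changes the
   number of segments, which is the change in semantics the RFC warns
   about.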

From the same section:

    In some cases, data that could be represented by an unreserved
    character may appear escaped; for example, some of the unreserved
    "mark" characters are automatically escaped by some systems.  If the
    given URI scheme defines a canonicalization algorithm, then
    unreserved characters may be unescaped according to that algorithm.
    For example, "%7e" is sometimes used instead of "~" in an http URL
    path, but the two are equivalent for an http URL.

  This says 'treat "%7e" and "~" as equivalent only if you know about
  the scheme'. This seems to point pretty strongly in one direction,
  namely that this equivalence is scheme-specific rather than generic.


So my conclusion is that RFC 2396 contains very little text about
equivalence between "%7e" and "%7E", and is not very conclusive
about the equivalence between "%7e"/"%7E" and "~" either.
The strongest argument for these equivalences is probably that
the escape syntax appears in most parts of the generic URI syntax
and also in the opaque syntax, and that (with the exception of reserved
characters) the spec says that all three represent the same octet.
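
To make the candidate notions of equivalence explicit, here is a Python
sketch of three comparison functions applied to the URIs from the
minutes. The function names are hypothetical, and I am not claiming that
RFC 2396 mandates any of the three; they only spell out the readings
discussed above.

```python
import re
import string

UNRESERVED = set(string.ascii_letters + string.digits + "-_.!~*'()")
ESCAPE = re.compile(r"%([0-9A-Fa-f]{2})")


def same_string(a: str, b: str) -> bool:
    """Character-by-character comparison."""
    return a == b


def _upper_hex(uri: str) -> str:
    # Uppercase only the hex digits of escapes, nothing else.
    return ESCAPE.sub(lambda m: m.group(0).upper(), uri)


def same_ignoring_hex_case(a: str, b: str) -> bool:
    """Treat %7e and %7E as the same escape."""
    return _upper_hex(a) == _upper_hex(b)


def _unescape_unreserved(uri: str) -> str:
    def replace(m: re.Match) -> str:
        ch = chr(int(m.group(1), 16))
        return ch if ch in UNRESERVED else m.group(0).upper()
    return ESCAPE.sub(replace, uri)


def same_after_unreserved_unescape(a: str, b: str) -> bool:
    """Additionally treat an escaped unreserved character as the character."""
    return _unescape_unreserved(a) == _unescape_unreserved(b)


upper, lower, tilde = "http://a/%7E", "http://a/%7e", "http://a/~"
print(same_string(upper, lower))                     # False
print(same_ignoring_hex_case(upper, lower))          # True
print(same_ignoring_hex_case(upper, tilde))          # False
print(same_after_unreserved_unescape(upper, tilde))  # True
```

The first function corresponds to the "sequence of 12 characters" reading
from the minutes; the other two correspond to the equivalences for which
RFC 2396 gives only weak or scheme-dependent support.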


Regards,    Martin.
Received on Monday, 3 February 2003 14:13:05 UTC