RE: proposed text on IRIEverywhere-27 from Martin Duerst on 2003-02-04 (www-international@w3.org from January to March 2003)

From: Martin Duerst <duerst@w3.org>
Date: Tue, 04 Feb 2003 18:20:08 -0500
To: "Williams, Stuart" <skw@hplb.hpl.hp.com>
Cc: Michel Suignard <michelsu@microsoft.com>, www-international@w3.org, www-tag@w3.org
Message-Id: <4.2.0.58.J.20030204175345.0794dc90@localhost>

Hello Stuart,

At 12:19 03/02/04 +0000, Williams, Stuart wrote:
>Hi Martin,
>
>In the 2nd comparison, if the fully escaped sequences are for comparison
>only, I'm not sure why you protected these 14 characters from being %
>escaped. Is there a reason why excluding them from the expansion is
>necessary?

Yes, there is a very clear reason. These characters are reserved.
RFC 2396, in "2.2. Reserved Characters", lists the following as
reserved:

     reserved    = ";" | "/" | "?" | ":" | "@" | "&" | "=" | "+" |
                   "$" | ","

To this, we have to add [ and ] for IPv6 literals, and # and
% which are in effect reserved, but treated differently in the
syntax. Escaping them would lead to strange results:
     http://www.example.org/path/file
is definitely NOT the same as
     http://www.example.org/path%2Ffile .

Please note that under certain interpretations of RFC 2396, or in
certain cases, this may lead to 'false negatives'. For example,
my understanding is that
    http://www.example.org?text=/
and
    http://www.example.org?text=%2F
are supposed to behave equivalently, because '/' is not reserved
in a query part. But we can't match all such cases in a general
algorithm that has to be scheme-independent.
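
The difference between the two cases can be illustrated with a small
Python sketch. It is only an illustration: it uses the standard
urllib.parse module, and the helper name path_segments is made up for
this example. It assumes that a consumer splits the path on literal '/'
before decoding, and that the query is decoded as form-style key/value
pairs, which is one possible interpretation and not something a
scheme-independent comparison can rely on.

```python
from urllib.parse import urlsplit, unquote, parse_qs

def path_segments(uri):
    """Split the path on literal '/' first, then decode each segment."""
    return [unquote(seg) for seg in urlsplit(uri).path.split("/")]

# In the path, '/' is a delimiter, so escaping it changes the structure.
a = path_segments("http://www.example.org/path/file")
b = path_segments("http://www.example.org/path%2Ffile")
print(a)  # ['', 'path', 'file']
print(b)  # ['', 'path/file']

# In the query, a form-style decoder sees the same value either way.
q1 = parse_qs(urlsplit("http://www.example.org?text=/").query)
q2 = parse_qs(urlsplit("http://www.example.org?text=%2F").query)
print(q1 == q2)  # True
```

The query pair is the kind of 'false negative' meant above: a generic
comparison that protects '/' everywhere would report these two as
different.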


> > >     - If the group is not a %-group, and if the character is
> > >       one of the following 14 characters, then use that character
> > >       directly:      % # [ ] ; / ? : @ & = + $ ,
> > >       (This will escape characters such as:
> > >          SPACE, < > " { } | \ ^ `
> > >        It is currently not clear whether these will be allowed
> > >        as parts of IRIs, but whether they get escaped or not
> > >        will not affect the result of the comparison operation
> > >        if they are not allowed and therefore don't appear in
> > >        input.)
>
>Also, is it clear that only the characters 0-9, a-f and A-F are permissible
>following a % ?

Yes. In RFC 2396, in "A. Collected BNF for URI", the only place
where '%' appears is in:

       escaped       = "%" hex hex

where hex is defined as:

       hex           = digit | "A" | "B" | "C" | "D" | "E" | "F" |
                               "a" | "b" | "c" | "d" | "e" | "f"
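
As a rough illustration of this rule, the following Python sketch
reports every '%' that is not followed by two hex digits. The function
name malformed_percents is invented for this example, and the check
covers only the 'escaped' production, not the rest of the URI grammar.

```python
import re

# escaped = "%" hex hex, with hex = 0-9, A-F, a-f (RFC 2396, Appendix A)
ESCAPED = re.compile(r"%[0-9A-Fa-f]{2}")

def malformed_percents(s):
    """Return the index of every '%' not followed by two hex digits."""
    return [m.start() for m in re.finditer("%", s)
            if not ESCAPED.match(s, m.start())]

print(malformed_percents("http://example.org/names%abraham"))  # []
print(malformed_percents("http://example.org/paris%louvre"))   # [24]
```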


> > >   - Separate the string into groups. A group consists of
> > >     either a '%' and the following two characters (a %-group),
> > >     or of a single character that is not part of a %-group.
>
>http://example.org/paris%louvre -> %lo is a group?

In this algorithm, yes. http://example.org/paris%louvre isn't
a legal URI/IRI. So we get 'garbage in'/'garbage out'.


>http://example.org/names%abraham -> %ab is (intended to be) a group?

Yes, of course. '%ab' is a perfectly legal escape sequence.
Human readers may find the word 'abraham' in the URI, but
the URI contains only 'octet <ab>' followed by 'raham'.
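
The grouping step, and the comparison form built from it, could be
sketched in Python roughly as follows. This is a sketch under
assumptions, not the proposed text itself: the names split_groups and
comparison_form are made up, the quoted fragments do not say what
happens to a %-group (here its hex digits are simply upper-cased), and
"fully escaped" is taken to mean that every character outside the 14
protected ones is written as the %HH escapes of its UTF-8 octets.

```python
PROTECTED = set("%#[];/?:@&=+$,")  # the 14 characters used directly

def split_groups(s):
    """Split into %-groups ('%' plus the next two characters) and single characters."""
    groups, i = [], 0
    while i < len(s):
        if s[i] == "%":
            groups.append(s[i:i + 3])  # taken as-is, no hex check
            i += 3
        else:
            groups.append(s[i])
            i += 1
    return groups

def comparison_form(s):
    """Build a fully escaped string intended for comparison only."""
    out = []
    for g in split_groups(s):
        if g[0] == "%":
            out.append(g.upper())  # assumption: normalise hex case
        elif g in PROTECTED:
            out.append(g)          # reserved: keep the character itself
        else:
            # assumption: escape every other character via its UTF-8 octets
            out.append("".join(f"%{b:02X}" for b in g.encode("utf-8")))
    return "".join(out)

print(split_groups("paris%louvre"))   # '%lo' comes out as one group
print(split_groups("names%abraham"))  # '%ab' comes out as one group
print(comparison_form("a/b"))         # %61/%62
```

With these assumptions, comparison_form gives different results for
http://www.example.org/path/file and http://www.example.org/path%2Ffile,
while "a%2fb" and "a%2Fb" compare equal.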


Regards,    Martin.

Received on Tuesday, 4 February 2003 18:39:49 UTC