Re: draft-fielding-uri-rfc2396bis-03 from Tim Berners-Lee on 2003-06-19 (uri@w3.org from June 2003)

From: Tim Berners-Lee <timbl@w3.org>
Date: Wed, 18 Jun 2003 22:40:23 -0400
To: Martin Duerst <duerst@w3.org>
Cc: "Roy T. Fielding" <fielding@apache.org>, uri@w3.org
Message-Id: <602AE7B4-A1FF-11D7-AB7B-000393914268@w3.org>
We are talking about a canonicalization.  The choice of which
are escaped is fairly arbitrary.  I agree, the list which I took from 
the current
spec is not appropriate -   "." at least!

The choice is arbitrary.  We are talking about canonicalization,
which has not been defined before.
The choice is driven by the current set of things used
in URIs vs things used as delimiters.  Xpath has increased the former
significantly.  Originally there were many characters which were
not supposed to occur free in a URI (including the typical unix command
line characters) but now we have changed style to do the quoting
in the surrounding environment. So probably erring on the side of
unescaping in the canonical form is best.

The main point is that we should define equivalence and 
canonnicalization
crisply, to give provable properties, and avoid "fear" etc.

Tim

On Wednesday, Jun 18, 2003, at 12:52 US/Eastern, Martin Duerst wrote:

> At 16:13 03/06/16 -0400, Tim Berners-Lee wrote:
>
>> 2.3 Unreserved characters para 2
>>
>> "fear of creating a conflict" and "counterproductive" are much too 
>> vague for a foundational spec.
>>
>> Replace
>>
>> "Therefore, unreserved characters should not be escaped unless the 
>> URI is being used in a context that does not allow the unescaped 
>> character to appear. URI normalization processes may unescape 
>> sequences in the ranges of ALPHA (%41-%5A and %61-%7A), DIGIT 
>> (%30-%39), hyphen (%2D), underscore (%5F), or tilde (%7E) without 
>> fear of creating a conflict, but unescaping the other mark characters 
>> is usually counterproductive."
>>
>> with:
>>
>> "URIs which differ only be the replacement of unreserved characters 
>> by the percent-encoded strings are equivalent: they identify the same 
>> resources.
>> For the purposes of canonicalisation, it is recommended that 
>> characters in ALPHA (%41-%5A and %61-%7A), DIGIT (%30-%39), hyphen 
>> (%2D), underscore (%5F), or tilde (%7E) be unescaped, but other 
>> characters in the mark set be escaped."
>>
>>
>> ___________________________________ That's all for now 
>> _________________
>
> Tim, are you trying to say that ".", "!", "*", "'", "(", and ")"
> should always be escaped? I haven't seen too many "!", "*", or "'",
> but recommending this for "." would be completely against current
> practice and would force us to use ugly stuff such as
> http://example.org/document%2Ehtml. Surely nobody would do that.
> And XPointers might look ugly, too.
>
> Also, it seems that you are saying that for these, escaped and
> unescaped are interchangeable, but the escaped form should be
> used because the unescaped form is dangerous. But what I understand
> Roy is trying to say in the current wording is that escaping or
> unescaping one of these is not guaranteed to produce the same
> thing.
>
> Also, I don't see a need to use the unescaped form for the characters
> above. As far as I am aware, they are part of the invariant set
> of the ISO 646 family (which the "~" is not), and they are also
> more available on keyboards than "~".
>
> When I read section 2.3, I noted that I would want to see more
> explanation about "counterproductive". I remember a test by Larry,
> where he found that %2e and %2E were correctly unescaped at least
> on http://www.w3.org, but there may be circumstances
> where this is not the case.
>
>
> Regards,    Martin.
>
> Regards,    Martin.
Received on Wednesday, 18 June 2003 22:40:29 UTC