RE: Draft 2 of "How to Compare URIs" from Martin Duerst on 2003-02-17 (www-tag@w3.org from February 2003)

From: Martin Duerst <duerst@w3.org>
Date: Mon, 17 Feb 2003 18:31:40 -0500
To: "Williams, Stuart" <skw@hplb.hpl.hp.com>, "'Tim Bray'" <tbray@textuality.com>
Cc: WWW-Tag <www-tag@w3.org>
Message-Id: <4.2.0.58.J.20030217180716.053bdcc0@localhost>

At 17:38 03/01/06 +0000, Williams, Stuart wrote:

>2) On the topic of %-escape encoding, which I continue to find confusing
>despite the opening sentence in RFC 2396 section 2.1.
>
>RFC 2396 appears to delegate the 'URI Character -> octet' mapping to the URI
>scheme definition. The 4th paragraph of Sec 2.1 begins:
>
>   "A URI scheme may define a mapping from URI
>    characters to octets; whether this is done
>    depends on the scheme."

I have recently bumped into this text, too. I have asked for
clarification on the URI list:
http://lists.w3.org/Archives/Public/uri/2003Jan/0025.html

In particular, I wrote:

 >>>>
   As far as I understand, %hh is always usable, and I don't know
   about any schemes that define explicitly that this can be used.
   It may have been that this paragraph was written to take into
   account schemes such as data:, where an additional mechanism
   for encoding octets (base64) is used. My understanding is that
   even in a data: URI, I should still be able to replace "A" by
   "%41", and it should still resolve to the same data.
 >>>>

I would really like to see an example where escaping differences
in non-reserved characters return different results (as opposed
to merely comparing differently). I'm not aware of any.
Unless some major case turns up, I think it would be very
beneficial if the TAG would nail down the principle that
for purposes of resolution/retrieval, 'a' and '%61', and so
on, have to return the same thing. This would definitely
also be very helpful for IRIs.
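
To make the proposed principle concrete, here is a minimal sketch, not
taken from any specification text. It assumes the unreserved set of
RFC 2396 section 2.3 (alphanumerics plus the "mark" characters); the
function names unescape_unreserved and same_for_retrieval are
hypothetical, chosen only for illustration. Escapes of any other
character are deliberately left untouched.

```python
import re
import string

# Assumption: unreserved characters as listed in RFC 2396 section 2.3
# (alphanumerics plus the "mark" characters).
UNRESERVED = set(string.ascii_letters + string.digits + "-_.!~*'()")

_ESCAPE = re.compile(r"%([0-9A-Fa-f]{2})")


def unescape_unreserved(uri: str) -> str:
    """Replace %hh escapes by the literal character, but only when that
    character is unreserved; every other escape is left exactly as is."""
    def repl(match: re.Match) -> str:
        ch = chr(int(match.group(1), 16))
        return ch if ch in UNRESERVED else match.group(0)
    return _ESCAPE.sub(repl, uri)


def same_for_retrieval(a: str, b: str) -> bool:
    """Hypothetical check: two URIs that differ only in the escaping of
    unreserved characters are treated as naming the same thing."""
    return unescape_unreserved(a) == unescape_unreserved(b)


if __name__ == "__main__":
    print(same_for_retrieval("http://example.com/a", "http://example.com/%61"))
    print(same_for_retrieval("data:,A", "data:,%41"))
    print(same_for_retrieval("http://example.com/a/b", "http://example.com/a%2Fb"))
```

The last comparison involves "/", which is reserved, so its escape is
not touched and the two URIs stay distinct under this sketch.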


>Then, regarding the second mapping RFC 2396 speaks of 'octets -> original
>characters': "A charset defines this mapping." RFC 2396 states "However,
>there is currently no provision within the generic URI syntax to accomplish
>this identification." It then offers possible options including delegation
>of charset default and/or selection mechanism to URI scheme definition.
>
>The URI Scheme registration template RFC2717 includes a field for "character
>encoding consideration". However, on a quick scan of the scheme
>registrations referenced from http://www.iana.org/assignments/uri-schemes I
>couldn't find any that offered any "character encoding consideration" :-)

RFC 2192 does (look for 9. Multinational Considerations). It is probably
not the only one.


>However, I think that there is an upside. Even if the first URI character ->
>octet mapping is scheme dependent, I think that one can be confident that
>for all %xx, for http://example.com/%xx and http://example.com/%xx, the
>octet sequences arising from the first mapping will be identical because the
>same scheme is in use. It's less clear that the second mapping, the charset
>which maps octets to original characters, is going to be the same in all
>contexts (like some of the forms examples)... however, in a given context...
>http://example.com/%xx will be equivalent to itself (surely!).

Yes indeed. For the equivalences discussed in Tim Bray's document,
this 'second mapping', as you call it, is irrelevant.
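
The distinction between the two mappings can be illustrated with a
small sketch. It uses Python's urllib.parse.unquote_to_bytes for the
first mapping (URI characters to octets); the example escape %E9 and
the two charsets are arbitrary choices made for illustration, not
something taken from the discussion above.

```python
from urllib.parse import unquote_to_bytes

# First mapping: URI characters -> octets. The same escape always
# yields the same octet sequence, whatever charset is assumed later.
octets_1 = unquote_to_bytes("%E9")
octets_2 = unquote_to_bytes("%E9")
assert octets_1 == octets_2 == b"\xe9"

# Second mapping: octets -> original characters. This depends on a
# charset, which the generic URI syntax does not identify.
as_latin1 = octets_1.decode("iso-8859-1")
as_cyrillic = octets_1.decode("iso-8859-5")

print(octets_1)
print(as_latin1 != as_cyrillic)
```

Comparing two URIs only needs the first step, which is why the choice
of charset does not affect the equivalences in question.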


Regards,    Martin.

Received on Monday, 17 February 2003 19:55:34 UTC