URIEquivalence-15 (was: Re: [Minutes] 22 July TAG teleconf) from Martin Duerst on 2002-07-24 (www-tag@w3.org from July 2002)

From: Martin Duerst <duerst@w3.org>
Date: Thu, 25 Jul 2002 04:36:54 +0900
To: "Ian B. Jacobs" <ij@w3.org>, www-tag@w3.org
Cc: w3c-i18n-ig@w3.org, www-i18n-comments@w3.org
Message-Id: <4.2.0.58.J.20020725042523.04405b58@localhost>
At 18:23 02/07/23 -0400, Ian B. Jacobs wrote:

>   2.6 URIEquivalence-15
>
>     1. Status of URIEquivalence-15. Relation to
>        Character Model of the Web (chapter 4)? See text
>        from TimBL on URI canonicalization and email from
>        Martin in particular.
>
>        TB: This is serious. Martin seems to be saying
>        "deal with it"

Yes, exactly. Thanks!


>        DC: Two reasons:
>
>          1. The only way you can be sure that a consumer
>             will notice that you mean the same thing is
>             that you've spelled it the same way. I think
>             that they're not wrong. Nothing wrong with
>             string compare.
>          2. In general, it's an art to gather that
>             something spelled differently means the same
>             thing.
>
>        TB: If we believe that, should there be a
>        recommendation that "when you do this, only
>        %-escape when you have to, and use lowercase
>        letters." Where should that be written?
>        DC: Shortest path to target is the I18N WG.
>        RFC 2396 applies equally to all URI schemes.
>        Generating absolute from relative URI is not
>        scheme-specific.
>        DO: There are absolutization scheme(s) and
>        things like scheme-specific rules (e.g.,
>        generating an absolute) and we should take
>        this into account when we talk about doing a
>        string compare.
>        RF: Different issues here. There is a syntax
>        mechanism to go from rel URI to abs URI. But
>        no scheme-specific semantics on that. There
>        are scheme-specific fields (e.g., host name)
>        that have equivalence rules. It boils down to
>        this: the most efficient way to deal with
>        these cases is to require a canonical form and
>        compare by bytes.
>
>    [DanC]
>           There's stuff like http://www.w3.org:80/ and
>           http://www.w3.org/ , which are specified, in a
>           scheme-specific manner, to mean the same
>           thing.
>
>    [Ian]
>           DO: So, canonicalize according to scheme and
>           generic rules, then compare.
>           RF: The only entity that does the
>           canonicalization is the URI generator; not at
>           comparison time. Inefficient to canonicalize
>           at compare time.
>
>    [Ian]
>        RF: Making a URI absolute is
>        scheme-independent. That's required so we can
>        add schemes later on.
>        DC: There was a backlash in the XML community
>        about saying absolutize.
>        TB: That was a different issue.
>        DC: I don't understand the difference.
>        DO: Namespaces used as identifiers rather than
>        for dereferencing. Requiring absolute URIs was
>        meant to facilitate authoring.
>        TB: I hear people arguing that string
>        comparison is necessary. I think there needs
>        to be a statement of principle to get good
>        results:
>
>       1. Don't use %-escape unless you have to.
>       2. Use lowercase when doing so.
>
>        TB: Where do we take these suggestions? (a)
>        We have a section in the arch doc on comparing
>        URIs or (b) ask I18N WG to deal with this.
>        RF: Or add a stronger suggestion to the URI
>        spec itself.
>        TB: That's a wonderful answer!
>        RF: I can add this to the issues list (section
>        on URI canonicalization). I can't promise that
>        it will be answered there.

I think it belongs in an updated version of the URI spec.
But because it's of particular importance for IRIs, and
because I think the IRI spec will move ahead before the
revision of the URI spec, I have added something in the
editing version of the IRI spec.
(see http://www.w3.org/International/Group/iri-edit/
for those who have member access):

 >>>>
    2) Convert each octet to %hh, where hh is the hexadecimal
       notation of the octet value.  Note: This is identical to
       the escaping mechanism in Section 2.4.1 of [RFC2396].
       Note: To reduce variability, the hexadecimal notation
       should use lower case letters.
 >>>>

This earlier read:
<<<<
    2) Convert each octet to %HH, where HH is the hexadecimal
       notation of the octet value.  Note: This is identical to
       the escaping mechanism in Section 2.4.1 of [RFC2396].
<<<<
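
To make the effect of the new note concrete, here is a minimal
sketch of step 2 with lower case hex digits. It is only an
illustration, not text from the draft: the function name
escape_non_ascii is hypothetical, and it assumes that the
preceding step produces UTF-8 octets and that only non-ASCII
characters need escaping.

```python
def escape_non_ascii(iri: str) -> str:
    """Replace each non-ASCII character by %hh escapes of its octets.

    Assumptions (not taken from the draft text): the octets come
    from UTF-8, and ASCII characters are passed through untouched.
    """
    out = []
    for ch in iri:
        if ord(ch) < 0x80:
            # ASCII: keep as is ("don't %-escape unless you have to").
            out.append(ch)
        else:
            # One %hh per octet, with lower case hexadecimal digits.
            out.extend("%{:02x}".format(octet) for octet in ch.encode("utf-8"))
    return "".join(out)


if __name__ == "__main__":
    print(escape_non_ascii("http://www.example/r\u00e9sum\u00e9"))
```

If every generator follows the same case convention, two
escaped forms of the same IRI come out identical, which is
what makes plain string comparison workable.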

Any comments appreciated.
("1. Don't use %-escape unless you have to." is already covered.)


Regards,    Martin.


>        DC: I don't think we should punt this
>        entirely. For URIs, it's fine to do string
>        compare. For URI references, it's fine to
>        absolutize and then do string compare. That
>        works for me.
>        SW: I agree with TB that we should have
>        something in arch doc. That should be in sync
>        with the emerging URI spec.
>        DO: How about as little as "there are good
>        rules for doing this; go see the URI spec and
>        the IRI specs for more info..."
>
>    [DanC]
>        "Can the same resource have different URIs?
>        Does http://WWW.EXAMPLE/ identify the same
>        resource as http://www.example/?"
>        -- FAQ on URIs
>
>    [Ian]
>        DC: Is it useful to do a finding in the mean
>        time?
>        IJ: I hope to harvest from Dan's FAQ.
>        TB: I think that if in arch doc, probably
>        don't need a finding.
>        Action IJ: Harvest from Dan's FAQ for arch
>        document.
>
>    Resolved: the Arch Doc should mention this issue.
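
The approach discussed in the minutes (absolutize, apply the
few scheme-specific rules once when generating the URI, then
compare strings) can be sketched roughly as follows. This is
an illustration under stated assumptions, not a rule from any
spec: the function name canonical and the DEFAULT_PORTS table
are hypothetical, only the http default port is covered, and
userinfo is dropped for brevity.

```python
from urllib.parse import urljoin, urlsplit, urlunsplit

# Assumption: a tiny table of scheme-specific default ports.
DEFAULT_PORTS = {"http": 80}


def canonical(ref: str, base: str) -> str:
    """Absolutize a URI reference, then normalize host case and default port."""
    # Generic, scheme-independent step: relative reference to absolute URI.
    parts = urlsplit(urljoin(base, ref))
    # urlsplit reports the scheme and hostname in lower case.
    host = parts.hostname or ""
    if ":" in host:
        # IPv6 literal: hostname strips the brackets, so restore them.
        host = "[" + host + "]"
    netloc = host
    # Scheme-specific step: omit the port when it is the default one.
    if parts.port is not None and parts.port != DEFAULT_PORTS.get(parts.scheme):
        netloc = "{}:{}".format(host, parts.port)
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))


if __name__ == "__main__":
    base = "http://www.w3.org/TR/"
    # After canonicalization, comparison is a plain string compare.
    print(canonical("http://www.w3.org:80/", base) == canonical("../", base))
```

In this sketch the cost of canonicalization is paid by
whoever generates the URI; consumers only ever run ==.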
Received on Wednesday, 24 July 2002 15:44:16 UTC