Re: draft-duerst-iri-07.txt: 2 week mailing list last call from Graham Klyne on 2004-05-10 (uri@w3.org from May 2004)

From: Graham Klyne <GK@ninebynine.org>
Date: Mon, 10 May 2004 12:02:27 +0100
To: Martin Duerst <duerst@w3.org>, public-iri@w3.org
Cc: uri@w3.org
Message-Id: <5.1.0.14.2.20040510111010.02bcc498@127.0.0.1>
Martin,

These comments are based on a quick skim rather than a detailed reading.

Looking at this from an implementer's perspective, I feel it would be 
helpful if the relationship between the IRI and URI *grammars* were more 
clearly delineated;  e.g. a presentation of IRI syntax that is based on the 
RFC2396bis grammar, replacing a minimum number of productions.  On this 
basis, it would be easier to see what needs to be changed in a URI parser 
to yield an IRI parser.  Also, I note that the RFC2396bis grammar has been 
through several revisions as subtle issues are exposed by review and 
implementation experience;  by replicating the entire grammar (rather than 
saying that an IRI is like a URI with designated changes), can you be 
confident that such issues have not been re-introduced?

...

Section 3.1:

There is a subtlety here that is not obvious to one not well-versed in 
Unicode specifics:
[[
       Variant B) If the IRI is in some digital representation (e.g. an
          octet stream) in some known non-Unicode character encoding:
          Convert the IRI to a sequence of characters from the UCS
          normalized according to NFC.

       Variant C) If the IRI is in an Unicode-based character encoding
          (for example UTF-8 or UTF-16): Do not normalize. Move directly
          to Step 2.
]]

This raises two questions in my mind:

(a) what is the implication of this NFC stuff;  I think a brief example 
would help.

(b) by saying "Move directly to Step 2" it sounds as if this is saying that 
step 2 should be operated directly on the "Unicode-based character 
encoding" rather than on the UCS characters, which I don't think is what 
you intend.  I think something like this is intended:
[[
       Variant C) If the IRI is in an Unicode-based character encoding
          (for example UTF-8 or UTF-16): Do not normalize.  Apply step 2
          directly to the encoded Unicode character sequence.
]]

...

Section 3.2:

Is this really true (about always mapping back to the same URI)?:
[[
3.2  Converting URIs to IRIs

    In some situations, it may be desirable to try to convert a URI into
    an equivalent IRI. This section gives a procedure to do such a
    conversion. The conversion described in this section will always
    result in an IRI which maps back to the URI that was used as an input
    for the conversion (except for potential case differences in
    percent-encoding). However, the IRI resulting from this conversion
    may not be exactly the same as the original IRI (if there ever was
    one).
]]

In light of:
[[
    2) Convert all percent-encodings (% followed by two hexadecimal
       digits) except those corresponding to '%', characters in
       'reserved', and characters in US-ASCII not allowed in URIs, to the
       corresponding octets.
]]

It seems to me that removing percent encodings for non-reserved and other 
characters is a non-reversible transformation.  I think that mapping back 
to the original URI is only true under escape normalization, per rfc2396bis.

Also, not knowing anything about bidi encodings, it's difficult for me to 
tell if there's any possible interaction between this and the section 4 
material on bidi sequences.

...

Section 5.1:

[[
5.1  Simple String Comparison

    In some scenarios a definite answer to the question of IRI
    equivalence is needed that is independent of the scheme used and
    always can be calculated quickly and without accessing a network. An
    example of such a case is XML Namespaces ([XMLNamespace]). In such
    cases, two IRIs SHOULD be defined as equivalent if and only if they
    are character-by-character equivalent. This is the same as being
    byte-by-byte equivalent if the character encoding for both IRIs is
    the same. As an example,
    http://example.org/~user, http://example.org/%7euser, and
    http://example.org/%7Euser are not equivalent under this definition.
    In such a case, the comparison function MUST NOT map IRIs to URIs,
    because such a mapping would create additional spurious equivalences.
]]

It's not clear to me what the MUST NOT here is saying.  Making normative 
statements that are conditional on some postulated application scenario 
seems to be a bit confusing to me.   I think the final sentence maybe 
should be:
[[
The IRI to URI mapping function described above [ref] does not preserve 
this form of equivalence.
]]

(Further, the MUST NOT here seems even more perverse in light of the 
introductory material in section 3.1)

I suspect there should be some discouragement of applications depending on 
this level of equivalence, in view of the spurious distinctions that are 
lost when IRIs are converted to URIs.   To my mind the string equivalence 
of the URI-converted form seems like the lowest reasonable level of 
distinction to be encouraged.

...

Section 5.2:

The MUST in the second paragraph seems to be straying inappropriately into 
application design territory.

...

References

I think RFC2119 should appear under Normative references, not Informative.

I don't know about this, but should [UNIV4] and [UNI9] be normative?

...

Finally, I find myself being vaguely concerned about the complexity and 
subtlety of this specification.  I expect that a lot of software will be 
written by programmers who are not aware of the various subtle implications 
of I18N issues.  As such, will it be a realistic expectation for such 
programmers to write robust interoperable software based on this 
specification.  Or, another way of addressing this concern:  to what extent 
can the various subtleties described here be wrapped up in a library that 
can be used successfully by a programmer who is not expert in I18N issues?

(I think part of the difficulty here is the extent to which IRIs straddle 
wire-protocol and user presentation concerns.  I don't normally advocate 
the idea of standardized APIs, but wonder if this is a case for which 
defining a common API might help to flush out some of these concerns.)

#g
--

At 11:07 10/05/04 +0900, Martin Duerst wrote:

>Dear URI Experts,
>
>Yesterday, I have announced a 2 week mailing list last call on the
>public-iri@w3.org mailing list for the newest version of the IRI
>spec, which you can find at
>http://www.w3.org/International/iri-edit/draft-duerst-iri-07.txt.
>Additional information is at http://www.w3.org/International/iri-edit/.
>The last call ends on Sunday, May 23, 2004.
>
>I'm copying this announcement (see below) here because the IRI
>draft relies strongly on RFC 2396bis.
>
>So if you have any comments (I of course hope they will be mainly
>of editorial nature), please send them to public-iri@w3.org
>(and not to this list) by May 23.
>
>Regards,   Martin.
>
>>Date: Sun, 09 May 2004 21:46:44 +0900
>>To: public-iri@w3.org
>>From: Martin Duerst <duerst@w3.org>
>>Subject: 2 week mailing list last call
>>X-Archived-At: 
>>http://www.w3.org/mid/4.2.0.58.J.20040509212717.059f56b8@localhost
>>
>>With no open issues and only two tentatively closed ones remaining,
>>I just submitted draft-duerst-iri-07.txt to the Internet-Drafts
>>Editor. It is also available at
>>http://www.w3.org/International/iri-edit/draft-duerst-iri-07.txt.
>>
>>If you find any new issues think that an old one needs to be reopened,
>>please say so on this mailing list (public-iri@w3.org) within the next
>>two weeks (up to and including Sunday, May 23).
>>If no significant changes are required, I will send this spec
>>to the IESG afterwards.
>>
>>In other words, this is a two-week mailing list last call.
>>
>>Regards,    Martin.
>
>------------
>Graham Klyne
>For email:
>http://www.ninebynine.org/#Contact
Received on Monday, 10 May 2004 09:54:03 UTC