- From: Graham Klyne <GK@ninebynine.org>
- Date: Mon, 10 May 2004 12:02:27 +0100
- To: Martin Duerst <duerst@w3.org>, public-iri@w3.org
- Cc: uri@w3.org
Martin,

These comments are based on a quick skim rather than a detailed reading.

Looking at this from an implementer's perspective, I feel it would be helpful if the relationship between the IRI and URI *grammars* were more clearly delineated; e.g. a presentation of IRI syntax that is based on the RFC2396bis grammar, replacing a minimum number of productions. On this basis, it would be easier to see what needs to be changed in a URI parser to yield an IRI parser.

Also, I note that the RFC2396bis grammar has been through several revisions as subtle issues are exposed by review and implementation experience; by replicating the entire grammar (rather than saying that an IRI is like a URI with designated changes), can you be confident that such issues have not been re-introduced?

...

Section 3.1:

There is a subtlety here that is not obvious to one not well-versed in Unicode specifics:

[[
Variant B) If the IRI is in some digital representation (e.g. an octet stream) in some known non-Unicode character encoding: Convert the IRI to a sequence of characters from the UCS normalized according to NFC.

Variant C) If the IRI is in an Unicode-based character encoding (for example UTF-8 or UTF-16): Do not normalize. Move directly to Step 2.
]]

This raises two questions in my mind:

(a) what is the implication of this NFC stuff; I think a brief example would help.

(b) by saying "Move directly to Step 2" it sounds as if this is saying that step 2 should be operated directly on the "Unicode-based character encoding" rather than on the UCS characters, which I don't think is what you intend.

I think something like this is intended:

[[
Variant C) If the IRI is in an Unicode-based character encoding (for example UTF-8 or UTF-16): Do not normalize. Apply step 2 directly to the encoded Unicode character sequence.
]]

...
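
To illustrate the kind of example question (a) asks for, here is a minimal Python sketch. It is not taken from the draft: the helper name `to_uri_path` and the sample string are assumptions made for illustration, and the helper only mimics the "optionally normalize to NFC, then percent-encode the UTF-8 octets" shape of the conversion, not the full Section 3.1 procedure.

```python
import unicodedata
from urllib.parse import quote

# The same visible text, "résumé", written two ways:
# precomposed uses U+00E9; decomposed uses "e" followed by U+0301 (combining acute).
precomposed = "r\u00e9sum\u00e9"
decomposed = "re\u0301sume\u0301"


def to_uri_path(segment: str, normalize: bool) -> str:
    """Hypothetical helper: optionally apply NFC, then percent-encode UTF-8 octets."""
    if normalize:
        segment = unicodedata.normalize("NFC", segment)
    # safe="" so that every character outside the unreserved set is escaped.
    return quote(segment, safe="")


nfc_result = to_uri_path(decomposed, normalize=True)    # Variant B style: normalize first
raw_result = to_uri_path(decomposed, normalize=False)   # Variant C style: no normalization
pre_result = to_uri_path(precomposed, normalize=False)

print(nfc_result)  # r%C3%A9sum%C3%A9
print(raw_result)  # re%CC%81sume%CC%81
print(pre_result)  # r%C3%A9sum%C3%A9
```

In this sketch, normalizing the decomposed input to NFC yields the same escaped form as the precomposed input, whereas leaving it unnormalized (as Variant C prescribes for input already in a Unicode-based encoding) yields a different escaped form for text that looks identical on screen.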

Section 3.2:

Is this really true (about always mapping back to the same URI)?:

[[
3.2 Converting URIs to IRIs

In some situations, it may be desirable to try to convert a URI into an equivalent IRI. This section gives a procedure to do such a conversion. The conversion described in this section will always result in an IRI which maps back to the URI that was used as an input for the conversion (except for potential case differences in percent-encoding). However, the IRI resulting from this conversion may not be exactly the same as the original IRI (if there ever was one).
]]

In light of:

[[
2) Convert all percent-encodings (% followed by two hexadecimal digits) except those corresponding to '%', characters in 'reserved', and characters in US-ASCII not allowed in URIs, to the corresponding octets.
]]

It seems to me that removing percent encodings for non-reserved and other characters is a non-reversible transformation. I think that mapping back to the original URI is only true under escape normalization, per rfc2396bis.

Also, not knowing anything about bidi encodings, it's difficult for me to tell if there's any possible interaction between this and the section 4 material on bidi sequences.

...

Section 5.1:

[[
5.1 Simple String Comparison

In some scenarios a definite answer to the question of IRI equivalence is needed that is independent of the scheme used and always can be calculated quickly and without accessing a network. An example of such a case is XML Namespaces ([XMLNamespace]). In such cases, two IRIs SHOULD be defined as equivalent if and only if they are character-by-character equivalent. This is the same as being byte-by-byte equivalent if the character encoding for both IRIs is the same. As an example, http://example.org/~user, http://example.org/%7euser, and http://example.org/%7Euser are not equivalent under this definition. In such a case, the comparison function MUST NOT map IRIs to URIs, because such a mapping would create additional spurious equivalences.
]]

It's not clear to me what the MUST NOT here is saying. Making normative statements that are conditional on some postulated application scenario seems to be a bit confusing to me. I think the final sentence maybe should be:

[[
The IRI to URI mapping function described above [ref] does not preserve this form of equivalence.
]]

(Further, the MUST NOT here seems even more perverse in light of the introductory material in section 3.1)

I suspect there should be some discouragement of applications depending on this level of equivalence, in view of the spurious distinctions that are lost when IRIs are converted to URIs. To my mind the string equivalence of the URI-converted form seems like the lowest reasonable level of distinction to be encouraged.

...

Section 5.2:

The MUST in the second paragraph seems to be straying inappropriately into application design territory.

...

References

I think RFC2119 should appear under Normative references, not Informative.

I don't know about this, but should [UNIV4] and [UNI9] be normative?

...

Finally, I find myself being vaguely concerned about the complexity and subtlety of this specification. I expect that a lot of software will be written by programmers who are not aware of the various subtle implications of I18N issues. As such, is it realistic to expect such programmers to write robust, interoperable software based on this specification?

Or, another way of addressing this concern: to what extent can the various subtleties described here be wrapped up in a library that can be used successfully by a programmer who is not expert in I18N issues?

(I think part of the difficulty here is the extent to which IRIs straddle wire-protocol and user presentation concerns. I don't normally advocate the idea of standardized APIs, but wonder if this is a case for which defining a common API might help to flush out some of these concerns.)

#g
--

At 11:07 10/05/04 +0900, Martin Duerst wrote:
>Dear URI Experts,
>
>Yesterday, I have announced a 2 week mailing list last call on the
>public-iri@w3.org mailing list for the newest version of the IRI
>spec, which you can find at
>http://www.w3.org/International/iri-edit/draft-duerst-iri-07.txt.
>Additional information is at http://www.w3.org/International/iri-edit/.
>The last call ends on Sunday, May 23, 2004.
>
>I'm copying this announcement (see below) here because the IRI
>draft relies strongly on RFC 2396bis.
>
>So if you have any comments (I of course hope they will be mainly
>of editorial nature), please send them to public-iri@w3.org
>(and not to this list) by May 23.
>
>Regards, Martin.
>
>>Date: Sun, 09 May 2004 21:46:44 +0900
>>To: public-iri@w3.org
>>From: Martin Duerst <duerst@w3.org>
>>Subject: 2 week mailing list last call
>>X-Archived-At:
>>http://www.w3.org/mid/4.2.0.58.J.20040509212717.059f56b8@localhost
>>
>>With no open issues and only two tentatively closed ones remaining,
>>I just submitted draft-duerst-iri-07.txt to the Internet-Drafts
>>Editor. It is also available at
>>http://www.w3.org/International/iri-edit/draft-duerst-iri-07.txt.
>>
>>If you find any new issues or think that an old one needs to be reopened,
>>please say so on this mailing list (public-iri@w3.org) within the next
>>two weeks (up to and including Sunday, May 23).
>>If no significant changes are required, I will send this spec
>>to the IESG afterwards.
>>
>>In other words, this is a two-week mailing list last call.
>>
>>Regards, Martin.
>
>------------
>Graham Klyne
>For email:
>http://www.ninebynine.org/#Contact
Received on Monday, 10 May 2004 09:54:03 UTC