- From: Graham Klyne <GK@ninebynine.org>
- Date: Mon, 10 May 2004 12:02:27 +0100
- To: Martin Duerst <duerst@w3.org>, public-iri@w3.org
- Cc: uri@w3.org
Martin,
These comments are based on a quick skim rather than a detailed reading.
Looking at this from an implementer's perspective, I feel it would be
helpful if the relationship between the IRI and URI *grammars* were more
clearly delineated; e.g. a presentation of IRI syntax that is based on the
RFC2396bis grammar, replacing a minimum number of productions. On this
basis, it would be easier to see what needs to be changed in a URI parser
to yield an IRI parser. Also, I note that the RFC2396bis grammar has been
through several revisions as subtle issues are exposed by review and
implementation experience; by replicating the entire grammar (rather than
saying that an IRI is like a URI with designated changes), can you be
confident that such issues have not been re-introduced?
...
Section 3.1:
There is a subtlety here that is not obvious to one not well-versed in
Unicode specifics:
[[
Variant B) If the IRI is in some digital representation (e.g. an
octet stream) in some known non-Unicode character encoding:
Convert the IRI to a sequence of characters from the UCS
normalized according to NFC.
Variant C) If the IRI is in an Unicode-based character encoding
(for example UTF-8 or UTF-16): Do not normalize. Move directly
to Step 2.
]]
This raises two questions in my mind:
(a) what is the implication of this NFC stuff; I think a brief example
would help.
(b) by saying "Move directly to Step 2" it sounds as if this is saying that
step 2 should be operated directly on the "Unicode-based character
encoding" rather than on the UCS characters, which I don't think is what
you intend. I think something like this is intended:
[[
Variant C) If the IRI is in an Unicode-based character encoding
(for example UTF-8 or UTF-16): Do not normalize. Apply step 2
directly to the encoded Unicode character sequence.
]]
...
Section 3.2:
Is this really true (about always mapping back to the same URI)?:
[[
3.2 Converting URIs to IRIs
In some situations, it may be desirable to try to convert a URI into
an equivalent IRI. This section gives a procedure to do such a
conversion. The conversion described in this section will always
result in an IRI which maps back to the URI that was used as an input
for the conversion (except for potential case differences in
percent-encoding). However, the IRI resulting from this conversion
may not be exactly the same as the original IRI (if there ever was
one).
]]
In light of:
[[
2) Convert all percent-encodings (% followed by two hexadecimal
digits) except those corresponding to '%', characters in
'reserved', and characters in US-ASCII not allowed in URIs, to the
corresponding octets.
]]
It seems to me that removing percent encodings for non-reserved and other
characters is a non-reversible transformation. I think that mapping back
to the original URI is only true under escape normalization, per rfc2396bis.
Also, not knowing anything about bidi encodings, it's difficult for me to
tell if there's any possible interaction between this and the section 4
material on bidi sequences.
...
Section 5.1:
[[
5.1 Simple String Comparison
In some scenarios a definite answer to the question of IRI
equivalence is needed that is independent of the scheme used and
always can be calculated quickly and without accessing a network. An
example of such a case is XML Namespaces ([XMLNamespace]). In such
cases, two IRIs SHOULD be defined as equivalent if and only if they
are character-by-character equivalent. This is the same as being
byte-by-byte equivalent if the character encoding for both IRIs is
the same. As an example,
http://example.org/~user, http://example.org/%7euser, and
http://example.org/%7Euser are not equivalent under this definition.
In such a case, the comparison function MUST NOT map IRIs to URIs,
because such a mapping would create additional spurious equivalences.
]]
It's not clear to me what the MUST NOT here is saying. Making normative
statements that are conditional on some postulated application scenario
seems to be a bit confusing to me. I think the final sentence maybe
should be:
[[
The IRI to URI mapping function described above [ref] does not preserve
this form of equivalence.
]]
(Further, the MUST NOT here seems even more perverse in light of the
introductory material in section 3.1)
I suspect there should be some discouragement of applications depending on
this level of equivalence, in view of the spurious distinctions that are
lost when IRIs are converted to URIs. To my mind the string equivalence
of the URI-converted form seems like the lowest reasonable level of
distinction to be encouraged.
...
Section 5.2:
The MUST in the second paragraph seems to be straying inappropriately into
application design territory.
...
References
I think RFC2119 should appear under Normative references, not Informative.
I don't know about this, but should [UNIV4] and [UNI9] be normative?
...
Finally, I find myself being vaguely concerned about the complexity and
subtlety of this specification. I expect that a lot of software will be
written by programmers who are not aware of the various subtle implications
of I18N issues. As such, will it be a realistic expectation for such
programmers to write robust interoperable software based on this
specification. Or, another way of addressing this concern: to what extent
can the various subtleties described here be wrapped up in a library that
can be used successfully by a programmer who is not expert in I18N issues?
(I think part of the difficulty here is the extent to which IRIs straddle
wire-protocol and user presentation concerns. I don't normally advocate
the idea of standardized APIs, but wonder if this is a case for which
defining a common API might help to flush out some of these concerns.)
#g
--
At 11:07 10/05/04 +0900, Martin Duerst wrote:
>Dear URI Experts,
>
>Yesterday, I have announced a 2 week mailing list last call on the
>public-iri@w3.org mailing list for the newest version of the IRI
>spec, which you can find at
>http://www.w3.org/International/iri-edit/draft-duerst-iri-07.txt.
>Additional information is at http://www.w3.org/International/iri-edit/.
>The last call ends on Sunday, May 23, 2004.
>
>I'm copying this announcement (see below) here because the IRI
>draft relies strongly on RFC 2396bis.
>
>So if you have any comments (I of course hope they will be mainly
>of editorial nature), please send them to public-iri@w3.org
>(and not to this list) by May 23.
>
>Regards, Martin.
>
>>Date: Sun, 09 May 2004 21:46:44 +0900
>>To: public-iri@w3.org
>>From: Martin Duerst <duerst@w3.org>
>>Subject: 2 week mailing list last call
>>X-Archived-At:
>>http://www.w3.org/mid/4.2.0.58.J.20040509212717.059f56b8@localhost
>>
>>With no open issues and only two tentatively closed ones remaining,
>>I just submitted draft-duerst-iri-07.txt to the Internet-Drafts
>>Editor. It is also available at
>>http://www.w3.org/International/iri-edit/draft-duerst-iri-07.txt.
>>
>>If you find any new issues think that an old one needs to be reopened,
>>please say so on this mailing list (public-iri@w3.org) within the next
>>two weeks (up to and including Sunday, May 23).
>>If no significant changes are required, I will send this spec
>>to the IESG afterwards.
>>
>>In other words, this is a two-week mailing list last call.
>>
>>Regards, Martin.
>
>------------
>Graham Klyne
>For email:
>http://www.ninebynine.org/#Contact
Received on Monday, 10 May 2004 09:54:03 UTC