Re: draft-duerst-iri-07.txt: 2 week mailing list last call from Martin Duerst on 2004-05-19 (public-iri@w3.org from May 2004)

From: Martin Duerst <duerst@w3.org>
Date: Wed, 19 May 2004 14:18:16 +0900
To: Graham Klyne <gk@ninebynine.org>, public-iri@w3.org
Message-Id: <4.2.0.58.J.20040519132613.05ff2ef0@localhost>

Hello Graham,

At 12:44 04/05/12 +0100, Graham Klyne wrote:

>At 18:00 12/05/04 +0900, Martin Duerst wrote:
>>Hello Graham,
>>
>>I have made this issue charcompareMUST-31.
>>
>>
>>At 12:02 04/05/10 +0100, Graham Klyne wrote:
>>
>>
>>>Section 5.1:
>>>
>>>[[
>>>5.1  Simple String Comparison
>>>
>>>    In some scenarios a definite answer to the question of IRI
>>>    equivalence is needed that is independent of the scheme used and
>>>    always can be calculated quickly and without accessing a network. An
>>>    example of such a case is XML Namespaces ([XMLNamespace]). In such
>>>    cases, two IRIs SHOULD be defined as equivalent if and only if they
>>>    are character-by-character equivalent. This is the same as being
>>>    byte-by-byte equivalent if the character encoding for both IRIs is
>>>    the same. As an example,
>>>    http://example.org/~user, http://example.org/%7euser, and
>>>    http://example.org/%7Euser are not equivalent under this definition.
>>>    In such a case, the comparison function MUST NOT map IRIs to URIs,
>>>    because such a mapping would create additional spurious equivalences.
>>>]]
>>>
>>>It's not clear to me what the MUST NOT here is saying.  Making normative 
>>>statements that are conditional on some postulated application scenario 
>>>seems to be a bit confusing to me.
>>
>>If you interpreted the statement as conditional on some application
>>scenario, then it is indeed confusing. It was intended conditional
>>to the comparison function. I.e. if you use character-by-character
>>comparison, you MUST NOT map IRIs to URIs,
>>because such a mapping would create additional spurious equivalences.
>
>I was taking the choice of comparison function to be part of the 
>application scenario.

Well, I think it depends on what you mean by 'application'. If
'XML Namespaces' or 'RDF' are applications, then you are right.
If each individual XML or RDF parser is an application, then that is wrong.


>>I have replaced "In such a case" with "When comparing 
>>character-by-character".
>
>I think that's better, though it doesn't quite capture my original 
>comment.  (Consider:  as this is given as a normative statement, how do 
>you propose to find interoperable implementations to demonstrate 
>conformance when moving to Draft Standard?  I still prefer my suggestion 
>(below), but  now I've raised the issue I'm happy for you to decide.

Well, testing a few XSLT implementations would clearly show that they
treat namespace names as IRIs, as they are, and don't convert them to URIs.
I already did such tests quite a while ago, so I don't think
demonstrating conformance would be a big deal for this.
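
As an illustration, the behaviour such a test looks for can be sketched in Python. The function names and the example namespace names are made up for this sketch, and the mapping shown is a simplification (UTF-8 plus percent-encoding of non-ASCII characters only), not the full procedure of Section 3.1.

```python
def to_uri(iri: str) -> str:
    """Simplified IRI-to-URI mapping (an assumption for this sketch):
    percent-encode the UTF-8 octets of every non-ASCII character and
    leave all ASCII characters untouched."""
    out = []
    for ch in iri:
        if ord(ch) < 128:
            out.append(ch)
        else:
            out.extend(f"%{b:02X}" for b in ch.encode("utf-8"))
    return "".join(out)


def same_namespace(a: str, b: str) -> bool:
    """Simple string comparison as in Section 5.1: no mapping, no unescaping."""
    return a == b


ns1 = "http://example.org/r\u00e9sum\u00e9"   # contains U+00E9 directly
ns2 = "http://example.org/r%C3%A9sum%C3%A9"   # same octets, percent-encoded

print(same_namespace(ns1, ns2))    # False: two distinct namespace names
print(to_uri(ns1) == to_uri(ns2))  # True: mapping first makes them collide
```

A comparison function that mapped to URIs before comparing would report the second result, which is the kind of spurious equivalence the MUST NOT is meant to rule out.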


>>>I think the final sentence maybe should be:
>>>[[
>>>The IRI to URI mapping function described above [ref] does not preserve 
>>>this form of equivalence.
>>>]]
>
>
>>>(Further, the MUST NOT here seems even more perverse in light of the 
>>>introductory material in section 3.1)
>>
>>I have checked that material again, and did not find any problems.
>>You may observe that that material is carefully worded in terms of
>>retrieval when it comes to IRI->URI mapping, not in terms of
>>abstract resource identification.
>
>OK, ignore that last comment.  (I wasn't specifically thinking about 
>abstract identification.)
>
>But I note that it's not obvious to me that start of section 3.1 is 
>subject to the mention of "resource retrieval" that appears in section 3[.0].

This is again taken up and explained in more detail in "b) Interpretational:".


>Indeed the fact that the material in 3.1 is also said to apply to 
>references and fragment identifiers suggests otherwise.

You mean the sentence:
"Everything in this section applies also to IRI references and URI references,
as well as components thereof (for example fragment identifiers)."

I agree that the 'everything' may be a bit general. But what is 'everything'
of 3.1? First, it is the mapping procedure itself. This means that you can
map an IRI reference to a URI reference in the same way you can map an
IRI to a URI. And you can map a fragid in an IRI to a fragid in a URI
in the same way as you map an IRI to a URI.

Second, it is the context/purpose. Two purposes are given:
- retrieval
- syntax check
With respect to syntax check, the sentence above means that you can
syntax check an IRI reference or a fragid of an IRI by converting to
a URI reference or fragid of a URI. With respect to retrieval,
with an appropriate base, IRI references and URI references can
definitely be retrieved. You are right that retrieval of
fragment identifiers does not make that much sense.


>Checking for scheme-specific syntax restrictions does not seem to be 
>specifically related to resource retrieval.  (cf. URN syntax checking.)

Yes, right, I forgot to mention it before.


>Looking more closely at point (b) in 3.1, which clearly *is* about 
>resource retrieval, I find myself having further qualms:
>[[
>However, when an IRI is used for resource
>retrieval, the resource that the IRI locates is the same as the
>one located by the URI obtained after converting the IRI according
>to the procedure defined here. This means that there is no need to
>define resolution separately on the IRI level.
>]]
>
>This seems to preclude the possibility of defining a resolution protocol 
>that uses IRIs natively.

No. This just precludes the possibility of defining a resolution protocol
that makes a distinction e.g. between '%7e', '%7E', and '~'. Most resolution
protocols are not able to make this distinction, because they do not use
percent-encoding, and therefore the client resolves these differences.
The only protocol I know that does use percent-encoding is HTTP, and HTTP
also defines that these are equivalent, and servers implement things that way
(except where the percent-encoding is used to distinguish reserved
characters that the server has to deal with, such as '?').
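
The equivalence described above can be sketched as a small normalization step. This is only an illustration: the function name is hypothetical, and the set of characters treated as unreserved (letters, digits, "-", ".", "_", "~") is an assumption taken from the generic URI syntax.

```python
import re

# Assumed unreserved set: ASCII letters, digits, "-", ".", "_", "~".
UNRESERVED = set(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789-._~"
)

_PCT = re.compile(r"%([0-9A-Fa-f]{2})")


def normalize_escapes(uri: str) -> str:
    """Decode percent-escapes of unreserved characters and upper-case the
    hex digits of all other escapes. Escaped reserved characters such as
    '?' stay escaped, so they remain distinct from their literal form."""
    def fix(match):
        ch = chr(int(match.group(1), 16))
        if ch in UNRESERVED:
            return ch
        return "%" + match.group(1).upper()
    return _PCT.sub(fix, uri)


variants = [
    "http://example.org/~user",
    "http://example.org/%7euser",
    "http://example.org/%7Euser",
]
print({normalize_escapes(v) for v in variants})
# {'http://example.org/~user'}

print(normalize_escapes("http://example.org/a%3fb"))
# http://example.org/a%3Fb  (not the same as http://example.org/a?b)
```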


>Effectively, this is an imposition on any future protocol specification 
>that can be used to resolve IRIs, which seems like a rather broad 
>sweep.  Maybe this is OK, and really is what was intended, but I feel 
>compelled to at least mention the point.   If this is what you intend, I 
>think the point would usefully be more prominent in the text, and should 
>be made a normative assertion;  e.g. a top-level paragraph ala:
>[[
>When an IRI is used for resource retrieval, >>it must be by means of a 
>protocol that
>can also be used with URIs, and<<

This would be a tautology, because all URIs are by definition IRIs.


>the resource that the IRI locates MUST be the same as
>the one located by the URI obtained after converting the IRI according to the
>procedure defined here.
>]]

This is given normatively in Section 5.2:

"For actual resolution, differences in percent-encoding (except for the
percent-encoding of reserved characters) MUST always result in the same 
resource."

Do you want to say that this should be in two places, or that it is
in the wrong place?


>It might be argued that the text between >> and << is redundant, to the 
>extent that any URI is also a valid IRI.

Yes indeed, I just did that, sorry.


>(But, thinking aloud, ... suppose I wanted to invent a new IRI scheme and 
>protocol to serve as a kind of Chinese WordNet, with definitions 
>retrievable in much the same way as they are for 
>WordNet.  (Notwithstanding that this may not be a good idea for other 
>reasons.)  In such a scheme, maybe there is a component which, according 
>to the IRI scheme specification, must contain Chinese character symbol(s), 
>so there are no URIs that are valid IRIs according to this scheme.

No, all the URIs that result from percent-encoding the UTF-8-encoded
Chinese characters would be legal. Not necessarily in the protocol
(the protocol may specify that these are just transmitted as raw octets),
but on the URI/IRI level.
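
A minimal sketch of that point, using Python's standard `urllib.parse`. The scheme name `cwn:` and the IRI layout are invented for the example; no such scheme is defined anywhere in this thread.

```python
from urllib.parse import quote, unquote

word = "\u4e2d\u6587"                  # two Chinese characters
iri = "cwn:" + word                    # hypothetical IRI with raw characters
uri = "cwn:" + quote(word, safe="")    # UTF-8 octets, percent-encoded

print(uri)                   # cwn:%E4%B8%AD%E6%96%87
print(unquote(uri) == iri)   # True: the URI form carries the same characters
```

Whether a protocol for such a scheme would transmit the escaped form or the raw octets is a separate question from whether the URI form is legal.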


>I don't know where this leads.  My main point is to try and raise a 
>vaguely plausible scenario in which existence of a URI form for resource 
>retrieval may be undesirable.)

Well, of course it would not be very desirable to use the percent-encoded
form of a Chinese WordNet IRI, at least not for Chinese speakers. But
it could be convenient for you, assuming you don't read Chinese.


>>>I suspect there should be some discouragement of applications depending 
>>>on this level of equivalence, in view of the spurious distinctions that 
>>>are lost when IRIs are converted to URIs.   To my mind the string 
>>>equivalence of the URI-converted form seems like the lowest reasonable 
>>>level of distinction to be encouraged.
>>
>>Well, there are some serious arguments against this:
>>- Some very important applications, in particular XML Namespaces
>>   and RDF, use this equivalence. So recommendation against this
>>   would cause confusion.
>>- Needing to convert to URIs for every comparison is inefficient
>>   (that was the main argument for namespaces)
>>- Needing to convert to URIs may lead to more URIs (rather than IRIs)
>>   floating around, because in some cases, the conversion would
>>   leak.
>>So that's why we should not go there.
>
>But these "important applications" are defined in terms of URIs, not IRIs.

No, not exactly. Namespaces 1.0 is defined in terms of URIs, but if you test
actual implementations (which I have done), it's implemented in terms of IRIs.
Namespaces 1.1 fixes that gap by using IRIs.
RDF uses something called RDF URI references, which, if you look closely,
are actually IRIs. Please check out
http://www.w3.org/TR/2004/REC-rdf-concepts-20040210/#dfn-URI-reference.


>I'm not suggesting that one should be required to convert to URIs for 
>every comparison,

Good.


>but that it might be discouraged to rely on differences between IRIs that 
>are not present on conversion to URIs.

Good point. This has been brought up in various discussions e.g. related
to namespaces. In terms of namespaces, it reads: Namespaces compare
character-by-character, so don't change escaping. But don't try to use
two namespaces that only differ by escaping, because that would just
be stupid and would lead to confusion.

I added some text at the end of 5.1:

 >>>>
    On the other hand, even if the only foreseen use of an IRI is as
    an identifier in scenarios that use character-by-character equivalence,
    creators of IRIs should not create IRIs that only differ by
    percent-escaping.
    As an example, using both http://example.org/~user and
    http://example.org/%7Euser to identify XML Namespaces is a bad idea.
 >>>>

I hope this addresses your concern.
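
For someone creating namespace names, the advice above could be checked with something like the following sketch. The function name is hypothetical, and using `urllib.parse.unquote` is a deliberately coarse assumption: it also decodes escaped reserved characters, which is acceptable for a warning but is not an equivalence rule.

```python
from collections import defaultdict
from urllib.parse import unquote


def escaping_collisions(names):
    """Group names by their fully unescaped form and return every group
    whose members are distinct strings, i.e. differ only by escaping."""
    groups = defaultdict(list)
    for name in names:
        groups[unquote(name)].append(name)
    return [g for g in groups.values() if len(set(g)) > 1]


names = [
    "http://example.org/~user",
    "http://example.org/%7Euser",
    "http://example.org/other",
]
print(escaping_collisions(names))
# [['http://example.org/~user', 'http://example.org/%7Euser']]
```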


>I note that your document specifically makes reference to conversion to 
>URIs being (notionally) used for a number of purpose, so in this respect 
>IRIs are not something whose existence is independent of URIs, and to that 
>extent I think to gloss over problems that might arise when conversion to 
>URIs is performed may leave room for problems.
>
>Please note that the general thrust of my comments is not to request any 
>change to the actual (normative) specification, but to clearly signal in 
>some way that problems might occur if these issues are not observed.
>
>>I hope the above addresses your concerns.
>
>I regard this as ultimately your call, and I won't raise any formal 
>objection if you don't agree with me, but I may continue to debate the 
>matter with you to the extent that it's helpful to you.

I hope we are getting closer. The more people reading the spec, the
better we can make sure we all know what it is supposed to say.


Regards,    Martin.
Received on Wednesday, 19 May 2004 03:08:43 UTC