Re: draft-duerst-iri-07.txt: 2 week mailing list last call from Martin Duerst on 2004-05-26 (public-iri@w3.org from May 2004)

From: Martin Duerst <duerst@w3.org>
Date: Wed, 26 May 2004 14:45:32 +0900
To: Graham Klyne <gk@ninebynine.org>, public-iri@w3.org
Message-Id: <4.2.0.58.J.20040520155256.0574be08@localhost>
Hello Graham,

At 10:06 04/05/19 +0100, Graham Klyne wrote:

>Martin,
>
>I think our discussion becomes one of specification style.  I appreciate 
>the need to prepare a specification that reflects what applications 
>actually do, but I'd prefer that it not do so by saying that future 
>applications must follow the same procedures as existing applications.

I agree. But the current text doesn't use a MUST, only a SHOULD, and
clearly gives conditions. If there are scenarios where speed is not
of importance to the extent that they need to use character-by-character
comparison, then the text doesn't apply anymore.


>(By "application", I mean here any end-user function that employs the 
>technologies.  So for this debate, I'd view 'XML Namespaces' or 'RDF' as 
>part of the infrastructure whose functionality might or might not be part 
>of an application.)

Well, in that case, the applications really must follow the infrastructure
where they use it, otherwise we get chaos. But that's for these
infrastructure specs to make sure, not for the IRI spec.
[and by the way, that part of the spec avoids any use of the word
'application' :-).]


>Unfortunately, I've lost the context of the original comment, so now that 
>it's been aired I'll leave it to your discretion.

Okay, then I declare this issue closed.

[By the way, if it turns out that some of the MUST/SHOULD don't
make sense when we try to go to Draft Standard, then we will just
have to take them out. That's what I understand the Draft Standard
stage is there for.]


>Regarding your suggested addition:
>> >>>>
>>    On the other hand, even if the only forseen use of an IRI is as
>>    an identifier in scenarios that use character-by-character equivalence,
>>    creators of IRIs should not create IRIs that only differ by 
>> percent-escaping.
>>    As an example, using both http://example.org/~user and
>>    http://example.org/%7Euser to identify XML Namespaces is a bad idea.
>> >>>>
>
>I think this helps, but I'd probably strengthen it a bit, thus:
>[[
>    When an IRI is used as an identifier in scenarios that depend
>    upon character-by-character equivalence, creators of IRIs should
>    take additional care to avoid IRIs that only differ in their use
>    of percent-escaping.
>    As an example, using both http://example.org/~user and
>    http://example.org/%7Euser to identify XML Namespaces is a bad idea.
>]]
>
>(Changes:  (a) avoid reference to "foreseen" use;  (b) try to avoid the 
>implication that it's OK in other scenarios -- I think we'd agree that 
>it's never ideal to coin IRIs that differ only in their use of escaping.)

I have adopted your text, thanks!

Regards,    Martin.

>#g
>--
>
>
>At 14:18 19/05/04 +0900, Martin Duerst wrote:
>>Hello Graham,
>>
>>At 12:44 04/05/12 +0100, Graham Klyne wrote:
>>
>>>At 18:00 12/05/04 +0900, Martin Duerst wrote:
>>>>Hello Graham,
>>>>
>>>>I have made this issue charcompareMUST-31.
>>>>
>>>>
>>>>At 12:02 04/05/10 +0100, Graham Klyne wrote:
>>>>
>>>>
>>>>>Section 5.1:
>>>>>
>>>>>[[
>>>>>5.1  Simple String Comparison
>>>>>
>>>>>    In some scenarios a definite answer to the question of IRI
>>>>>    equivalence is needed that is independent of the scheme used and
>>>>>    always can be calculated quickly and without accessing a network. An
>>>>>    example of such a case is XML Namespaces ([XMLNamespace]). In such
>>>>>    cases, two IRIs SHOULD be defined as equivalent if and only if they
>>>>>    are character-by-character equivalent. This is the same as being
>>>>>    byte-by-byte equivalent if the character encoding for both IRIs is
>>>>>    the same. As an example,
>>>>>    http://example.org/~user, http://example.org/%7euser, and
>>>>>    http://example.org/%7Euser are not equivalent under this definition.
>>>>>    In such a case, the comparison function MUST NOT map IRIs to URIs,
>>>>>    because such a mapping would create additional spurious equivalences.
>>>>>]]
>>>>>
>>>>>It's not clear to me what the MUST NOT here is saying.  Making 
>>>>>normative statements that are conditional on some postulated 
>>>>>application scenario seems to be a bit confusing to me.
>>>>
>>>>If you interpreted the statement as conditional on some application
>>>>scenario, then it is indeed confusing. It was intended conditional
>>>>to the comparison function. I.e. if you use character-by-character
>>>>comparison, you MUST NOT map IRIs to URIs,
>>>>because such a mapping would create additional spurious equivalences.
>>>
>>>I was taking the choice of comparison function to be part of the 
>>>application scenario.
>>
>>Well, I think it depends on what you mean by 'application'. If
>>'XML Namespaces' or 'RDF' are applications, then you are right.
>>If each XML or RDF parser are applications, then it is wrong.
>>
>>
>>>>I have replaced "In such a case" with "When comparing 
>>>>character-by-character".
>>>
>>>I think that's better, though it doesn't quite capture my original 
>>>comment.  (Consider:  as this is given as a normative statement, how do 
>>>you propose to find interoperable implementations to demonstrate 
>>>conformance when moving to Draft Standard?  I still prefer my suggestion 
>>>(below), but  now I've raised the issue I'm happy for you to decide.
>>
>>Well, testing a few XSLT implementations would clearly show that they
>>treat IRIs as such, and don't convert them to URIs, for namespace URIs.
>>I have already done such tests quite a while ago.
>>So I don't think demonstrating conformance would be a big deal for this.
>>
>>
>>>>>I think the final sentence maybe should be:
>>>>>[[
>>>>>The IRI to URI mapping function described above [ref] does not 
>>>>>preserve this form of equivalence.
>>>>>]]
>>>
>>>
>>>>>(Further, the MUST NOT here seems even more perverse in light of the 
>>>>>introductory material in section 3.1)
>>>>
>>>>I have checked that material again, and did not find any problems.
>>>>You may observe that that material is carefully worded in terms of
>>>>retrieval when it comes to IRI->URI mapping, not in terms of
>>>>abstract resource identification.
>>>
>>>OK, ignore that last comment.  (I wasn't specifically thinking about 
>>>abstract identification.)
>>>
>>>But I note that it's not obvious to me that start of section 3.1 is 
>>>subject to the mention of "resource retrieval" that appears in section 3[.0].
>>
>>This is again taken up and explained in more detail in "b) 
>>Interpretational:".
>>
>>
>>>Indeed the fact that the material in 3.1 is also said to apply to 
>>>references and fragment identifiers suggests otherwise.
>>
>>You mean the sentence:
>>"Everything in this section applies also to IRI references and URI 
>>references,
>>as well as components thereof (for example fragment identifiers)."
>>
>>I agree that the 'everything' may be a bit general. But what is 'everything'
>>of 3.1? First, it is the mapping procedure itself. This means that you can
>>map an IRI reference to an URI reference in the same way you can map an
>>IRI to an URI. And you can map a fragid in an IRI to a fragid in an URI
>>in the same way as you map an IRI to an URI.
>>
>>Second, it is the context/purpose. Two purposes are given:
>>- retrieval
>>- syntax check
>>With respect to syntax check, the sentence above means that you can
>>syntax check an IRI reference or a fragid of an IRI by converting to
>>an URI reference or fragid of an URI. With respect to retrieval,
>>with an appropriate base, IRI references and URI references can
>>definitely be retrieved. You are right that retrieval of
>>fragment identifiers does not make that much sense.
>>
>>
>>>Checking for scheme-specific syntax restrictions does not seem to be 
>>>specifically related to resource retrieval.  (cf. URN syntax checking.)
>>
>>Yes, right, I forgot to mention it before.
>>
>>
>>>Looking more closely at point (b) in 3.1, which clearly *is* about 
>>>resource retrieval, I find myself having further qualms:
>>>[[
>>>However, when an IRI is used for resource
>>>retrieval, the resource that the IRI locates is the same as the
>>>one located by the URI obtained after converting the IRI according
>>>to the procedure defined here. This means that there is no need to
>>>define resolution separately on the IRI level.
>>>]]
>>>
>>>This seems to preclude the possibility of defining a resolution protocol 
>>>that uses IRIs natively.
>>
>>No. This just precludes the possibility of defining a resolution protocol
>>that makes a distinction e.g. between '%7e', '%7E', and '~'. Most resolution
>>protocols are not able to make this distinction, because they do not use
>>percent-encoding, and therefore the client resolves these differences.
>>The only protocol I know that doesn't is HTTP, and HTTP also defines
>>that these are equivalent, and servers implement things that way
>>(except where the percent-encoding is used to distinguish reserved
>>characters that the server has to deal with, such as '?').
>>
>>
>>>Effectively, this is an imposition on any future protocol specification 
>>>that can be used to resolve IRIs, which seems like a rather broad 
>>>sweep.  Maybe this is OK, and really is what was intended, but I feel 
>>>compelled to at least mention the point.   If this is what you intend, I 
>>>think the point would usefully be more prominent in the text, and should 
>>>be made a normative assertion;  e.g. a top-level paragraph ala:
>>>[[
>>>When an IRI is used for resource retrieval, >>it must be by means of a 
>>>protocol that
>>>can also be used with URIs, and<<
>>
>>This would be a tautology, because all URIs are by definition IRIs.
>>
>>
>>>the resource that the IRI locates MUST be the same as
>>>the one located by the URI obtained after converting the IRI according 
>>>to the
>>>procedure defined here.
>>>]]
>>
>>This is given normatively in Section 5.2:
>>
>>"For actual resolution, differences in percent-encoding (except for the
>>percent-encoding of reserved characters) MUST always result in the same 
>>resource."
>>
>>Do you want to say that this should be in two places, or that it is
>>in the wrong place?
>>
>>
>>>It might be argued that the text between >> and << is redundant, to the 
>>>extent that any URI is also a valid IRI.
>>
>>Yes indeed, I just did that, sorry.
>>
>>
>>>(But, thinking aloud, ... suppose I wanted to invent a new IRI scheme 
>>>and protocol to serve as a kind of Chinese WordNet, with definitions 
>>>retrievable in much the same way as they are for 
>>>WordNet.  (Notwithstanding that this may not be a good idea for other 
>>>reasons.)  In such a scheme, maybe there is a component which, according 
>>>to the IRI scheme specification, must contain Chinese character 
>>>symbol(s), so there are no URIs that are valid IRIs according to this scheme.
>>
>>No, all the URIs that result from percent-encoding the UTF-8-encoded
>>Chinese characters would be legal. Not necessarily in the protocol
>>(the protocol may specify that these are just transmitted as raw octets),
>>but on the URI/IRI level.
>>
>>
>>>I don't know where this leads.  My main point is to try and raise a 
>>>vaguely plausible scenario in which existence of a URI form for resource 
>>>retrieval may be undesirable.)
>>
>>Well, of course it would not be very desirable to use the percent-encoded
>>form of a Chinese WordNet IRI, at least not for Chinese speakers. But
>>it could be convenient for you, assuming you don't read Chinese.
>>
>>
>>>>>I suspect there should be some discouragement of applications 
>>>>>depending on this level of equivalence, in view of the spurious 
>>>>>distinctions that are lost when IRIs are converted to URIs.   To my 
>>>>>mind the string equivalence of the URI-converted form seems like the 
>>>>>lowest reasonable level of distinction to be encouraged.
>>>>
>>>>Well, there are some serious arguments against this:
>>>>- Some very important applications, in particular XML Namespaces
>>>>   and RDF, use this equivalence. So recommendation against this
>>>>   would cause confusion.
>>>>- Needing to convert to URIs for every comparison is inefficient
>>>>   (that was the main argument for namespaces)
>>>>- Needing to convert to URIs may lead to more URIs (rather than IRIs)
>>>>   floating around, because in some cases, the conversion would
>>>>   leak.
>>>>So that's why we should not go there.
>>>
>>>But these "important applications" are defined in terms of URIs, not IRIs.
>>
>>No, not exactly. Namespaces 1.0 is defined in terms of URIs, but if you test
>>actual implementations (which I have done), it's implemented in terms of 
>>IRIs.
>>Namespaces 1.1 fixes that gap by using IRIs.
>>RDF uses something called RDF URI references, which, if you look closely,
>>are actually IRIs. Please check out
>>http://www.w3.org/TR/2004/REC-rdf-concepts-20040210/#dfn-URI-reference.
>>
>>
>>>I'm not suggesting that one should be required to convert to URIs for 
>>>every comparison,
>>
>>Good.
>>
>>
>>>but that it might be discouraged to rely on differences between IRIs 
>>>that are not present on conversion to URIs.
>>
>>Good point. This has been brought up in various discussions e.g. related
>>to namespaces. In terms of namespaces, it reads: Namespaces compare
>>character-by-character, so don't chage escaping. But don't try to use
>>two namespaces that only differ by escaping, because that would just
>>be stupid and would lead to confusion.
>>
>>I added some text at the end of 5.1:
>>
>> >>>>
>>    On the other hand, even if the only forseen use of an IRI is as
>>    an identifier in scenarios that use character-by-character equivalence,
>>    creators of IRIs should not create IRIs that only differ by 
>> percent-escaping.
>>    As an example, using both http://example.org/~user and
>>    http://example.org/%7Euser to identify XML Namespaces is a bad idea.
>> >>>>
>>
>>I hope this addresses your concern.
>>
>>
>>>I note that your document specifically makes reference to conversion to 
>>>URIs being (notionally) used for a number of purpose, so in this respect 
>>>IRIs are not something whose existence is independent of URIs, and to 
>>>that extent I think to gloss over problems that might arise when 
>>>conversion to URIs is performed may leave room for problems.
>>>
>>>Please note that the general thrust of my comments is not to request any 
>>>change to the actual (normative) specification, but to clearly signal in 
>>>some way that problems might occur if these issues are not observed.
>>>
>>>>I hope the above addresses your concerns.
>>>
>>>I regard this as ultimately your call, and I won't raise any formal 
>>>objection if you don't agree with me, but I may continue to debate the 
>>>matter with you to the extent that it's helpful to you.
>>
>>I hope we are getting closer. The more people reading the spec, the
>>better we can make sure we all know what it is supposed to say.
>>
>>
>>Regards,    Martin.
>
>------------
>Graham Klyne
>For email:
>http://www.ninebynine.org/#Contact
Received on Wednesday, 26 May 2004 05:23:36 UTC