- From: Chris Lilley <chris@w3.org>
- Date: Mon, 2 Dec 2002 16:10:25 +0100
- To: www-tag@w3.org, "David Orchard" <dorchard@bea.com>
On Saturday, November 30, 2002, 12:54:30 AM, David wrote:

DO> I think this is an excellent idea. We should also make sure that we have
DO> these comparison types easily referencable from other specifications. This
DO> way specs could easily refer into the comparison types.

Here is another comparison type (hostname case insensitive, optional
default port number):

  6 URI Normalization and Equivalence

  In many cases, different URI strings may actually identify the
  identical resource. For example, the host names used in URI are
  actually case insensitive, and the URI <http://www.XEROX.com> is
  equivalent to <http://www.xerox.com>. In general, the rules for
  equivalence and definition of a normal form, if any, are scheme
  dependent. When a scheme uses elements of the common syntax, it will
  also use the common syntax equivalence rules, namely that the scheme
  and hostname are case insensitive and a URI with an explicit ":port",
  where the port is the default for the scheme, is equivalent to one
  where the port is elided.

  http://www.apache.org/~fielding/uri/rev-2002/rfc2396bis.html#rfc.section.6

So under these common syntax equivalence rules,

  http://www.W3.oRg and
  http://www.w3.org:80

are equivalent, but different from

  http://www.w3.org:8086

This seems to require the comparing mechanism to have a table of all the
default port numbers of all schemes - even if (for example, when used as
an XML namespace name) it does not plan to dereference the URI and thus
would not generally need to know the port number.

Which seems to mean that if the spec says the ChrisLilleyFooML namespace
is

  clilley://example.org/FooML

and the XML instance has

  xmlns="clilley://example.org:761/FooML"

then the two are either equivalent or not equivalent depending on the
(unpublished and hypothetical) clilley URI scheme.
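
To make the port-table problem concrete, here is a minimal sketch in Python of the common syntax equivalence rules quoted above. It is an illustration only, not anything taken from RFC 2396bis: the names DEFAULT_PORTS and common_syntax_equivalent are invented for this example, the table is deliberately tiny, and details such as userinfo or an empty path versus "/" are ignored. The function returns None when the answer depends on a default port it does not know.

```python
from urllib.parse import urlsplit

# Assumption: a small hand-written table. A real comparer would need an
# entry for every scheme it might ever be asked about.
DEFAULT_PORTS = {"http": 80, "https": 443, "ftp": 21}


def common_syntax_equivalent(a, b):
    """Return True or False when decidable, None when it is not.

    Scheme and hostname are compared case-insensitively; an explicit
    port equal to the scheme default is treated as if it were elided.
    """
    pa, pb = urlsplit(a), urlsplit(b)
    scheme = pa.scheme.lower()
    if scheme != pb.scheme.lower():
        return False
    if (pa.hostname or "").lower() != (pb.hostname or "").lower():
        return False
    if (pa.path, pa.query, pa.fragment) != (pb.path, pb.query, pb.fragment):
        return False

    port_a, port_b = pa.port, pb.port
    if port_a == port_b:
        return True  # same explicit port, or both elided
    if port_a is not None and port_b is not None:
        return False  # two different explicit ports
    default = DEFAULT_PORTS.get(scheme)
    if default is None:
        return None  # one port elided and the scheme default is unknown
    explicit = port_a if port_a is not None else port_b
    return explicit == default


print(common_syntax_equivalent("http://www.W3.oRg", "http://www.w3.org:80"))
print(common_syntax_equivalent("http://www.w3.org:80", "http://www.w3.org:8086"))
print(common_syntax_equivalent("clilley://example.org/FooML",
                               "clilley://example.org:761/FooML"))
```

The third call is the interesting one: with no table entry for the clilley scheme, the sketch can only report that it cannot decide.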

Perhaps this gives a very practical tie-in to

  http://www.w3.org/TR/2002/WD-webarch-20021115/#URI-scheme

which used to say not to use unregistered schemes, but now does not
(because testing requires use before registration). Perhaps we should
append, after

  "While "myscheme:blort" is a URI that satisfies the syntactic
  constraints of [RFC2396], if "myscheme" is not registered, you are
  not guaranteed that somebody else isn't already using it for
  something else"

the caution "and have no idea what the default port number is".

Either way, what RFC 2396bis says should be referenced from
http://www.textuality.com/tag/uri-comp.html and, while agreeing with
the general comment that "It is not appropriate to enumerate all the
consequences of RFC2396's rules here", the port number rule seems like a
useful example to add.

Or, we could say that URI comparison where the URI is used merely as a
name, and not as an actual dereferenceable network address, does not use
the equivalence scheme from rfc2396bis or its predecessors. In other
words, recommend that 'Simple String Comparison' (or variant as defined
below) is used, not 'RFC 2396-Sensitive Comparison' (which should
perhaps be termed "RFC 2396 common syntax equivalence rules" to
distinguish them from scheme-specific equivalence rules).

Which requires accepting that URI comparison is, indeed, specification
specific. Whether two URIs are equivalent depends on why you want to
know, and what you plan to do with the information. This makes me
uncomfortable - I had some sympathy for TimBL's assertion that URI
comparison is not spec specific - but equally, there is such a wide
range of circumstances where URIs are compared. The constraints and
expected results for comparing two namespace URIs are not the same as,
for example, a proxy cache comparing the incoming URI request with what
resources (including variants and etags and last-modified dates) it has
in its cache.

This, in turn, requires that the different URI equivalence functions had
better start giving themselves names, instead of always using the term
equivalence. (Not in TimB's document, which already does this, but in
other documents). As StuartW pointed out, there are in mathematics a
host of equivalence functions, many of which are not the identity
function.

"Simple string comparison" is one such named function, provided that the
terms string and character are defined. MishaW's mail seemed to give a
useful definition for character. Sources of variability in an XML
document that are removed in parsing, such as NCRs and entity
references, do not affect simple string comparisons done on the parsed
XML source. Sources of variability that persist in the parsed XML
source, such as the case of hex URI escapes, do affect simple string
comparison equivalence. Defaulted port numbers, similarly, do affect
simple string comparison equivalence.

>> From: www-tag-request@w3.org
>> [mailto:www-tag-request@w3.org] On Behalf Of Paul Cotton

>> I wonder if it might be useful to give some examples of how W3C
>> specifications support the comparison techniques outlined in this
>> draft finding.
>>
>> For example in the section entitled "Simple String Comparison" you
>> could point to the op:anyURI-equal function defined in the XQuery 1.0
>> and XPath 2.0 Functions and Operators Working Draft [1].
>>
>> /paulc
>>
>> [1] http://www.w3.org/TR/xquery-operators/#func-anyURI-equal
>>
>> > From: Tim Bray [mailto:tbray@textuality.com]
>> > I just posted, at http://www.textuality.com/tag/uri-comp.html, a
>> > first cut at some finding language in comparing URIs. I'm in Narita
>> > running for a plane so this got less proofreading than I usually
>> > have time for.
>> >
>> > The subject expands remarkably once you start writing it all down.

It sure does.

Hence my request at the TAG f2f that we try and constrain the problem
somewhat - what classes of URI comparison are we planning on addressing
in the finding?

TimB, in your document, in the section entitled "Rules Governing URIs",
the first two paragraphs talk of characters and the third skips on to
bytes without examining the relationship between the two. I agree that
RFC 2396 has the same mistake, hence the need for IRI, but the ambiguity
should at least be noted in passing in that section, I feel. It's
treated later, right at the end of '%-Escaping Issues', but that is too
late to introduce such an important concept.

  "It would seem almost wilfully perverse to consider the characters
  represented respectively by %7A and %7a in the example above as
  different. In fact, since the Namespaces in XML recommendation
  specifies "character-for-character" comparison, it might be argued
  that since %7A and %7a must per RFC2396 represent the same character,
  XML namespaces which differ only in this respect might reasonably be
  considered equal."

Yes, and this is why the definition of a character is important. Simple
string comparison could have a variant, 'Hex-escape-aware String
Comparison', that defines '%7a' and '%7A' and 'z' to be one character,
and to be the same character, distinct from '%5A' and '%5a' and 'Z'. But
currently, simple string comparison rightly considers '%7A' to be three
characters and thus clearly different from 'z', which is one character.
Because, after XML parsing, these sources of variability persist.

So, please add 'Hex-escape-aware String Comparison' to
http://www.textuality.com/tag/uri-comp.html so that it can be discussed
and, ideally in my view, adopted for XML 1.1 namespace comparison. Or,
if it is not adopted and Simple String Comparison is retained, then that
decision should be taken with knowledge of, and documentation of, the
'wilfully perverse' consequences.
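
As a rough illustration of the difference between the two functions, here is a minimal sketch in Python. Nothing in it comes from the draft finding: the function names and the example.org URIs are made up for the example. It also makes two assumptions of its own. First, every ASCII escape is decoded, including escapes of reserved characters such as %2F. Second, escapes of octets above 0x7F are not decoded at all, only case-folded, because deciding which character they stand for needs an encoding that this sketch does not presume.

```python
import re

_ESCAPE = re.compile(r"%([0-9A-Fa-f]{2})")


def simple_string_equal(a, b):
    """Simple String Comparison: character for character, nothing decoded."""
    return a == b


def _to_characters(uri):
    """Collapse each %XX triplet so that it counts as one character."""
    def decode(match):
        value = int(match.group(1), 16)
        if value < 0x80:
            return chr(value)  # '%7A' and '%7a' both become 'z'
        # Non-ASCII octet: leave it escaped, but fold the hex digit case.
        return "%" + match.group(1).upper()
    return _ESCAPE.sub(decode, uri)


def hex_escape_aware_equal(a, b):
    """Hypothetical 'Hex-escape-aware String Comparison'."""
    return _to_characters(a) == _to_characters(b)


upper = "http://example.org/%7A"
lower = "http://example.org/%7a"
plain = "http://example.org/z"

print(simple_string_equal(upper, lower))     # False: 'A' differs from 'a'
print(simple_string_equal(upper, plain))     # False: three characters vs one
print(hex_escape_aware_equal(upper, lower))  # True
print(hex_escape_aware_equal(upper, plain))  # True
print(hex_escape_aware_equal("http://example.org/%5A", plain))  # False: 'Z' vs 'z'
```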

One last point - the example of comparing a namespace URI in an XML
instance with a namespace URI printed in a specification. It seems to me
that there is scope for a lot of variability there, especially with a
printed version of a spec. Is that a space (perhaps forbidden) or a
non-breaking space or an ideographic space? Of course this is not a new
issue - is that a "1" or an "l", an "O" or a "0" etc.

If the hex-aware string comparison scheme were used, then an appendix
could provide an unambiguous and authoritative fully hexified form of
the namespace URI, for incorporation into software; it would match the
unhexified or partially-hexified form correctly, and since it used only
0-9, a-f and % it would be typographically unambiguous even when
printed.

-- 
 Chris                            mailto:chris@w3.org
Received on Monday, 2 December 2002 10:10:34 UTC