Re: Posted draft of URI comparison finding

On Saturday, November 30, 2002, 12:54:30 AM, David wrote:

DO> I think this is an excellent idea.  We should also make sure that we have
DO> these comparison types easily referencable from other specifications.  This
DO> way specs could easily refer into the comparison types.

Here is another comparison type (hostname case-insensitive, optional
default port number):

6  URI Normalization and Equivalence

In many cases, different URI strings may actually identify the
identical resource. For example, the host names used in URI are
actually case insensitive, and the URI <http://www.XEROX.com> is
equivalent to <http://www.xerox.com>. In general, the rules for
equivalence and definition of a normal form, if any, are scheme
dependent. When a scheme uses elements of the common syntax, it will
also use the common syntax equivalence rules, namely that the scheme
and hostname are case insensitive and a URI with an explicit ":port",
where the port is the default for the scheme, is equivalent to one
where the port is elided.
http://www.apache.org/~fielding/uri/rev-2002/rfc2396bis.html#rfc.section.6

So under these common syntax equivalence rules,

http://www.W3.oRg  and
http://www.w3.org:80

are equivalent, but different from

http://www.w3.org:8086

This seems to require the comparing mechanism to have a table of all
the default port numbers of all schemes - even if (for example, when
used as an XML namespace name) it does not plan to dereference the URI
and thus would not generally need to know the port number.
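A minimal sketch of that comparison in Python, to make the dependency on a
port table concrete. The DEFAULT_PORTS table and the function names are
assumptions for illustration only; the sketch ignores userinfo and treats
path, query and fragment as plain strings.

```python
from urllib.parse import urlsplit

# Illustrative table only (an assumption): a real comparator would need the
# default port of every scheme it might ever be handed.
DEFAULT_PORTS = {"http": 80, "https": 443, "ftp": 21}

def common_syntax_key(uri):
    """Build a comparison key: scheme and host lower-cased, default port elided."""
    parts = urlsplit(uri)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    port = parts.port
    # An explicit port equal to the scheme default is treated as absent.
    if port is not None and port == DEFAULT_PORTS.get(scheme):
        port = None
    return (scheme, host, port, parts.path, parts.query, parts.fragment)

def common_syntax_equivalent(a, b):
    return common_syntax_key(a) == common_syntax_key(b)

print(common_syntax_equivalent("http://www.W3.oRg", "http://www.w3.org:80"))
print(common_syntax_equivalent("http://www.w3.org:80", "http://www.w3.org:8086"))
```

The first call should report the two URIs as equivalent and the second as
different, but only because "http" happens to be in the table.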

Which seems to mean that

(the spec says)
ChrisLilleyFooML namespace is clilley://example.org/FooML

and
(the xml instance has)
xmlns="clilley://example.org:761/FooML"

are either equivalent or not equivalent, depending on the (unpublished
and hypothetical) clilley URI scheme.
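As a rough illustration, a comparator that is honest about unknown schemes
has to return three answers rather than two. The function name and the
KNOWN_DEFAULT_PORTS table are hypothetical, and the sketch looks only at the
port, assuming the rest of the two URIs already matches.

```python
from urllib.parse import urlsplit

# Assumed, deliberately incomplete table; "clilley" is not in it.
KNOWN_DEFAULT_PORTS = {"http": 80, "https": 443}

def port_equivalence(a, b):
    """Return True or False when decidable, None when the answer depends on
    a default port that the comparator does not know."""
    pa, pb = urlsplit(a), urlsplit(b)
    if pa.port == pb.port:
        return True            # same explicit port, or both elided
    if pa.port is not None and pb.port is not None:
        return False           # two different explicit ports
    scheme = pa.scheme.lower()
    if scheme not in KNOWN_DEFAULT_PORTS:
        return None            # cannot tell without the scheme definition
    explicit = pa.port if pa.port is not None else pb.port
    return explicit == KNOWN_DEFAULT_PORTS[scheme]

print(port_equivalence("clilley://example.org/FooML",
                       "clilley://example.org:761/FooML"))
```

For the clilley pair the sketch can only answer "unknown", which is the
problem described above.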

Perhaps this gives a very practical tie-in to
http://www.w3.org/TR/2002/WD-webarch-20021115/#URI-scheme

which used to say not to use unregistered schemes, but now does not
(because testing requires use before registration).

Perhaps we should append, after "While "myscheme:blort" is a URI that
satisfies the syntactic constraints of [RFC2396], if "myscheme" is not
registered, you are not guaranteed that somebody else isn't already
using it for something else"  the caution "and have no idea what the
default port number is".

Either way, what RFC 2396bis says should be referenced from
http://www.textuality.com/tag/uri-comp.html. While I agree with the
general comment that "It is not appropriate to enumerate all the
consequences of RFC2396's rules here", the port number rule seems like
a useful example to add.

Or, we could say that URI comparison where the URI is used merely as a
name, and not as an actual dereferenceable network address, does not
use the equivalence rules from rfc2396bis or its predecessors. In
other words, recommend that 'Simple String Comparison' (or a variant as
defined below) be used, not 'RFC 2396-Sensitive Comparison' (which
should perhaps be termed "RFC 2396 common syntax equivalence rules" to
distinguish them from scheme-specific equivalence rules).

Which requires accepting that URI comparison is, indeed, specification
specific. Whether two URIs are equivalent depends on why you want to
know, and what you plan to do with the information. This makes me
uncomfortable - I had some sympathy for TimBL's assertion that URI
comparison is not spec specific - but equally, there is such a wide
range of circumstances where URIs are compared. The constraints and
expected results for comparing two namespace URIs are not the same
as for, say, a proxy cache comparing the incoming URI request with
what resources (including variants and etags and last modify dates) it
has in its cache.

This, in turn, requires that the different URI equivalence functions
had better start giving themselves names, instead of always using the
term equivalence. (Not in TimB's document, which already does
this, but in other documents.) As StuartW pointed out, there are in
mathematics a host of equivalence functions, many of which are not the
identity function.

"Simple string comparison" is one such named function, provided that
the terms string and character are defined. MishaW's mail seemed to
give a useful definition for character. Sources of variability in an
XML document that are removed in parsing, such as NCRs and entity
references, do not affect simple string comparisons done on the parsed
XML source. Sources of variability that persist in the parsed XML
source, such as the case of hex URI escapes, do affect simple string
comparison equivalence. Defaulted port numbers, similarly, do affect
simple string comparison equivalence.
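A small Python sketch of those three cases, using the standard library XML
parser. The element, the attribute name "ns" and the example.org URIs are
made up for illustration, and Python's code point equality on str is used as
a stand-in for "simple string comparison".

```python
import xml.etree.ElementTree as ET

def simple_string_equal(a, b):
    # Python compares str values code point by code point, with no URI knowledge.
    return a == b

# The NCR &#x4D; is resolved to "M" by the parser, so it never reaches
# the comparison.
doc = ET.fromstring('<doc ns="http://example.org/Foo&#x4D;L"/>')
from_ncr = doc.get("ns")

# Removed by parsing: compares equal.
print(simple_string_equal(from_ncr, "http://example.org/FooML"))
# Case of hex escapes persists after parsing: compares unequal.
print(simple_string_equal("http://example.org/%7euser",
                          "http://example.org/%7Euser"))
# Defaulted port persists after parsing: compares unequal.
print(simple_string_equal("http://www.w3.org", "http://www.w3.org:80"))
```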

>> From: www-tag-request@w3.org [mailto:www-tag-request@w3.org]
>> On Behalf Of Paul Cotton

>> I wonder if it might be useful to give some examples of how W3C
>> specifications support the comparison techniques outlined in this
>> draft finding.
>> 
>> For example in the section entitled "Simple String Comparison" you
>> could point to the op:anyURI-equal function defined in the XQuery 1.0
>> and XPath 2.0 Functions and Operators Working Draft [1].
>> 
>> /paulc
>> 
>> [1] http://www.w3.org/TR/xquery-operators/#func-anyURI-equal 

>> > From: Tim Bray [mailto:tbray@textuality.com]

>> > I just posted, at http://www.textuality.com/tag/uri-comp.html, a
>> > first cut at some finding language in comparing URIs.  I'm in
>> > Narita running for a plane so this got less proofreading than I
>> > usually have time for.
>> > 
>> > The subject expands remarkably once you start writing it all down.

It sure does. Hence my request at the TAG f2f that we try and
constrain the problem somewhat - what classes of URI comparison are we
planning on addressing in the finding?

TimB, in your document, section entitled "Rules Governing URIs" the
first two paragraphs talk of characters and the third skips on to
bytes without examining the relationship between the two. I agree that
RFC 2396 has the same mistake, hence the need for IRI, but the
ambiguity should at least be noted in passing in that section, I feel.
It's treated later, right at the end of '%-Escaping Issues' but that is
too late to introduce such an important concept.

"It would seem almost wilfully perverse to consider the characters
represented respectively by %7A and %7a in the example above as
different. In fact, since the Namespaces in XML recommendation
specifies "character-for-character" comparison, it might be argued
that since %7A and %7a must per RFC2396 represent the same character,
XML namespaces which differ only in this respect might reasonably be
considered equal."

Yes, and this is why the definition of a character is important.
Simple string comparison could have a variant, 'Hex-escape-aware
String Comparison', that defines '%7a' and '%7A' and 'z' to be one
character, and to be the same character, distinct from '%5A' and '%5a'
and 'Z'.

But currently, simple string comparison rightly considers '%7A' to be
three characters and thus clearly different from 'z', which is one
character, because these sources of variability persist after XML
parsing.
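One possible reading of 'Hex-escape-aware String Comparison', sketched in
Python. It leans on the standard library's unquote_to_bytes, which turns
every %XX triplet into its octet and encodes any unescaped non-ASCII
character as UTF-8; both the UTF-8 choice and the fact that reserved
characters such as %2F are decoded too are assumptions of this sketch, not
something the proposal above settles.

```python
from urllib.parse import unquote_to_bytes

def simple_string_equal(a, b):
    # Current behaviour: '%7A' is three characters, 'z' is one.
    return a == b

def hex_escape_aware_equal(a, b):
    # Compare after each %XX triplet has been replaced by the octet it denotes.
    return unquote_to_bytes(a) == unquote_to_bytes(b)

for left, right in [("%7a", "%7A"), ("%7A", "z"), ("%5A", "Z"), ("%7a", "%5a")]:
    print(left, right,
          simple_string_equal(left, right),
          hex_escape_aware_equal(left, right))
```

Under this sketch '%7a', '%7A' and 'z' all compare equal to one another and
unequal to '%5A', '%5a' and 'Z', whereas simple string comparison treats
every pair in the loop as different.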

So, please add 'Hex-escape-aware String Comparison' to
http://www.textuality.com/tag/uri-comp.html
so that it can be discussed and, ideally in my view, adopted for XML
1.1 namespace comparison. Or, if it is not adopted and Simple String
Comparison is retained, then that decision should be taken with
knowledge of, and documentation of, the 'wilfully perverse'
consequences.

One last point - the example of comparing a namespace URI in an XML
instance with a namespace URI printed in a specification. It seems to
me that there is scope for a lot of variability there, especially with
a printed version of a spec. Is that a space (perhaps forbidden) or a
non-breaking space or an ideographic space? Of course this is not a
new issue - is that a "1" or an "l", an "O" or a "0" etc.

If the hex-aware string comparison scheme were used, then an appendix
could provide an unambiguous and authoritative fully hexified form of
the namespace URI, for incorporation into software; it would match the
unhexified or partially-hexified form correctly and, since it used only
0-9 a-f and %, it would be typographically unambiguous even when
printed.
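A sketch of how such a fully hexified form might be produced, in Python. The
function name is hypothetical; the sketch assumes UTF-8 for any non-ASCII
characters and escapes everything, including the scheme and the delimiters,
which only makes sense if the result is used for comparison and never
dereferenced.

```python
from urllib.parse import unquote_to_bytes

def fully_hexified(uri):
    """Escape every octet as %xx with lower-case hex digits.

    Existing escapes are decoded first, so unhexified and partially
    hexified spellings of the same name yield the same output.
    """
    return "".join("%{:02x}".format(octet) for octet in unquote_to_bytes(uri))

name = "clilley://example.org/FooML"
printed_form = fully_hexified(name)
print(printed_form)

# The printed form decodes back to the same octets as the plain form.
print(unquote_to_bytes(printed_form) == unquote_to_bytes(name))
```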

-- 
 Chris                            mailto:chris@w3.org

Received on Monday, 2 December 2002 10:10:34 UTC