Re: A proposed solution from Henrik Frystyk Nielsen on 2000-06-18 (xml-uri@w3.org from June 2000)

From: Henrik Frystyk Nielsen <frystyk@microsoft.com>
Date: Sat, 17 Jun 2000 21:41:47 -0700
To: "James Clark" <jjc@jclark.com>
Cc: "David Turner" <dturner@microsoft.com>, <XML-uri@w3.org>, "Andrew Layman" <andrewl@microsoft.com>
Message-ID: <009301bfd8df$85ef6ab0$83b11eac@redmond.corp.microsoft.com>
From: "James Clark" <jjc@jclark.com>
Sent: Thursday, June 15, 2000 01:08
> I can't see anything in RFC 2396 that defines when two URI references
> are equivalent.  Perhaps you could point me to the section of RFC 2396
> that does this.  The reason the namespaces spec explicitly states the
> equivalence rules for namespace names is because RFC 2396 doesn't. For
> example, do you absolutize before octet-by-octet comparison or not? If
> you have a URI scheme that uses a server-based naming authority (section
> 3.2.2), and two URLs use hostnames that are octet-for-octet distinct but
> resolve to the same IP address, are the URLs equivalent.  I don't see
> anything in RFC 2396 that provides answers to questions like these.  I
> believe it's up to the namespaces spec to specify what the rules are.

You bring up a good point. For historical reasons, the comparison
algorithm is given in the HTTP/1.1 spec, section 3.2.3 [1], which
says:

*****

When comparing two URIs to decide if they match or not, a client SHOULD
use a case-sensitive octet-by-octet comparison of the entire URIs, with
these exceptions:

        - A port that is empty or not given is equivalent to the default
          port for that URI-reference;

        - Comparisons of host names MUST be case-insensitive;

        - Comparisons of scheme names MUST be case-insensitive;

        - An empty abs_path is equivalent to an abs_path of "/".

Characters other than those in the "reserved" and "unsafe" sets (see RFC
2396 [42]) are equivalent to their ""%" HEX HEX" encoding.

For example, the following three URIs are equivalent:

      http://abc.com:80/~smith/home.html
      http://ABC.com/%7Esmith/home.html
      http://ABC.com:/%7esmith/home.html

*****

The reason is that HTTP caching depends on being able to compare URIs,
and it wasn't clear whether RFC 2396 would move forward in time for
HTTP to move forward, so the algorithm was put in the HTTP spec. I have
no problem with it being moved to the URI spec, but I think moving it to
the namespace spec makes no more sense than having it in the HTTP
spec.
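
A minimal Python sketch of the quoted comparison rules, for
illustration only. The helper names (normalize, http_uris_match) and
the DEFAULT_PORTS table are assumptions made for this sketch, and the
set of characters "other than reserved and unsafe" is approximated by
the RFC 2396 "unreserved" set. Userinfo and IPv6 literals are ignored.

```python
import re
import string
from urllib.parse import urlsplit

# Assumption: only the default ports needed for this illustration.
DEFAULT_PORTS = {"http": 80, "https": 443}

# Assumption: approximate "not reserved, not unsafe" by RFC 2396 "unreserved".
UNRESERVED = set(string.ascii_letters + string.digits + "-_.!~*'()")


def _unescape_unreserved(text):
    """Replace %XX with its character when that character is unreserved."""
    def repl(match):
        ch = chr(int(match.group(1), 16))
        return ch if ch in UNRESERVED else match.group(0)
    return re.sub(r"%([0-9A-Fa-f]{2})", repl, text)


def normalize(uri):
    """Reduce an absolute URI to a tuple that can be compared directly."""
    parts = urlsplit(uri)
    scheme = parts.scheme.lower()                 # scheme: case-insensitive
    host, _, port_text = parts.netloc.partition(":")
    host = host.lower()                           # host: case-insensitive
    # An empty or missing port is equivalent to the scheme's default port.
    port = int(port_text) if port_text else DEFAULT_PORTS.get(scheme)
    # An empty abs_path is equivalent to "/".
    path = _unescape_unreserved(parts.path) or "/"
    query = _unescape_unreserved(parts.query)
    return (scheme, host, port, path, query)


def http_uris_match(a, b):
    """Compare two absolute URIs under the rules quoted above."""
    return normalize(a) == normalize(b)
```

With this sketch, the three example URIs from the quoted text compare
as equal, while paths that differ only in case do not.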

However, I think the wording is entirely consistent with the algorithm
that I wrote, because HTTP already expects you to have made the URI
absolute:

1) At the basic level you compare in an octet-by-octet manner, taking
into account the context in which any relative URIs are defined.

2) The URI spec defines a set of common syntax equivalence rules for the
hostname, the default port number, etc., but I wouldn't bet that
applications get those consistently right.

3) A URI scheme may define further normalization rules that can have an
impact on how URIs are defined. However, as you can never expect that a
URI parser knows about the specific scheme you use, there is no
guarantee that those normalization rules are followed.
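
The three levels above can be sketched roughly as follows in Python.
This is one possible reading, not a definitive one: it assumes the
context is available as a known base URI (which need not always be the
case), and the names generic_normalize, scheme_normalize,
SCHEME_NORMALIZERS and equivalent are hypothetical. The generic step
simply lowercases the whole authority, which is a simplification.

```python
from urllib.parse import urljoin, urlsplit, urlunsplit


def generic_normalize(uri):
    """Level 2: generic syntax rules (here only scheme and host case)."""
    p = urlsplit(uri)
    return urlunsplit((p.scheme.lower(), p.netloc.lower(),
                       p.path, p.query, p.fragment))


def _http_rules(uri):
    """Level 3 example: http default port and empty path."""
    p = urlsplit(uri)
    if p.netloc.endswith(":80"):
        netloc = p.netloc[:-3]
    else:
        netloc = p.netloc.rstrip(":")
    return urlunsplit((p.scheme, netloc, p.path or "/", p.query, p.fragment))


# Assumption: a registry of scheme-specific rules; most schemes are unknown.
SCHEME_NORMALIZERS = {"http": _http_rules}


def scheme_normalize(uri):
    """Apply scheme rules if known; otherwise leave the URI untouched."""
    rule = SCHEME_NORMALIZERS.get(urlsplit(uri).scheme.lower())
    return rule(uri) if rule else uri


def equivalent(ref_a, base_a, ref_b, base_b, level=1):
    """Compare two references, each resolved in its own context."""
    a, b = urljoin(base_a, ref_a), urljoin(base_b, ref_b)   # level 1
    if level >= 2:
        a, b = generic_normalize(a), generic_normalize(b)
    if level >= 3:
        a, b = scheme_normalize(a), scheme_normalize(b)
    return a == b
```

A parser that stops at level 1 or 2 gives a stricter answer than one
that knows the scheme, which is the inconsistency point 3 warns about.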

>>>> Relative URIs are always defined within a context.
>>>
>>> I thought they were defined relative to a base URI.  Is a context the
>>> same thing as a base URI?
>>
>> The reason for using the term "context" instead of "base URI" is to make
>> it clear that relative URIs in fact can be used within a constrained
>> context without actually knowing or using the base URI.
>
> That's not what RFC 2396 says. From section 5.1:
>
>    The term "relative URI" implies that there exists some absolute "base
>    URI" against which the relative reference is applied.  Indeed, the
>    base URI is necessary to define the semantics of any relative URI
>    reference; without it, a relative reference is meaningless.

I would say that this refers to the *semantics* of a URI. It doesn't
address whether, *within* a document, you can use relative references
"foo" and "../foo" as unique identifiers within the context without
knowing or using the base URI of the document.

>>> If multiple levels of hierarchy count as the same context, then this
>>> proposal does not solve the problem. Suppose I have a document
>>> http://www.w3.org/a/b referencing an entity c/d which absolutizes to
>>> http://www.w3.org/a/c/d.  If these have the same context, then a
>>> namespace URI "foo" in the document will be treated as equal to a
>>> namespace URI "foo" in the referenced entity despite the fact that it
>>> refers to a different resource after URI absolutization.
>>
>> The examples refer to examples of relative URIs - not contexts.
>
> It's an example of two different base URIs within a single hierarchy of
> documents.  The document has a base URI of http://www.w3.org/a/b, and
> the referenced entity has a base URI of http://www.w3.org/a/c/d.  Does
> the document have the same context as the referenced entity or not?

Nope, otherwise, as you say, it would cause problems with things that
would appear to be the same but would turn out to be different.
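
The example under discussion can be worked through with Python's
urllib.parse.urljoin, assuming that "absolutization" means ordinary
relative-reference resolution. The helper name resolve_namespace is
hypothetical.

```python
from urllib.parse import urljoin


def resolve_namespace(base_uri, namespace_name):
    """Absolutize a (possibly relative) namespace name against a base URI."""
    return urljoin(base_uri, namespace_name)


document_base = "http://www.w3.org/a/b"
# The entity is referenced as "c/d" from the document.
entity_base = resolve_namespace(document_base, "c/d")

ns_in_document = resolve_namespace(document_base, "foo")
ns_in_entity = resolve_namespace(entity_base, "foo")

print(entity_base)      # http://www.w3.org/a/c/d
print(ns_in_document)   # http://www.w3.org/a/foo
print(ns_in_entity)     # http://www.w3.org/a/c/foo
```

The same string "foo" resolves to two different absolute URIs, so the
document and the referenced entity cannot share one context.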

>>> A. cases where namespace names are identical but the corresponding
>>> resources are not
>>>
>>> B. cases where namespace names are not identical but the corresponding
>>> resources are
>>>
>>> Now type B cases are relatively harmless and an unavoidable fact of
>>> life, but type A cases are (to some of us anyway) unacceptable. The
>>> Microsoft proposal appears to be getting rid of type A mismatches by
>>> accepting additional type B mismatches.
>>
>> Case A is definitely evil and yes, is avoided by our proposal. I don't
>> see why that would lead to more type B mismatches though. I would expect
>> it to stay the same.
>
> Suppose you have namespace names "a" and "./a".  These refer to the same
> resource, but are not character-for-character identical.  However, if
> you absolutize them relative to the same base URI, then they will
> resolve to the same URI.

I see: if they were mere strings rather than URIs, they would never
(potentially) be seen as being the same.

> The other case arises when you compare namespace names with different
> contexts. Suppose you have a namespace name "a" in a document with base
> URI "http://www.w3.org/" and a namespace name "../a" in a document with
> base URI "http://www.w3.org/2000/".   There needs to be a clear answer
> as to whether these are to be treated as identical or not.  If the
> proposal is that they are not identical because they have different base
> URIs and so different contexts, then this is an additional type B case.

Depending on the smarts of the application, these may or may not be
treated as the same, but they will never result in a type A error. I
agree that there is an infinite number of type B cases, which can occur
at any level: it is always possible to discover that two things are,
under certain conditions, the same.
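
The two kinds of mismatch can be made concrete with a small Python
sketch. It assumes that "same resource" can be approximated by
"identical URI after absolutization", which is itself only one of many
possible equivalence tests; the function name classify is hypothetical.

```python
from urllib.parse import urljoin


def classify(name_a, base_a, name_b, base_b):
    """Classify a pair of namespace names compared as plain strings."""
    same_name = name_a == name_b
    # Assumption: resources are "the same" if the absolutized URIs match.
    same_resource = urljoin(base_a, name_a) == urljoin(base_b, name_b)
    if same_name and not same_resource:
        return "type A"      # identical names, different resources
    if not same_name and same_resource:
        return "type B"      # different names, same resource
    return "consistent"


# "a" and "./a" against the same base URI
print(classify("a", "http://www.w3.org/", "./a", "http://www.w3.org/"))
# "a" and "../a" against different base URIs
print(classify("a", "http://www.w3.org/", "../a", "http://www.w3.org/2000/"))
# "foo" in a document and in an entity with a different base URI
print(classify("foo", "http://www.w3.org/a/b", "foo", "http://www.w3.org/a/c/d"))
```

The first two pairs come out as type B and the third as type A, which
is the case the proposal avoids by treating the two base URIs as
different contexts.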

My point is that infinite+1 = infinite.

Henrik

[1] http://www.w3.org/Protocols/rfc2616/rfc2616-sec3.html#sec3.2.3
Received on Sunday, 18 June 2000 00:42:24 UTC