authority sub-components

I just noticed a philosophical inconsistency in the current draft's
definition of the authority component.  I also found a bug, which I'll
discuss later in this message.  Consider the URI:


The RFC-1738 view was that there was no doubt what was what:

    joe is a user name
    abc is a password
    the host part is an IPv4 address
    56789 is a port number

The RFC-2396 view is that there are two possibilities, which you cannot
distinguish unless you recognize the foo: scheme.  If foo: uses a
server-based authority, then the above types apply.  If foo: uses a
registry-based authority, then

    joe:abc@ is a reg_name
    (there is no user, password, IPv4 address, or port number)

[Aside:  I'll have to retract my recent claim that a validation grammar
can be used for decomposition.  A validation grammar can be ambiguous,
and indeed the RFC-2396 grammar is ambiguous here.  A decomposition
grammar cannot be ambiguous, but can be overly permissive.]
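
The two readings can be put side by side in a rough sketch.  Everything
here is illustrative: the function names, the server_based flag, and the
sample address 192.0.2.1 are assumptions made for the example, not taken
from any RFC or draft, and IP-literals in brackets are not handled.

```python
def split_server_authority(authority):
    """RFC-1738-style reading: [user[:password]@]host[:port]."""
    userinfo, at, hostport = authority.rpartition("@")
    user, colon, password = userinfo.partition(":")
    host, _, port = hostport.partition(":")
    return {
        "user": user if at else None,
        "password": password if (at and colon) else None,
        "host": host,
        "port": port or None,
    }


def split_authority(authority, server_based):
    """RFC-2396-style reading: the result depends on what the scheme says.

    server_based is a hypothetical flag standing in for "the parser
    recognizes the scheme and knows which kind of authority it uses".
    """
    if server_based:
        return split_server_authority(authority)
    # Registry-based: the whole authority is one opaque reg_name.
    return {"reg_name": authority}


sample = "joe:abc@192.0.2.1:56789"  # hypothetical authority string
as_server = split_authority(sample, server_based=True)
as_registry = split_authority(sample, server_based=False)
```

The same string yields four typed parts under one reading and a single
opaque name under the other, which is why a parser that does not
recognize foo: cannot decompose it under RFC-2396.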

The current draft of 2396bis seems to be heading back in the direction
of RFC-1738 for the authority component.  It has eliminated @ and : from
the set of characters allowed in a reg-name, and pushed reg-name down
to be a sub-component of authority, rather than the entire authority.
According to the current draft, even if we don't recognize the foo:
scheme, we can be sure that

    joe:abc is userinfo
    the host part is an IPv4 address
    56789 is a port number

So far, 2396bis is following the RFC-1738 model of knowing what's
what without having to recognize the scheme.  But for the reg-name
subcomponent, it's back to the RFC-2396 model of needing to recognize
the scheme in order to know what it is.


    joe:abc is userinfo
    56789 is a port number
    the host part is a reg-name,
      which might or might not be a hostname depending on foo:

If 2396bis is willing to say that a dotted-quad host is always an IPv4 address
and never a reg-name, it must place a pretty high value on the ability
to distinguish an IP address without having to recognize the scheme.
Hostnames are an internet data type as fundamental as IP addresses, so
it ought to be equally valuable to be able to distinguish a hostname
without having to recognize the scheme.  Why is 2396bis willing to
reserve part of the potential reg-name space for IPv4 addresses but not
for hostnames?

Now for the bug...  Section 2.3 says:

    URIs that differ in the replacement of an unreserved character
    with its corresponding percent-encoded octet are equivalent: they
    identify the same resource.

    For consistency, percent-encoded octets in the ranges of ALPHA
    (%41-%5A and %61-%7A), DIGIT (%30-%39), hyphen (%2D), period (%2E),
    underscore (%5F), or tilde (%7E) should not be created by URI
    producers and, when found in a URI, should be decoded to their
    corresponding unreserved character by URI normalizers.

But consider the URI:


The authority is a reg-name, not an IP address, according to the ABNF
grammar and the first-match-wins rule (and the text "If host matches
the rule for IPv4address, then it should be considered an IPv4 address
literal and not a reg-name.").

The %31 encodes an unreserved character, and section 2.3 says that
decoding it will result in an equivalent URI, and that such decoding
should be done by URI normalizers.  The result, however, is:


The authority here is an IPv4 address, not a reg-name.  These are
different data types with different semantics.  The reg-name refers to
an entry in a scheme-specific name registry, while the IPv4 address
refers to an IP network interface.
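
The type change can be reproduced with a small sketch.  The host string
192.0.2.%31 is a made-up example, and classify_host and
decode_unreserved are hypothetical helpers that only approximate the
draft's rules (no IP-literal handling, and reg-name is simply "anything
that is not a dotted quad").

```python
import re

# Dotted quad with no percent-encoding allowed, as in the draft's IPv4address.
IPV4 = re.compile(r"(?:[0-9]{1,3}\.){3}[0-9]{1,3}")
UNRESERVED = re.compile(r"[A-Za-z0-9._~-]")


def classify_host(host):
    """First-match-wins: IPv4address if it matches, otherwise reg-name."""
    if IPV4.fullmatch(host) and all(int(p) <= 255 for p in host.split(".")):
        return "IPv4address"
    return "reg-name"


def decode_unreserved(text):
    """Decode only those percent-encoded octets that map to unreserved chars."""
    def repl(match):
        ch = chr(int(match.group(1), 16))
        return ch if UNRESERVED.fullmatch(ch) else match.group(0)
    return re.sub(r"%([0-9A-Fa-f]{2})", repl, text)


before = "192.0.2.%31"               # hypothetical host with an encoded "1"
after = decode_unreserved(before)    # what a normalizer would produce

kind_before = classify_host(before)
kind_after = classify_host(after)
```

Here kind_before is "reg-name" and kind_after is "IPv4address", even
though section 2.3 calls the two URIs equivalent.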

One way to fix this bug would be to allow percent-encoding within
IPv4address, but that would not be compatible with deployed software.

Another way to fix the bug would be to go back more toward the RFC-2396
model (but not all the way back):

    authority = [ userinfo "@" ] coordinator [ ":" port ]
    coordinator = reg-name / host ; first match does NOT win, the scheme
                                  ; determines which alternative applies
    host = IP-literal / IPv4address / hostname

This way, the percent-encoded form and the decoded form are both equally
indeterminate.  Both might or might not be reg-names, depending on the
scheme.  Rewriting the encoded form as the decoded form is safe, because
if the encoded form is valid, then foo: must be a scheme that uses
reg-name coordinators rather than host coordinators, and therefore the
decoded form is still a reg-name.

Another way to fix this bug would be to make IPv4address and reg-name
distinguishable without having to resort to the first-match-wins rule.
For example:

    host = IP-literal / IPv4address / hostname / "-" reg-name

There would be no overlap between any of these four tokens, no need to
invoke the first-match-wins rule, and both IP addresses and hostnames
would be recognizable without needing to recognize the scheme.  The
downside is that this would invalidate things that were valid under
RFC-2396.  2396bis already invalidates some things that were valid
under RFC-2396, like foo://a@b@c/, but this would invalidate almost all
non-hostname-syntax names.  Do there currently exist URI schemes that
use non-hostname-syntax names in the authority component?
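
As a rough illustration of the non-overlap claim, here is a sketch of a
classifier for the four alternatives.  It assumes the RFC-2396 shape of
hostname (the last label starts with a letter), skips range checks on
the IPv4 octets, and does not validate the inside of an IP-literal or a
reg-name; classify_host is a hypothetical name.

```python
import re

IPV4 = re.compile(r"(?:[0-9]{1,3}\.){3}[0-9]{1,3}")
# Labels per the RFC-2396 hostname rule: alphanumeric at both ends,
# hyphens allowed inside; the top label must start with a letter.
LABEL = r"[A-Za-z0-9](?:[A-Za-z0-9-]*[A-Za-z0-9])?"
TOPLABEL = r"[A-Za-z](?:[A-Za-z0-9-]*[A-Za-z0-9])?"
HOSTNAME = re.compile(rf"(?:{LABEL}\.)*{TOPLABEL}\.?")


def classify_host(host):
    """Classify a host without consulting the scheme.

    The four branches cannot overlap: only an IP-literal starts with "[",
    only the prefixed reg-name starts with "-", an IPv4address ends in a
    digit-initial label, and a hostname ends in a letter-initial label.
    """
    if host.startswith("["):
        return "IP-literal"
    if host.startswith("-"):
        return "reg-name"
    if IPV4.fullmatch(host):
        return "IPv4address"
    if HOSTNAME.fullmatch(host):
        return "hostname"
    return "invalid"
```

Because the branches are disjoint, the order of the checks does not
matter, and a string such as 192.0.2.%31 (a made-up example) is simply
rejected instead of silently becoming a reg-name.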

By the way, I wonder if the token name "reg-name" is misleading.  The
2396bis reg-name is quite distinct from the RFC-2396 reg_name.  It
occupies a different position in the grammar and has a different
meaning.  The RFC-2396 reg_name was a kind of naming authority that is
not a server (and is not a host, and never has users or ports), while
the 2396bis reg-name is a kind of host (which could have users and
ports).  These are quite different concepts.


Received on Thursday, 4 March 2004 04:01:22 UTC