grammar weirdness

The latest draft says:

    An ABNF-driven parser will find that the border between
    authority and path is ambiguous; they are disambiguated by the
    "first-match-wins" (a.k.a. "greedy") algorithm.  In other words,
    if authority is present then the first segment of the path must be
    empty.

The second sentence does not follow from the first.  Consider this URI:


According to the grammar, this can be parsed in either of two ways:

(1) authority =
    path = //

(2) authority =
    path = :0x3FF/blah

It cannot be parsed this way:

(3) authority =
    path = /blah

because non-digits are not allowed in the port.

The first-match-wins rule implies that the correct parsing is (2).  Note
that the first path segment is not empty, but is ":0x3FF".

The regular expression in Appendix B claims to break a well-formed URI
down into its components, but it gets this one wrong, yielding the
components in (3).
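To make the regular-expression behaviour concrete, here is a minimal
Python sketch.  It uses the expression as published in Appendix B of
RFC 2396, which I am assuming is the same one the draft carries.  The
URI "scheme://host:0x3FF/blah" and the helper name split_uri are
made-up stand-ins for illustration, not the URI discussed above.

```python
import re

# Component-splitting expression from Appendix B of RFC 2396
# (assumed here to be identical to the one in the draft).
APPENDIX_B = re.compile(
    r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?"
)


def split_uri(uri):
    """Split a URI into its five components using the Appendix B groups."""
    m = APPENDIX_B.match(uri)  # every part is optional, so this always matches
    return {
        "scheme": m.group(2),
        "authority": m.group(4),
        "path": m.group(5),
        "query": m.group(7),
        "fragment": m.group(9),
    }


# Hypothetical URI with non-digits where a port would go.
parts = split_uri("scheme://host:0x3FF/blah")
print(parts["authority"])  # host:0x3FF
print(parts["path"])       # /blah
```

The authority group is simply "everything up to the next '/', '?' or
'#'", so the expression never looks at whether the port consists of
digits.  That is why it reports a split of the kind shown in (3).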

Perhaps the grammar should be tightened up so that this URI is invalid.
Note that the RFC 2396 grammar does not accept it.

If the grammar is kept as-is, the regular expression should be fixed to
parse this URI correctly, and the statement about the first path segment
being necessarily empty should be removed.  That might have implications
for relative URI resolution...

In any case, it might be nice for the draft to provide a regular
expression that not only parses well-formed URIs, but also detects
ill-formed URIs (by failing to match them).
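As a rough sketch of what such detection might look like, the Python
below splits with the Appendix B expression from RFC 2396 and then
applies a second, stricter pattern to the authority.  The AUTHORITY
pattern and the function name authority_is_plausible are my own
assumptions, not taken from the draft; the pattern only insists that
anything after the host's ':' is made of digits, and it is nowhere near
a full validation of the grammar.

```python
import re

# Appendix B splitting expression from RFC 2396.
SPLIT = re.compile(
    r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?"
)

# Hypothetical stricter authority pattern (an assumption, not from the draft):
#   optional userinfo "@", then a bracketed literal or a host without ':',
#   then an optional ":" followed only by digits.
AUTHORITY = re.compile(
    r"(?:[^@/?#]*@)?(?:\[[^\]]*\]|[^:@/?#\[\]]*)(?::[0-9]*)?"
)


def authority_is_plausible(uri):
    """Return False when the authority has non-digits in the port position."""
    authority = SPLIT.match(uri).group(4)
    if authority is None:
        return True  # no authority component at all, nothing to check
    return AUTHORITY.fullmatch(authority) is not None


print(authority_is_plausible("scheme://host:1023/blah"))   # True
print(authority_is_plausible("scheme://host:0x3FF/blah"))  # False
```

A single expression that rejects ill-formed URIs outright would have
to fold this kind of check (and many others) into the splitting
pattern itself; the two-step version here is only meant to show the
idea.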


Received on Tuesday, 2 March 2004 07:00:09 UTC