Re: grammar weirdness from Rob Cameron on 2004-03-02 (uri@w3.org from March 2004)

From: Rob Cameron <cameron@cs.sfu.ca>
Date: Tue, 2 Mar 2004 07:07:35 -0800 (PST)
To: uri@w3.org
Message-Id: <200403021507.i22F7Zf27397@orpheus.cs.sfu.ca>
Adam Costello's example is an interesting one.  I just
sat down this morning to apply abnf2re to automatically
generate parsing expressions from the bis4 grammar.

> The latest draft says:
> 
>     An ABNF-driven parser will find that the border between
>     authority and path is ambiguous; they are disambiguated by the
>     "first-match-wins" (a.k.a. "greedy") algorithm.  In other words,
>     if authority is present then the first segment of the path must be
>     empty.
> 
> The second sentence does not follow from the first.  Consider this URI:
> 
>     foo://joe@example.com:0x3FF/blah
> 
> According to the grammar, this can be parsed in either of two ways:
> 
> (1) authority =
>     path = //joe@example.com:0x3FF/blah
> 
> (2) authority = joe@example.com
>     path = :0x3FF/blah
> 
> It cannot be parsed this way:
> 
> (3) authority = joe@example.com:0x3FF
>     path = /blah
> 
> because non-digits are not allowed in the port.
> 
> The first-match-wins rule implies that the correct parsing is (2).  Note
> that the first path segment is not empty, but is ":0x3FF".
> 

Actually, the parsing results I get are as follows.

>>> from URIbis4 import *
>>> parseURI('foo://joe@example.com:0x3FF/blah')
('foo', 'joe@example.com:0', 'x3FF/blah', None, None)

That is, the first-match-wins rule gives
   authority = joe@example.com:0
   path = x3FF/blah

Previous grammars produce a different kind of anomaly.
>>> from URIbis3 import *
>>> parseURI('foo://joe@example.com:0x3FF/blah')
('foo', None, '//joe@example.com:0x3FF/blah', None, None)

Perhaps the following text is appropriate.

"An ABNF-driven parser will find that the border between
authority and path is ambiguous; they are disambiguated by the
"first-match-wins" (a.k.a. "greedy") algorithm.  This produces
correct results whenever the authority is absent or the first
segment of the path is empty.   Although the grammar permits
a nonempty path in the presence of an authority component,
the URI is considered ill-formed in this case."

> The regular expression in appendix B claims to break a well-formed URI
> down into its components, but it gets this one wrong, yielding the
> components in (3).
> 
> Perhaps the grammar should be tightened up so that this URI is invalid.
> Note that the RFC-2396 grammar does not accept it.
> 
> If the grammar is kept as-is, the regular expression should be fixed to
> parse this URI correctly, and the statement about the first path segment
> being necessarily empty should be removed.  That might have implications
> for relative URI resolution...
> 
> In any case, it might be nice for the draft to provide a regular
> expression that not only parses well-formed URIs, but also detects
> ill-formed URIs (by failing to match them).
> 

This is of interest to me and is one of the reasons that I have
developed abnf2re.  The goal is to provide regular expressions
that correspond exactly to the ABNF syntax within specification
documents.
Received on Tuesday, 2 March 2004 10:07:39 UTC