- From: Rob Cameron <cameron@cs.sfu.ca>
- Date: Tue, 2 Mar 2004 07:07:35 -0800 (PST)
- To: uri@w3.org
Adam Costello's example is an interesting one. I just
sat down this morning to apply abnf2re to automatically
generate parsing expressions from the bis4 grammar.
> The latest draft says:
>
> An ABNF-driven parser will find that the border between
> authority and path is ambiguous; they are disambiguated by the
> "first-match-wins" (a.k.a. "greedy") algorithm. In other words,
> if authority is present then the first segment of the path must be
> empty.
>
> The second sentence does not follow from the first. Consider this URI:
>
> foo://joe@example.com:0x3FF/blah
>
> According to the grammar, this can be parsed in either of two ways:
>
> (1) authority =
> path = //joe@example.com:0x3FF/blah
>
> (2) authority = joe@example.com
> path = :0x3FF/blah
>
> It cannot be parsed this way:
>
> (3) authority = joe@example.com:0x3FF
> path = /blah
>
> because non-digits are not allowed in the port.
>
> The first-match-wins rule implies that the correct parsing is (2). Note
> that the first path segment is not empty, but is ":0x3FF".
>
Actually, the parsing results I get are as follows.
>>> from URIbis4 import *
>>> parseURI('foo://joe@example.com:0x3FF/blah')
('foo', 'joe@example.com:0', 'x3FF/blah', None, None)
That is, the first-match-wins rule gives
authority = joe@example.com:0
path = x3FF/blah
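The same split can be reproduced by hand with a small greedy regular expression. This is only a sketch, not the expression that abnf2re generates: it assumes a simplified reading of the grammar in which the port is a run of digits and the path is whatever follows the authority. The names URI_SKETCH and parse_uri_sketch are hypothetical.

```python
import re

# Simplified, hand-written approximation (an assumption, not abnf2re output):
#   authority = [ userinfo "@" ] host [ ":" *DIGIT ]
#   path      = everything up to "?" or "#"
# Each piece is greedy, so the port swallows the leading "0" of "0x3FF"
# and stops at the first non-digit.
URI_SKETCH = re.compile(
    r"^([^:/?#]+):"                                   # scheme
    r"(?://((?:[^@/?#]*@)?[^:/?#@]*(?::[0-9]*)?))?"   # optional authority
    r"([^?#]*)"                                       # path
    r"(?:\?([^#]*))?"                                 # optional query
    r"(?:#(.*))?$"                                    # optional fragment
)


def parse_uri_sketch(uri):
    """Return (scheme, authority, path, query, fragment), or None if no match."""
    m = URI_SKETCH.match(uri)
    return m.groups() if m else None


print(parse_uri_sketch("foo://joe@example.com:0x3FF/blah"))
# ('foo', 'joe@example.com:0', 'x3FF/blah', None, None)
```

Because nothing after the port forces the engine to backtrack, the first match it finds is the one reported, which is the first-match-wins behaviour described above.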
The previous grammar (bis3) produces a different kind of anomaly:
no authority is recognized, and everything after the scheme becomes the path.
>>> from URIbis3 import *
>>> parseURI('foo://joe@example.com:0x3FF/blah')
('foo', None, '//joe@example.com:0x3FF/blah', None, None)
Perhaps the following text is appropriate.
"An ABNF-driven parser will find that the border between
authority and path is ambiguous; they are disambiguated by the
"first-match-wins" (a.k.a. "greedy") algorithm. This produces
correct results whenever the authority is absent or the first
segment of the path is empty. Although the grammar permits
a nonempty first path segment in the presence of an authority
component, the URI is considered ill-formed in this case."
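A check of this kind could be applied to the parser's output after the fact. The following is a minimal sketch, assuming the parser returns the authority as None when it is absent (as in the transcripts above) and that "first segment empty" means the path is either empty or begins with "/". The function name is hypothetical.

```python
def violates_authority_path_rule(authority, path):
    """True if an authority is present but the first path segment is nonempty."""
    if authority is None:
        # No authority: the rule places no constraint on the path.
        return False
    # With an authority, the path must be empty or start with "/",
    # i.e. its first segment must be empty.
    return path != "" and not path.startswith("/")


print(violates_authority_path_rule("joe@example.com:0", "x3FF/blah"))  # True
print(violates_authority_path_rule("joe@example.com", "/blah"))        # False
```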
> The regular expression in appendix B claims to break a well-formed URI
> down into its components, but it gets this one wrong, yielding the
> components in (3).
>
> Perhaps the grammar should be tightened up so that this URI is invalid.
> Note that the RFC-2396 grammar does not accept it.
>
> If the grammar is kept as-is, the regular expression should be fixed to
> parse this URI correctly, and the statement about the first path segment
> being necessarily empty should be removed. That might have implications
> for relative URI resolution...
>
> In any case, it might be nice for the draft to provide a regular
> expression that not only parses well-formed URIs, but also detects
> ill-formed URIs (by failing to match them).
>
This is of interest to me and is one of the reasons that I have
developed abnf2re. The goal is to provide regular expressions
that correspond exactly to the ABNF syntax within specification
documents.
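By contrast, the appendix B expression only splits a string into components and does not enforce the ABNF. Adam's observation is easy to reproduce with a short sketch. It uses the component-splitting expression from appendix B of RFC 2396, which I am assuming the draft carries over unchanged; the helper name split_appendix_b is hypothetical.

```python
import re

# Component-splitting expression from RFC 2396, appendix B.
# Assumption: the draft under discussion uses the same expression.
APPENDIX_B = re.compile(
    r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?"
)


def split_appendix_b(uri):
    """Return (scheme, authority, path, query, fragment) per appendix B."""
    m = APPENDIX_B.match(uri)
    # Groups 2, 4, 5, 7 and 9 hold the five components.
    return m.group(2), m.group(4), m.group(5), m.group(7), m.group(9)


print(split_appendix_b("foo://joe@example.com:0x3FF/blah"))
# ('foo', 'joe@example.com:0x3FF', '/blah', None, None)
```

The authority group accepts anything up to the next "/", "?" or "#", so the non-digit port goes unnoticed and the result is parsing (3).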
Received on Tuesday, 2 March 2004 10:07:39 UTC