- From: Rob Cameron <cameron@cs.sfu.ca>
- Date: Tue, 2 Mar 2004 07:07:35 -0800 (PST)
- To: uri@w3.org
Adam Costello's example is an interesting one. I just sat down this morning to apply abnf2re to automatically generate parsing expressions from the bis4 grammar. > The latest draft says: > > An ABNF-driven parser will find that the border between > authority and path is ambiguous; they are disambiguated by the > "first-match-wins" (a.k.a. "greedy") algorithm. In other words, > if authority is present then the first segment of the path must be > empty. > > The second sentence does not follow from the first. Consider this URI: > > foo://joe@example.com:0x3FF/blah > > According to the grammar, this can be parsed in either of two ways: > > (1) authority = > path = //joe@example.com:0x3FF/blah > > (2) authority = joe@example.com > path = :0x3FF/blah > > It cannot be parsed this way: > > (3) authority = joe@example.com:0x3FF > path = /blah > > because non-digits are not allowed in the port. > > The first-match-wins rule implies that the correct parsing is (2). Note > that the first path segment is not empty, but is ":0x3FF". > Actually, the parsing results I get are as follows. >>> from URIbis4 import * >>> parseURI('foo://joe@example.com:0x3FF/blah') ('foo', 'joe@example.com:0', 'x3FF/blah', None, None) That is, the first-match-wins rule gives authority = joe@example.com:0 path = x3FF/blah Previous grammars produce a different kind of anomaly. >>> from URIbis3 import * >>> parseURI('foo://joe@example.com:0x3FF/blah') ('foo', None, '//joe@example.com:0x3FF/blah', None, None) Perhaps the following text is appropriate. "An ABNF-driven parser will find that the border between authority and path is ambiguous; they are disambiguated by the "first-match-wins" (a.k.a. "greedy") algorithm. This produces correct results whenever the authority is absent or the first segment of the path is empty. Although the grammar permits a nonempty path in the presence of an authority component, the URI is considered ill-formed in this case." > The regular expression in appendix B claims to break a well-formed URI > down into its components, but it gets this one wrong, yielding the > components in (3). > > Perhaps the grammar should be tightened up so that this URI is invalid. > Note that the RFC-2396 grammar does not accept it. > > If the grammar is kept as-is, the regular expression should be fixed to > parse this URI correctly, and the statement about the first path segment > being necessarily empty should be removed. That might have implications > for relative URI resolution... > > In any case, it might be nice for the draft to provide a regular > expression that not only parses well-formed URIs, but also detects > ill-formed URIs (by failing to match them). > This is of interest to me and is one of the reasons that I have developed abnf2re. The goal is to provide regular expressions that correspond exactly to the ABNF syntax within specification documents.
Received on Tuesday, 2 March 2004 10:07:39 UTC