- From: Adam M. Costello BOGUS address, see signature <BOGUS@BOGUS.nicemice.net>
- Date: Tue, 2 Mar 2004 22:10:10 +0000
- To: uri@w3.org
Rob Cameron <cameron@cs.sfu.ca> wrote:

> Actually, the parsing results I get are as follows.
>
> >>> from URIbis4 import *
> >>> parseURI('foo://joe@example.com:0x3FF/blah')
> ('foo', 'joe@example.com:0', 'x3FF/blah', None, None)
>
> That is, the first-match-wins rule gives
>    authority = joe@example.com:0
>    path = x3FF/blah

Sure enough.  Oops.  I'm more convinced than ever that this URI should be invalid, because the boundary between authority and path is so well camouflaged.

> This is of interest to me and is one of the reasons that I have
> developed abnf2re.

It sounds cool.  Is it available?  Google finds only this mailing list.

"Roy T. Fielding" <fielding@gbiv.com> wrote:

> Note that the text already says a more forceful equivalent in section
> 3.3:
>
>    If a URI contains an authority component, then the initial path
>    segment must be empty (i.e., the path must begin with a slash ("/")
>    character or be entirely empty).
>
> Is that not sufficient?

I think defining the syntax of an element (or a sub-element) using a combination of a loose grammar and further restrictive prose is an invitation for misunderstanding.  The reader is likely to think the prose is merely stating a fact implied by the grammar, rather than adding rules not already expressed in the grammar.  (In this case, I thought the prose was stating an implication of the first-match-wins rule, but I was mistaken.)  I think it's safer to either define the (sub-)element entirely by prose (omit the grammar) or use a tight grammar along with prose that merely provides intuition about the self-sufficient grammar.

But I now see the motivation for the other approach.
A tight grammar is suitable for defining a syntax (that is, distinguishing between valid and invalid strings) but is inconvenient for decomposing strings into their components (because one component, like path, might match any one of various tokens, like abs-path, opaque-part, or rel-path, so you need to detect which token was matched and assign it to the component).  Conversely, a grammar convenient for decomposing a string into its components (where a given component is always matched by a unique token) will be loose (will accept invalid strings).

Perhaps specs should provide both grammars, and implementations should do something like:

    if string matches validation_regex
    then match string against decomposition_regex
    else report error

RFC-2396 provides a validation grammar, and the current draft of 2396bis provides a decomposition grammar.  I think both grammars are valuable.  If either grammar is omitted from the spec, implementors are likely to roll their own and make subtle mistakes.  If the spec omits the validation grammar, some implementations are likely to do only decomposition, not validation, causing invalid URIs to seem to work reasonably with some implementations and not others, which invites interoperability problems and makes it difficult to extend the syntax in the future (because some implementations already accept the extended syntax and do who-knows-what with it).

It's inconvenient but possible to use a validation grammar for decomposition, but it's not possible to do validation with a decomposition grammar.  Because of the importance of validation for interoperability and for extensibility of the syntax, I think if a spec provides only one grammar it should be a validation grammar, but it might be better to provide both.  Maybe the decomposition grammar need not be provided in ABNF form, only regex form, whereas the validation grammar would be provided in both forms.

AMC
http://www.nicemice.net/amc/
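
The validate-then-decompose idea above could be sketched in Python roughly as follows.  This is only an illustration under stated assumptions: both regexes are hypothetical simplifications written for this example (they are not the grammar of RFC 2396 or of the 2396bis draft, and not the output of abnf2re or the code in URIbis4), and the names DECOMPOSE, VALIDATE, and parse_uri are invented here.  The loose regex has exactly one group per component; the tight one adds the rule that a path following an authority must be empty or begin with "/".

```python
import re

# Hypothetical loose "decomposition" regex: one group per component.
# It accepts strings that a tight grammar would reject.
DECOMPOSE = re.compile(
    r"([A-Za-z][A-Za-z0-9+.-]*):"                    # scheme
    r"(?://((?:[^/?#@]*@)?[^/?#:]*(?::[0-9]*)?))?"   # authority (port is digits only)
    r"([^?#]*)"                                      # path (anything up to ? or #)
    r"(?:\?([^#]*))?"                                # query
    r"(?:#(.*))?"                                    # fragment
)

# Hypothetical tighter "validation" regex: after an authority the path
# must be empty or start with "/"; without an authority the rest must
# not start with "//".  No groups, since it only answers yes or no.
VALIDATE = re.compile(
    r"[A-Za-z][A-Za-z0-9+.-]*:"
    r"(?://(?:[^/?#@]*@)?[^/?#:]*(?::[0-9]*)?(?:/[^?#]*)?"
    r"|(?!//)[^?#]*)"
    r"(?:\?[^#]*)?"
    r"(?:#.*)?"
)


def parse_uri(s):
    """Validate first, then decompose; raise ValueError on invalid input."""
    if VALIDATE.fullmatch(s) is None:
        raise ValueError(f"invalid URI: {s!r}")
    # Every string accepted by VALIDATE is also accepted by DECOMPOSE.
    return DECOMPOSE.fullmatch(s).groups()


# Decomposition alone quietly splits the camouflaged example:
loose = DECOMPOSE.fullmatch("foo://joe@example.com:0x3FF/blah").groups()
# loose == ('foo', 'joe@example.com:0', 'x3FF/blah', None, None)
```

With these assumed regexes, parse_uri('foo://joe@example.com:0x3FF/blah') raises ValueError instead of returning the surprising split, while parse_uri('foo://joe@example.com:1023/blah') returns ('foo', 'joe@example.com:1023', '/blah', None, None).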
Received on Tuesday, 2 March 2004 17:10:13 UTC