Re: grammar weirdness

Rob Cameron <cameron@cs.sfu.ca> wrote:

> Actually, the parsing results I get are as follows.
> 
> >>> from URIbis4 import *
> >>> parseURI('foo://joe@example.com:0x3FF/blah')
> ('foo', 'joe@example.com:0', 'x3FF/blah', None, None)
> 
> That is, the first-match-wins rule gives
>    authority = joe@example.com:0
>    path = x3FF/blah

Sure enough.  Oops.

I'm more convinced than ever that this URI should be invalid, because
the boundary between authority and path is so well camouflaged.
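
For anyone who wants to reproduce the effect without URIbis4 (whose
source is not shown in this thread), here is a minimal sketch.  The
pattern LOOSE_URI and the function loose_parse are my own hypothetical
names, and the regex is only an assumption about what a loose,
first-match-wins decomposition might look like: the port is a run of
digits, and the path is whatever follows.

```python
import re

# Hypothetical loose pattern (an assumption, not the URIbis4 source).
# It only handles the "scheme://authority..." form.
LOOSE_URI = re.compile(
    r"([A-Za-z][A-Za-z0-9+.-]*)://"           # scheme
    r"((?:[^@/?#]*@)?[^:/?#]*(?::[0-9]*)?)"   # authority: [userinfo@]host[:port]
    r"([^?#]*)"                               # path: anything up to '?' or '#'
    r"(?:\?([^#]*))?"                         # optional query
    r"(?:#(.*))?"                             # optional fragment
)

def loose_parse(uri):
    """Return (scheme, authority, path, query, fragment), or None."""
    m = LOOSE_URI.fullmatch(uri)
    return m.groups() if m else None

print(loose_parse("foo://joe@example.com:0x3FF/blah"))
# prints ('foo', 'joe@example.com:0', 'x3FF/blah', None, None)
```

The port stops at the first non-digit, and nothing forces the path to
begin with a slash, so the boundary lands between "0" and "x".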

> This is of interest to me and is one of the reasons that I have
> developed abnf2re.

It sounds cool.  Is it available?  Google finds only this mailing list.

"Roy T. Fielding" <fielding@gbiv.com> wrote:

> Note that the text already says a more forceful equivalent in section
> 3.3:
>
>    If a URI contains an authority component, then the initial path
>    segment must be empty (i.e., the path must begin with a slash ("/")
>    character or be entirely empty).
>
> Is that not sufficient?

I think defining the syntax of an element (or a sub-element) using a
combination of a loose grammar and further restrictive prose is an
invitation for misunderstanding.  The reader is likely to think the
prose is merely stating a fact implied by the grammar, rather than
adding rules not already expressed in the grammar.  (In this case, I
thought the prose was stating an implication of the first-match-wins
rule, but I was mistaken.)  I think it's safer to either define the
(sub-)element entirely by prose (omit the grammar) or use a tight
grammar along with prose that merely provides intuition about the
self-sufficient grammar.

But I now see the motivation for the other approach.  A tight grammar is
suitable for defining a syntax (that is, distinguishing between valid
and invalid strings) but is inconvenient for decomposing strings into
their components (because one component, like path, might match any
one of various tokens, like abs-path, opaque-part, or rel-path, so you
need to detect which token was matched and assign it to the component).
Conversely, a grammar convenient for decomposing a string into its
components (where a given component is always matched by a unique token)
will be loose (will accept invalid strings).  Perhaps specs should
provide both grammars, and implementations should do something like:

    if string matches validation_regex
    then match string against decomposition_regex
    else report error
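
In Python, that might be sketched roughly as follows.  Both patterns
are simplified stand-ins that I made up for illustration (they cover
only the "scheme://" form and are not the RFC-2396 or 2396bis
grammars); the names VALIDATION_RE, DECOMPOSITION_RE, and parse_uri
are hypothetical.

```python
import re

# Simplified, hypothetical patterns; neither is taken from a spec.
# Tight: after the authority, the path must be empty or start with "/".
VALIDATION_RE = re.compile(
    r"[A-Za-z][A-Za-z0-9+.-]*://"
    r"(?:[^@/?#]*@)?[^:/?#]*(?::[0-9]*)?"
    r"(?:/[^?#]*)?"
    r"(?:\?[^#]*)?"
    r"(?:#.*)?"
)

# Loose: one capturing group per component, path unconstrained.
DECOMPOSITION_RE = re.compile(
    r"([A-Za-z][A-Za-z0-9+.-]*)://"
    r"((?:[^@/?#]*@)?[^:/?#]*(?::[0-9]*)?)"
    r"([^?#]*)"
    r"(?:\?([^#]*))?"
    r"(?:#(.*))?"
)

def parse_uri(uri):
    """Validate first, then decompose; raise ValueError if invalid."""
    if VALIDATION_RE.fullmatch(uri) is None:
        raise ValueError("invalid URI: %r" % uri)
    # Every string the tight pattern accepts is also accepted by the
    # loose one, so this match cannot fail.
    return DECOMPOSITION_RE.fullmatch(uri).groups()

print(parse_uri("foo://joe@example.com:1023/blah"))
# prints ('foo', 'joe@example.com:1023', '/blah', None, None)
```

With these patterns, 'foo://joe@example.com:0x3FF/blah' is rejected
with ValueError instead of being silently split after the "0".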

RFC-2396 provides a validation grammar, and the current draft of
2396bis provides a decomposition grammar.  I think both grammars are
valuable.  If either grammar is omitted from the spec, implementors
are likely to roll their own and make subtle mistakes.  If the spec
omits the validation grammar, some implementations are likely to do
only decomposition, not validation.  Invalid URIs will then seem to
work reasonably with some implementations and not others, which invites
interoperability problems.  It also makes the syntax difficult to
extend in the future, because some implementations already accept the
extended syntax and do who-knows-what with it.

It's inconvenient but possible to use a validation grammar for
decomposition, but it's not possible to do validation with a
decomposition grammar.  Because of the importance of validation for
interoperability and for extensibility of the syntax, I think if a spec
provides only one grammar it should be a validation grammar, but it
might be better to provide both.  Maybe the decomposition grammar need
not be provided in ABNF form, only regex form, whereas the validation
grammar would be provided in both forms.
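
To illustrate "inconvenient but possible", here is a rough sketch of
decomposing with a tight pattern.  The pattern is a simplified
invention of mine, not the RFC-2396 grammar, and the group names
(net_abs, abs_path, opaque) and the function decompose are
hypothetical.  The path can be matched by any one of three groups, so
the code has to find out which one matched.

```python
import re

# Simplified, hypothetical tight pattern (not the RFC-2396 grammar).
TIGHT_RE = re.compile(r"""
    (?P<scheme>[A-Za-z][A-Za-z0-9+.-]*) :
    (?:
        // (?P<authority>(?:[^@/?#]*@)?[^:/?#]*(?::[0-9]*)?)
           (?P<net_abs>/[^?#]*)?            # path after an authority
      | (?P<abs_path>/(?!/)[^?#]*)          # absolute path, no authority
      | (?P<opaque>[^/?#][^?#]*)            # opaque part, e.g. mailto:
    )
    (?:\?(?P<query>[^#]*))?
    (?:\#(?P<fragment>.*))?
""", re.VERBOSE)

def decompose(uri):
    """Validate and decompose in one pass; raise ValueError if invalid."""
    m = TIGHT_RE.fullmatch(uri)
    if m is None:
        raise ValueError("invalid URI: %r" % uri)
    g = m.groupdict()
    # The inconvenient part: detect which path token matched.
    path = ""
    for name in ("net_abs", "abs_path", "opaque"):
        if g[name] is not None:
            path = g[name]
            break
    return g["scheme"], g["authority"], path, g["query"], g["fragment"]

print(decompose("foo://example.com/a?b#c"))
# prints ('foo', 'example.com', '/a', 'b', 'c')
print(decompose("mailto:joe@example.com"))
# prints ('mailto', None, 'joe@example.com', None, None)
```

The reverse is not possible: a loose pattern has already thrown away
the distinctions needed to reject an invalid string.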

AMC
http://www.nicemice.net/amc/

Received on Tuesday, 2 March 2004 17:10:13 UTC