Re: [uri] <none> from Rob Cameron on 2003-06-11 (uri@w3.org from June 2003)

From: Rob Cameron <cameron@cs.sfu.ca>
Date: Wed, 11 Jun 2003 12:31:42 -0700
To: "Roy T. Fielding" <fielding@apache.org>, "Mark Thomson" <marktt@excite.com>
Cc: uri@w3.org
Message-Id: <200306111231.42386.cameron@cs.sfu.ca>

On June 9, 2003 04:18 pm, Roy T. Fielding wrote:
> > "A path is always defined for a URI, though the defined path may be
> > empty (zero length) or opaque (not containing any "/" delimiters)"
> >
> > The production for net-path says that abs-path is optional, so for a
> > URI like http://ABCD?query, we have both abs-path and rel-path
> > undefined and not empty and therefore path would be undefined. Do we
> > still have to assume that path is empty even when both abs-path and
> > rel-path are undefined ? or is the above statement from the draft
> > incorrect ?
>
> Bummer.  The statement is correct, but I'll need to fix the ABNF so
> that it always ends up with a matching production.
>
> Thanks for the report,
>
> ....Roy

I've been playing with an experimental grammar modification that
addresses this problem and also the following additional wrinkle:
http://ABCD+y is a legal URI according to the ABNF (as translated
to regexps by abnf2re).

>>> parseURI('http://ABCD+y')
('http', None, '//ABCD+y', None, None)

That is, because ABCD+y is not a legal authority, the
regular expression matching rules for http://ABCD+y backtrack
to accept //ABCD+y as a path.
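
The backtracking can be reproduced with a minimal Python sketch. The
regular expression here is a hand-written, heavily simplified stand-in
for the net-path / abs-path alternatives (rel-path is left out); it is
not the output of abnf2re, and HOST, PCHAR, ABS_PATH, OLD_URI and
old_parse are names assumed for illustration only.

```python
import re

# Simplified character classes (assumptions, not the draft's full ABNF).
HOST = r"[A-Za-z0-9.-]*"                  # "+" is not allowed in an authority
PCHAR = r"[A-Za-z0-9._~!$&'()*+,;=:@%-]"  # "+" is allowed in a path segment
ABS_PATH = rf"(?:/{PCHAR}*)+"             # "/" segment, repeated; segments may be empty

OLD_URI = re.compile(
    r"^(?P<scheme>[A-Za-z][A-Za-z0-9+.-]*):"
    rf"(?://(?P<authority>{HOST})(?P<net_abs>{ABS_PATH})?"  # net-path, abs-path optional
    rf"|(?P<abs>{ABS_PATH}))"                                # or a bare abs-path
    r"(?:\?(?P<query>[^#]*))?(?:#(?P<fragment>.*))?$"
)

def old_parse(uri):
    """Return (scheme, authority, path, query, fragment), or None if no match."""
    m = OLD_URI.match(uri)
    if m is None:
        return None
    # The path comes from whichever alternative matched; it may be undefined.
    path = m.group("net_abs") if m.group("net_abs") is not None else m.group("abs")
    return (m.group("scheme"), m.group("authority"), path,
            m.group("query"), m.group("fragment"))

print(old_parse("http://ABCD+y"))      # ('http', None, '//ABCD+y', None, None)
print(old_parse("http://ABCD?query"))  # ('http', 'ABCD', None, 'query', None)
```

The first call shows the net-path alternative failing at "+" and the
matcher falling back to abs-path with an empty first segment. The second
shows the case Mark reported: the path is left undefined (None) rather
than empty.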

To address both the problem reported by Mark and the
problem above, I have found that there may be merit
to simplifying the URI production to directly reflect the
opening statement of section 3:

"The generic URI syntax consists of a hierarchical sequence of
components referred to as the scheme, authority, path, query, and
fragment."

URI  = scheme ":" ["//" authority] path [ "?" query ] [ "#" fragment ]

This rule reflects the five-component structure and the statement
that a path always exists, even if it is empty.   It can be made
to work with either of the two following definitions of path:

path = abs-path / rel-path
path = segment *( "/" segment )

Running a parser based on either of these changes with
all the test cases listed in section 5.4 (both normal and 
abnormal examples) gives precisely the same results as 
with the grammar of bis-02 or bis-03.   (By the way, it
might be good to have some IPv6 literals in the test
examples.)

For the problem case of http://ABCD+y, the result is as follows:

>>> parseURI('http://ABCD+y')
('http', 'ABCD', '+y', None, None)
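
A parse of this shape can be reproduced with a minimal Python sketch of
the simplified URI rule, using path = segment *( "/" segment ). The
character classes are simplified assumptions rather than the draft's
full ABNF, and NEW_URI and new_parse are hypothetical names, not the
parseURI used above.

```python
import re

# Simplified character classes (assumptions, not the draft's full ABNF).
HOST = r"[A-Za-z0-9.-]*"
PCHAR = r"[A-Za-z0-9._~!$&'()*+,;=:@%-]"

# URI = scheme ":" ["//" authority] path [ "?" query ] [ "#" fragment ]
NEW_URI = re.compile(
    r"^(?P<scheme>[A-Za-z][A-Za-z0-9+.-]*):"
    rf"(?://(?P<authority>{HOST}))?"
    rf"(?P<path>{PCHAR}*(?:/{PCHAR}*)*)"   # segment *( "/" segment ); may be empty
    r"(?:\?(?P<query>[^#]*))?(?:#(?P<fragment>.*))?$"
)

def new_parse(uri):
    """Return (scheme, authority, path, query, fragment), or None if no match."""
    m = NEW_URI.match(uri)
    if m is None:
        return None
    return (m.group("scheme"), m.group("authority"), m.group("path"),
            m.group("query"), m.group("fragment"))

print(new_parse("http://ABCD+y"))      # ('http', 'ABCD', '+y', None, None)
print(new_parse("http://ABCD?query"))  # ('http', 'ABCD', '', 'query', None)
```

Because the path group is mandatory but may match the empty string, the
path is always defined: for http://ABCD?query it comes back as '' rather
than None.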

Arguably, this is a better parse if http://ABCD+y is to be 
accepted as a URI.   It is also a better parse if http://ABCD+y
is to be ruled out by the extra-grammatical restriction: "when
an authority exists, the path must either be empty or an
abs-path."  (Alternatively, "when an authority exists, the 
first segment of the path must be empty.")   
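
Either wording of the restriction could be applied as a check after
parsing. The following is a minimal Python sketch, assuming the parser
returns the authority as None when absent and the path as a (possibly
empty) string; the function names are hypothetical.

```python
def path_ok_with_authority(authority, path):
    """First wording: with an authority, the path is empty or an abs-path."""
    if authority is None:
        return True  # the restriction applies only when an authority exists
    return path == "" or path.startswith("/")

def first_segment_empty(path):
    """Second wording: the first segment of the path is empty."""
    return path.split("/")[0] == ""

# The two wordings agree on these sample paths.
for p in ["", "/", "/a/b", "+y", "a/b"]:
    assert path_ok_with_authority("ABCD", p) == first_segment_empty(p)

print(path_ok_with_authority("ABCD", "+y"))  # False: http://ABCD+y is ruled out
print(path_ok_with_authority("ABCD", ""))    # True:  http://ABCD?query is fine
```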

Overall, I think the theme of grammar simplification reflected in the
change from bis-02 to bis-03 is a good idea. One other area that
could use some attention is the grammar of IPv6 literals.

Received on Wednesday, 11 June 2003 15:31:48 UTC