Re: Feedback on Editor's Draft. from Seaborne, Andy on 2005-03-17 (public-rdf-dawg@w3.org from January to March 2005)

From: Seaborne, Andy <andy.seaborne@hp.com>
Date: Thu, 17 Mar 2005 10:20:19 +0000
To: "Thompson, Bryan B." <BRYAN.B.THOMPSON@saic.com>
Cc: "Eric Prud'hommeaux" <eric@w3.org>, "public-rdf-dawg@w3.org" <public-rdf-dawg@w3.org>
Message-ID: <423959E3.1070902@hp.com>
Thompson, Bryan B. wrote:
> Howard,
> 
> Your comments are quite to the point.  The problem is very much
> related to whitespace handling, which leads nicely into two
> underspecified aspects of the grammar:
> 
> 1. whitespace handling is not fully disclosed.  I believe that there is a
>    tacit assumption that whitespace is absorbed between tokens in the
>    "parser" section and is significant within tokens in the "lexer"
>    section.

It is explained before the grammar.

> 
>    Other W3C specifications that specify grammars, e.g., the XML grammar,
>    do not have as much appearance of being a stripped down grammar from
>    some specific tool.  If you look at the XML grammar, you will see that
>    it makes explicit statements concerning whitespace in all productions.

And XQuery takes a different approach again.
http://www.w3.org/TR/xquery/#whitespace-rules

and uses comments in the EBNF to say where whitespace is not ignored.
All the tools I know (ANTLR included : Token.SKIP) have ways to act in this mode.
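
To illustrate that mode, here is a minimal sketch in Python (not taken from the
SPARQL grammar or from any of the tools mentioned). The token names and
patterns are hypothetical and heavily simplified. Whitespace is matched like
any other token and then discarded, so it is ignored between tokens but still
significant inside one (a string literal, or a qname written without spaces).

```python
import re

# Hypothetical, simplified token set for illustration only.
TOKEN_SPEC = [
    ("WS",     r"\s+"),                            # matched, then skipped
    ("STRING", r'"[^"\\]*(?:\\.[^"\\]*)*"'),       # whitespace inside is kept
    ("QNAME",  r"[A-Za-z][A-Za-z0-9]*:[A-Za-z0-9_]*"),
    ("VAR",    r"\?[A-Za-z_][A-Za-z0-9_]*"),
    ("WORD",   r"[A-Za-z]+"),
    ("PUNCT",  r"[{}.:]"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))


def tokenize(text):
    """Return (kind, lexeme) pairs, dropping whitespace between tokens."""
    tokens = []
    pos = 0
    while pos < len(text):
        m = MASTER.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character {text[pos]!r} at {pos}")
        if m.lastgroup != "WS":      # the "skip" action for whitespace
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens
```

With these assumed rules, `foo:select` comes out as one QNAME token, while
`foo : select` comes out as three tokens, because the spaces end each token.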

>    One way to say this is that it is entirely expressed at the lexer
>    level.  Another way to look at it is that it is less linked to the
>    assumptions of a specific parser generator technology.

It is not linked to a parser generator technology as Eric's work has shown. 
Otherwise I would just put in the javacc grammar I use for testing.

> 
> 2. case sensitivity is not fully disclosed.  I have assumed that
>    keywords are case-insensitive based on various examples in the
>    specification, but the lexical rules do not show this and the
>    introduction to the grammar does not spell it out.  Is there
>    anything else that is case insensitive?  E.g., are prefix names
>    case sensitive?

Keywords are case insensitive (except "a").
I'll see that the text for this is visible.
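
As a rough sketch of that rule in Python: the keyword set below is a
hypothetical subset chosen for illustration, and the function name is made up.
Keywords are compared after case folding, while "a" is matched exactly.

```python
# Hypothetical subset of keywords, for illustration only.
KEYWORDS = {"SELECT", "WHERE", "UNION", "PREFIX", "BASE"}


def classify(word):
    """Classify a bare word as the keyword "a", another keyword, or a name."""
    if word == "a":
        return "A"            # case-sensitive: only lowercase "a" qualifies
    if word.upper() in KEYWORDS:
        return "KEYWORD"      # case-insensitive match
    return "NAME"
```

Under these assumptions, `select`, `SELECT` and `SeLeCt` all classify as
keywords, whereas `A` is not treated as the keyword `a`.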

	Andy

> 
> Thanks,
> 
> -bryan
>    
> -----Original Message-----
> From: Howard Katz
> To: Thompson, Bryan B.; 'Seaborne, Andy '; public-rdf-dawg-request@w3.org
> Cc: ''Eric Prud'hommeaux ' '; public-rdf-dawg@w3.org
> Sent: 3/16/2005 8:01 PM
> Subject: RE: Feedback on Editor's Draft.
> 
> Bryan,
> 
> It probably doesn't help you much, but I had problems with qnames in
> antlr as well in early versions of my XQuery query engine. I too hoisted
> QNAME into the parser trying to solve lexer difficulties, but if I recall
> correctly, that then allowed users to enter spaces between the prefix,
> colon, and localPart! I eventually gave up (for other reasons as well)
> and moved to javacc. I'm happier now (at least my analyst tells me I
> should be).
> 
> You got me curious and I went looking for antlr/QNAME productions. I've
> been away from antlr so long that the following xquery.g file from eXist
> just looks like gobbledygook to me now. If it's useful, more power to you:
> 
> qName returns [String name]
> {
> 	name= null;
> 	String name2;
> }
> :
> 	( ncnameOrKeyword COLON ncnameOrKeyword )
> 	=> name=nc1:ncnameOrKeyword COLON name2=ncnameOrKeyword
> 	{
> 		name= name + ':' + name2;
> 		#qName.copyLexInfo(#nc1);
> 	}
> 	|
> 	name=ncnameOrKeyword
> 	;
> 
> Howard
> 
> 
>  > -----Original Message-----
>  > From: public-rdf-dawg-request@w3.org
>  > [mailto:public-rdf-dawg-request@w3.org] On Behalf Of Thompson, Bryan B.
>  > Sent: Wednesday, March 16, 2005 4:08 PM
>  > To: 'Seaborne, Andy '; 'public-rdf-dawg-request@w3.org '; Thompson,
>  > Bryan B.
>  > Cc: ''Eric Prud'hommeaux ' '; ''public-rdf-dawg@w3.org ' '
>  > Subject: RE: Feedback on Editor's Draft.
>  >
>  >
>  >
>  > Andy,
>  >
>  > With reference to the QNAME lexical production, the issue revolves
>  > around ambiguity after the ":" in a QNAME.  There is ambiguity
>  > between NCNAME1 (in the 17Feb05 working draft production) and pretty
>  > much all of the other lexical tokens, e.g., "select", "union", etc.
>  > This is because the ANTLR-generated parser / lexer is unable to
>  > differentiate between the end of the QNAME and a QNAME that continues
>  > to absorb characters.
>  >
>  > For example:
>  >
>  >  foo:select
>  >
>  > could be a QNAME ("foo:") and the keyword "select", or a single
>  > QNAME ("foo:select").  We need the parser context in order to
>  > differentiate between these cases.  It can't be done in the lexer
>  > alone (or without the use of lexical state, which is pretty much
>  > the same thing).
>  >
>  > I liked the old flex/lex model for managing lexical state from
>  > the parser.  ANTLR handles this ... differently.  E.g., with
>  > multiplexed token streams and with syntactic predicates for limited
>  > lookahead.
>  >
>  > I have actually hoisted the QNAME production into the parser in order
>  > to get the additional context required to make the parser decisions.
>  > I am currently trying to figure out if I accept ":" as a legal QNAME
>  > in the same fashion or if I need to change it around to use lexical
>  > state (by one mechanism or another).
>  >
>  > If there is any non-implementation specific lesson here, it is that
>  > there are lexer / parser interactions in the SPARQL grammar.  It is
>  > my guess that supporting Turtle (when I migrate to the editor's draft)
>  > will identify other such interactions.
>  >
>  > With respect to test cases, I hope to produce some more, but that has
>  > not been my focus at the moment.
>  >
>  > Thanks,
>  >
>  > -bryan
>  >
>  > -----Original Message-----
>  > From: public-rdf-dawg-request@w3.org
>  > To: Thompson, Bryan B.
>  > Cc: 'Eric Prud'hommeaux '; 'public-rdf-dawg@w3.org '
>  > Sent: 3/16/2005 1:35 PM
>  > Subject: Re: Feedback on Editor's Draft.
>  >
>  >
>  > Thompson, Bryan B. wrote:
>  > > Per Andy's request, I started on migration of the parser implementation
>  > > to the Editor's Draft of SPARQL.  I spent the morning on this and I have
>  > > summarized some questions below that showed up during that time.  However,
>  > > I think that I am going to back off and continue with the last working
>  > > draft as the basis for my continuing efforts since I am more interested
>  > > in exploring SPARQL semantics, since migrating to the new grammar is
>  > > probably best done by a re-write (if I was really going to vet the
>  > > grammar in the Editor's Draft), and since I don't want to have to re-vet
>  > > the grammar multiple times as the draft is edited.
>  >
>  > The changes to the grammar should now be limited to anything coming out
>  > of the sorting discussions.  I hope you will continue to provide review
>  > and feedback - early working group feedback is very helpful.
>  >
>  > > Finally, from the
>  > > perspective of semantics, most syntax changes (e.g., the turtle syntax)
>  > > are not a big deal and it feels like a lot of effort to track a moving
>  > > document.
>  > >
>  > > That said, I would be happy to do a migration to the Editor's draft
>  > > once it gets into a "feature freeze" state and before it is released
>  > > to last call.  At that time I should be able to provide feedback not
>  > > only on the grammar, but also on the semantics.
>  > >
>  > > Some questions on Editor's Draft.
>  > >
>  > > ? Production [3] specifies <SparqlParserBase>, which is not a defined
>  > >   lexical production.
>  >
>  > Fixed - a side effect of running cpp over the grammar with -DBASE=... :-)
>  > which makes sure UNSAID does not creep back in.
>  >
>  > >
>  > > ? Production [56] (Q_URIRef) appears to have a whitespace character in
>  > >   the [^> ] expression so that a whitespace character is not permitted
>  > >   within the production.  However this is not clear on visual
>  > >   inspection of the production.
>  >
>  > ^ is the "not" character - that expression means "not space or >".
>  > Spaces cannot appear in URIs.
>  >
>  > >
>  > > ? Production 57 (QNAME_NS) permits ":" as a valid QNAME_NS since the
>  > >   NCNAME_PREFIX is optional in the grammar.  Is this an error?  If
>  > >   not, it makes the PrefixDecl production ambiguous.
>  >
>  > Simplified to just the first rule.
>  >
>  > >
>  > > ? Production 58 (QNAME) creates an ambiguity in the grammar since QNAME
>  > >   permits "<QNAME_NS> :" without any trailing context.  This ambiguity
>  > >   can be resolved in several ways.  For example, by making the "(
>  > >   NCNAME1 | NCNAME2 )" production non-optional for QNAME.
>  >
>  > I think this is an ANTLR-ism.  Tokenizing in the usual flex/javacc way
>  > with greedy consumption of input does not have this problem as far as I
>  > know.  I have made a change that should remove it anyway. [*] and see below.
>  >
>  > Aside: as you are using ANTLR, you can either do syntactic or semantic
>  > lookahead but then you may wish to make more wholesale changes to the
>  > token rules and reduce the number of token productions anyway.
>  >
>  > >
>  > > ? Production 58 (QNAME) would allow ":foo" as a QName.  This is NOT a
>  > >   legal XML QName.  If the intention is to permit such constructions,
>  > >   then the use of "QName" may prove confusing to implementors.
>  >
>  > ":foo" is legal, as are "foo:" and ":".  Yes, they are not XML QNames.
>  > But they are so widely referred to as qnames in the semantic web
>  > community, it would also be confusing to invent a new term.
>  >
>  > >
>  > > ? Production 51 (QName) This production causes conflicts in the
>  > >   grammar.  I modified the production to "(NCNAME_PREFIX)? COLON (
>  > >   NCNAME1 | NCNAME2 )", which requires something after the COLON and
>  > >   which I believe supports the uses of QName in the grammar.
>  >
>  > [*] This is related to the above.
>  >
>  > I modified QNAME (not the grammar rule QName) along the lines suggested.
>  >
>  > I defined token NCNAME as (NCNAME1 | NCNAME2) and used that throughout.
>  >
>  > Aside: NCNAME1 and NCNAME2 are with and without leading "_" because only
>  > one kind is legal for prefixes, but both are local names.  qnames can't
>  > start with _ because that looks like a blank node.  Other fun and games to
>  > exclude trailing dots in qnames as WG decision.
>  >
>  > >
>  > > ? Productions 59 (BNODE) and 60 (BNODE_LABEL) are identical.  Note
>  > >   that production 59 (BNODE) is not used and should presumably be
>  > >   dropped.
>  >
>  > Removed BNODE - I had changed the name and didn't remove the definition
>  > in the formatting system.
>  >
>  > >
>  > > Thanks,
>  > >
>  > > -bryan
>  > >
>  >
>  > Thanks for the feedback. I'll need to go back and check but with the
>  > changes I described, the grammar passes the syntax tests I have.
>  >
>  > Bryan (and anyone else) - do you have any syntax test cases?  If so, I'd
>  > be happy to collect them all together, or you can add them to the DAWG
>  > test suite.
>  >
>  > 	Andy
>  >
>  >
>
Received on Thursday, 17 March 2005 10:20:33 UTC