
RE: N3 and Turtle grammars

From: Seaborne, Andy <andy.seaborne@hp.com>
Date: Tue, 20 Jun 2006 10:22:24 +0100
Message-ID: <DF5E364A470421429AE6DC96979A4F6FAFF6B0@sdcexc04.emea.cpqcorp.net>
To: "Dan Connolly" <connolly@w3.org>
Cc: <public-cwm-talk@w3.org>, <navyrain@navyrain.net>




-------- Original Message --------
> From: Dan Connolly <mailto:connolly@w3.org>
> Date: 19 June 2006 17:20
> 
> On Mon, 2006-06-19 at 16:12 +0100, Seaborne, Andy wrote:
> > Some experiences while trying to write a parser for Turtle: I had
> > hoped to have a combined N3/Turtle parser with a switch to restrict
> > to Turtle. This is beginning to look hard/impossible because of #1
> > and #2 (well - nothing is impossible, it just means the work has to be
> > moved out of the parser into a later processing stage).
> 
> I think that this goal is important enough that we should change the
> languages to make it feasible... 
> 
> > My current development Turtle grammar is attached - it passes the
> > Turtle test suite but I don't consider it finished.  It's extracted
> > from SPARQL so it allows dots inside qnames.
> > 
> > == #1 : Tokenizing
> [...]
> > 
> > In n3.n3, I see:
> > 
> > langcode	cfg:matches  	"[a-z]+(-[a-z0-9]+)*";
> > 		cfg:canStartWith 	"a".
> 
> I just hit that issue in the derivative of n3.n3 that I'm working on.
>   http://www.w3.org/2000/10/swap/grammar/notation3.bnf
> 
> I changed it so that the @ is part of the langcode token:
> 
> [33] langcode	::= "@" [a-z]+ ("-" [a-z0-9]+)*
> 
> I haven't convinced timbl to move away from n3.n3 as the "truth" yet,
> but the discussion has started, and I'm getting there. 

I also match "@prefix" in preference to a langcode. The tokenizer I use
returns first match by the ordering in the grammar file.  This could be
done for a fixed set of keywords, but "@is" is also the language tag for
Icelandic.
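
As a rough illustration of the first-match-by-grammar-order behaviour
described above, here is a minimal Python sketch. The rule names and the
first_match helper are made up for this example and do not come from any
real tokenizer; the langcode pattern is the one from notation3.bnf with
the "@" folded into the token.

```python
import re

# Hypothetical token rules, tried in order; the first rule that matches wins.
RULES = [
    ("PREFIX_KW", re.compile(r"@prefix\b")),
    ("LANGCODE", re.compile(r"@[a-z]+(?:-[a-z0-9]+)*")),
]


def first_match(rules, text, pos=0):
    """Return (rule name, matched text) for the first rule matching at pos."""
    for name, pattern in rules:
        m = pattern.match(text, pos)
        if m:
            return name, m.group()
    return None


# Adding a hypothetical "@is" keyword ahead of LANGCODE changes the result
# for the Icelandic language tag, which is the problem noted above.
WITH_IS_KEYWORD = [("IS_KW", re.compile(r"@is\b"))] + RULES
```

With RULES, "@prefix" comes back as a keyword and "@is" as a langcode;
with WITH_IS_KEYWORD, the same "@is" input is claimed by the keyword rule.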

> 
> My current target is a JavaScript parser based on
> http://www.navyrain.net/compilergeneratorinjavascript/ 
> 
> I've got python code that converts the .bnf to turtle, then runs some
> N3 rules on the turtle to simplify the grammar, then reads the result
> and simplifies it to a JSON structure and prints that out.  
> 
> The Makefile has the details...
> 
> [[
> ebnf: notation3.rdf ebnf.rdf notation3.json
> 
> notation3.json: notation3-bnf.n3 gramLL1.py
> 	PYTHONPATH=$(HOME)/lib/python:../.. $P gramLL1.py notation3-bnf.n3 >$@
> 
> CHATTY=0
> 
> notation3-bnf.n3: notation3.n3 ebnf2bnf.n3 first_follow.n3
> 	$P $C notation3.n3 ebnf2bnf.n3  --chatty=$(CHATTY) \
> 		--think --data >$@
> 
> notation3.n3: notation3.bnf ebnf2turtle.py
> 	$P ebnf2turtle.py notation3.bnf n3 'http://www.w3.org/2000/10/swap/grammar/notation3#' >$@
> 
> ebnf.rdf: ebnf.n3
> notation3.rdf: notation3.n3
> ]]
>  -- http://www.w3.org/2000/10/swap/grammar/Makefile
> 
> 
> 
> 
> > How does the N3 parser in cwm tell "a" (as in "rdf:type") apart from
> > "a" the langcode?
> 
> With a hand-crafted python lexer/parser that I want to get rid of.
> http://www.w3.org/2000/10/swap/notation3.py

Ah - I thought it was derived from n3.n3.

It looks to be a recursive descent parser which uses context-sensitive
token sets (the parser looks for particular tokens based on where it is
in the code).

Test: 
  <x> <x> => .
fails because => is not in the right position.

> 
> > [[
> > Aside: I see, in n3.n3:
> > 
> > # - @keywords affects tokenizing
> > 
> > Isn't this the same thing as typedef's in C where the token tables
> > change as the language is parsed? I don't know how to handle this in
> > javacc nor antlr.
> > ]]
> 
> I haven't worked all the way thru that issue yet either.
> 
> 
> > == #2 : align whitespace between N3 and Turtle
> > 
> > This is not legal Turtle:
> > 
> > <a><b><c> .
> > 
> > by:
> > [4]	triples 	::= 	subject ws+ predicateObjectList
> > 
> > because it has no whitespace between the subject/predicate.  But it is
> > N3 and is reasonable RDF.  It also means the parser itself can't be
> > whitespace independent, leaving whitespace handling to the lexer to
> > merely split terminals as necessary.
> 
> Let's please change turtle there. Dave?
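
To make the whitespace point concrete, here is a small Python sketch of a
lexer that splits terminals on its own, so the grammar never has to
mention whitespace. The TOKEN pattern and the tokenize function are
illustrative assumptions covering only IRI references and the final dot;
they are not the Turtle or N3 token definitions.

```python
import re

# Optional leading whitespace, then either <...> (no newline inside) or ".".
TOKEN = re.compile(r"\s*(<[^>\n]*>|\.)")


def tokenize(text):
    """Split text into terminals; whitespace is skipped, never required."""
    text = text.rstrip()
    tokens, pos = [], 0
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if not m:
            raise ValueError(f"unexpected input at offset {pos}")
        tokens.append(m.group(1))
        pos = m.end()
    return tokens
```

Under these assumptions "<a><b><c> ." and "<a> <b> <c> ." give the same
token list, so a whitespace-free grammar rule can accept both.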
> 
> 
> > == #3 \u escapes
> > 
> > Long form: (do not do it like SPARQL!)
> > http://lists.w3.org/Archives/Public/public-rdf-dawg/2006JanMar/0443.html
> 
> I'm ignoring \u escapes at least for now, sorta hoping they'll go away
> ;-) 

:-)

> 
> In that sense, I find it appealing to treat \u escapes in a layer
> that's separate from lexical analysis and parsing... 
> 
> > Suggestion:
> > 
> > Define processing as:
> > 
> > 1/ apply \u escaping at the lowest level - it applies to the input
> > stream, so by the end of this step the parser does not see \u as an
> > escape sequence. \u works everywhere
> > 
> > At this point we have a stream of characters or UTF-8 depending on
> > your toolkit technology.
> > 
> > 2/ Tokenizing - to create a stream of tokens (usually done lazily)
> > 
> > 3/ Parsing - apply the grammar
> > 
> > then \u processing does not need special text or special cases.
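
A minimal Python sketch of step 1 as a separate pass over the input, run
before any tokenizing. It assumes the escape is exactly "\u" followed by
four hex digits and, for brevity, does not treat a doubled backslash
specially; the unescape_u name is invented for this example.

```python
import re

# Assumption: an escape is "\u" plus exactly four hexadecimal digits.
U_ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})")


def unescape_u(text):
    """Replace each \\uXXXX with its character before tokenizing starts."""
    return U_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)
```

One consequence of doing this first: an escaped delimiter such as \u003E
reaches the tokenizer as a plain ">", indistinguishable from one typed
directly.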
> > 
> > 
> > == Odds and Ends from n3.n3:
> > 
> >     explicituri 	cfg:matches 	"<[^>]*>";
> > 
> > That includes newlines inside IRIs
> 
> I changed that in notation3.bnf (and hence
> notation3.n3 and notation3.rdf ) just the other day.
> 
> > The qname name token says (removed the \u stuff:)
> > 
> >    (([A-Z_a-z][\\-0-9A-Z_a-z]*)?:)?[A-Z_a-z][\\-0-9A-Z_a-z]*
> > 
> > which makes the ":" optional.
> 
> Yes, that's closely connected to the @keywords issues. In N3, you can
> write terms without the colons: 
> 
>  @keywords is, of a.
>  @prefix : <#>.
> 
>  sky color blue.
> 
> I can't remember if this is documented... ah yes... see section
> 
>   Getting rid of the leading ":" with @keywords of
> http://www.w3.org/2000/10/swap/doc/Shortcuts.html 
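
The optional colon in the qname pattern quoted above can be checked
directly. This Python sketch assumes the character ranges are a-z
throughout (with the \u ranges left out, as in the message) and writes
the doubled backslash of the N3 string as a single regex escape.

```python
import re

# Assumed reading of the quoted pattern, ASCII ranges only.
QNAME = re.compile(r"(([A-Z_a-z][\-0-9A-Z_a-z]*)?:)?[A-Z_a-z][\-0-9A-Z_a-z]*")


def is_qname(text):
    """True if the whole string matches the assumed qname pattern."""
    return QNAME.fullmatch(text) is not None
```

Under that reading "ex:sky", ":sky" and bare "sky" all match, and so does
a bare "a", which is where the pattern runs into the keyword question.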

I confess to not being convinced by this natural language feel - it
seems to be neither one thing nor another.

But one approach would be to layer another (simple) language over the
top of a core N3 whose output is N3.

== #4 Signed numbers

There is another one I just remembered.

In N3 and Turtle, the sign of a signed number must have no whitespace
between it and the digits.  Signed number parsing can be done by the
tokenizer (actually, with whitespace it could as well but that ) 

In SPARQL, "-3" and "- 3" are both expressions; the "-" is a unary operator
and it's handled in the grammar.  That is how the programming languages I
tried do it, although that does not necessarily make it a good idea because
of the use of signed numbers in triple pattern literals.

Since then, I have worked out how to deal with N3/Turtle "no white
space" signed numbers in the SPARQL grammar (it is not in there as it's
CR).  If there were syntax changes anyway in SPARQL, I'd suggest not
allowing the "- 3" form in literals in SPARQL triple patterns.
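
To show the two treatments side by side, here is a small Python sketch
restricted to integers. Both function names and both patterns are
assumptions made for the example; neither is taken from the N3, Turtle or
SPARQL grammars.

```python
import re

# Tokenizer-level treatment: the sign belongs to the number token,
# so no whitespace can appear between sign and digits.
SIGNED_INT = re.compile(r"[+-]?[0-9]+")

# Grammar-level treatment: signs and digits are separate tokens.
EXPR_TOKEN = re.compile(r"\s*([+-]|[0-9]+)")


def lex_signed(text):
    """Return the integer if text is one signed-number token, else None."""
    m = SIGNED_INT.fullmatch(text)
    return int(m.group()) if m else None


def parse_unary(text):
    """Parse optional unary signs followed by digits, as an expression."""
    text = text.strip()
    pos, sign = 0, 1
    while True:
        m = EXPR_TOKEN.match(text, pos)
        if not m:
            raise ValueError(f"unexpected input at offset {pos}")
        pos = m.end()
        tok = m.group(1)
        if tok == "-":
            sign = -sign
        elif tok != "+":
            if pos != len(text):
                raise ValueError("trailing input after number")
            return sign * int(tok)
```

Here lex_signed accepts "-3" but rejects "- 3", while parse_unary accepts
both, which is the difference described above.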

	Andy

> 
> 
> 
> --
> Dan Connolly, W3C http://www.w3.org/People/Connolly/
> D3C2 887B 0F92 6005 C541  0875 0F91 96DE 6E52 C29E
Received on Tuesday, 20 June 2006 09:25:26 GMT
