Re: N3 and Turtle grammars from Dan Connolly on 2006-06-19 (public-cwm-talk@w3.org from April to June 2006)

From: Dan Connolly <connolly@w3.org>
Date: Mon, 19 Jun 2006 11:20:25 -0500
To: "Seaborne, Andy" <andy.seaborne@hp.com>
Cc: public-cwm-talk@w3.org, navyrain@navyrain.net
Message-Id: <1150734025.19088.114.camel@dirk.w3.org>
On Mon, 2006-06-19 at 16:12 +0100, Seaborne, Andy wrote:
> Some experiences while trying to write a parser for Turtle: I had hoped
> to have a combined N3/Turtle parser with a switch to restrict to Turtle.
> This is beginning to look hard/impossible because of #1 and #2 (well -
> nothing is impossible, it just means the work has to be moved out of the
> parser into a later processing stage).

I think that this goal is important enough that we should change
the languages to make it feasible...

> My current development Turtle grammar is attached - it passes the Turtle
> test suite but I don't consider it finished.  It's extracted from SPARQL
> so it allows dots inside qnames.
> 
> == #1 : Tokenizing
[...]
> 
> In n3.n3, I see:
> 
> langcode	cfg:matches  	"[a-z]+(-[a-z0-9]+)*";
> 		cfg:canStartWith 	"a".

I just hit that issue in the derivative of n3.n3 that I'm working on.
  http://www.w3.org/2000/10/swap/grammar/notation3.bnf

I changed it so that the @ is part of the langcode token:

[33] langcode	::= "@" [a-z]+ ("-" [a-z0-9]+)*
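
A minimal sketch, in Python, of why folding the "@" into the token helps:
once langcode carries its own "@", a bare "a" can only be the rdf:type
keyword, so the lexer needs no context to tell the two apart. Only the
LANGCODE pattern comes from rule [33]; the other patterns and the name
lex() are simplified assumptions, and directives such as @prefix are not
handled here.

```python
import re

# Token patterns, tried in order. LANGCODE mirrors rule [33];
# STRING, QNAME, KEYWORD_A and DOT are simplified assumptions.
TOKEN_SPEC = [
    ("STRING", r'"[^"\\\n]*"'),
    ("LANGCODE", r"@[a-z]+(?:-[a-z0-9]+)*"),
    ("QNAME", r"[A-Za-z_][A-Za-z0-9_-]*:[A-Za-z_][A-Za-z0-9_-]*"),
    ("KEYWORD_A", r"a(?![A-Za-z0-9_:-])"),
    ("DOT", r"\."),
    ("WS", r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))


def lex(text):
    """Return (token_name, lexeme) pairs, dropping whitespace."""
    tokens = []
    pos = 0
    while pos < len(text):
        m = MASTER.match(text, pos)
        if m is None:
            raise SyntaxError(f"unexpected character at offset {pos}: {text[pos]!r}")
        if m.lastgroup != "WS":
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens
```

With these patterns, the "a" after the subject and the "a" after the "@"
end up as different token types, even though the characters are the same.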

I haven't convinced timbl to move away from n3.n3 as the "truth" yet,
but the discussion has started, and I'm getting there.

My current target is a JavaScript parser based on
http://www.navyrain.net/compilergeneratorinjavascript/

I've got python code that converts the .bnf to turtle,
then runs some N3 rules on the turtle to simplify the
grammar, then reads the result and simplifies it
to a JSON structure and prints that out.

The Makefile has the details...

[[
ebnf: notation3.rdf ebnf.rdf notation3.json

notation3.json: notation3-bnf.n3 gramLL1.py
	PYTHONPATH=$(HOME)/lib/python:../.. $P gramLL1.py notation3-bnf.n3 >$@

CHATTY=0

notation3-bnf.n3: notation3.n3 ebnf2bnf.n3 first_follow.n3
	$P $C notation3.n3 ebnf2bnf.n3  --chatty=$(CHATTY) \
		--think --data >$@

notation3.n3: notation3.bnf ebnf2turtle.py
	$P ebnf2turtle.py notation3.bnf n3 'http://www.w3.org/2000/10/swap/grammar/notation3#' >$@

ebnf.rdf: ebnf.n3
notation3.rdf: notation3.n3
]]
 -- http://www.w3.org/2000/10/swap/grammar/Makefile




> How does the N3 parser in cwm tell "a" (as in "rdf:type") apart from "a"
> the langcode?

With a hand-crafted python lexer/parser that I want to get rid of.
http://www.w3.org/2000/10/swap/notation3.py

> [[
> Aside: I see, in n3.n3:
> 
> # - @keywords affects tokenizing
> 
> Isn't this the same thing as typedef's in C where the token tables
> change as the language is parsed? I don't know how to handle this in
> javacc nor antlr.
> ]]

I haven't worked all the way thru that issue yet either.


> == #2 : align whitespace between N3 and Turtle
> 
> This is not legal Turtle:
> 
> <a><b><c> .
> 
> by:
> [4]	triples 	::= 	subject ws+ predicateObjectList
> 
> because it has no whitespace between the subject/predicate.  But it is
> N3 and is reasonable RDF.  It also means the parser itself can't be
> whitespace independent, leaving whitespace handling to the lexer to
> merely split terminals as necessary.

Let's please change turtle there. Dave?
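
A minimal sketch, in Python, of the lexer-only whitespace handling Andy
describes: whitespace just separates terminals and is then discarded, so
the grammar never has to mention ws. The token set and the name
split_terminals() are assumptions for illustration, not the Turtle or N3
token definitions.

```python
import re

# Simplified terminals: IRI references, the statement-ending dot, and
# whitespace. A real lexer would have many more token types.
TOKEN = re.compile(
    r"""
      (?P<IRI><[^<>\s]*>)
    | (?P<DOT>\.)
    | (?P<WS>\s+)
    """,
    re.VERBOSE,
)


def split_terminals(text):
    """Split text into (token_name, lexeme) pairs; whitespace is dropped."""
    tokens = []
    pos = 0
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None:
            raise SyntaxError(f"unexpected character at offset {pos}: {text[pos]!r}")
        if m.lastgroup != "WS":
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens
```

Under this scheme "<a><b><c> ." and "<a> <b> <c> ." produce the same token
stream, so a rule like "triples ::= subject predicateObjectList" accepts
both.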


> == #3 \u escapes
> 
> Long form: (do not do it like SPARQL!)
> http://lists.w3.org/Archives/Public/public-rdf-dawg/2006JanMar/0443.html

I'm ignoring \u escapes at least for now, sorta hoping they'll go
away ;-)

In that sense, I find it appealing to treat \u escapes in a layer
that's separate from lexical analysis and parsing...
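
A minimal sketch, in Python, of what such a separate layer could look
like: a pass over the raw character stream that runs before any
tokenizing. The function name unescape_stream() is hypothetical, and
two choices here are assumptions rather than anything settled in this
thread: the 8-digit \U form is accepted, and a doubled backslash is left
alone so that it does not start an escape.

```python
import re

# Match a doubled backslash first so that it is skipped, then the
# 4-digit \u form, then the 8-digit \U form.
_ESCAPE = re.compile(r"\\\\|\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})")


def unescape_stream(text):
    """Replace \\uXXXX and \\UXXXXXXXX escapes before lexing."""

    def replace(match):
        hex_digits = match.group(1) or match.group(2)
        if hex_digits is None:
            # Doubled backslash: pass it through untouched.
            return match.group(0)
        # chr() raises ValueError for code points above 0x10FFFF.
        return chr(int(hex_digits, 16))

    return _ESCAPE.sub(replace, text)
```

The output of this pass is what the tokenizer would read, so neither the
token definitions nor the grammar would need to mention \u at all.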

> Suggestion:
> 
> Define processing as:
> 
> 1/ apply \u escaping at the lowest level - applies to the input stream
> so by the end of this, the parser does not see \u as an escape sequence.
> \u works everywhere 
> 
> At this point we have a stream of characters or UTF-8 depending on your
> toolkit technology.
> 
> 2/ Tokenizing - to create a stream of tokens (usually done lazily)
> 
> 3/ Parsing - apply the grammar
> 
> then \u processing does not need special text or special cases.
> 
> 
> == Odds and Ends from n3.n3:
> 
>     explicituri 	cfg:matches 	"<[^>]*>";
> 
> That includes newlines inside IRIs

I changed that in notation3.bnf (and hence
notation3.n3 and notation3.rdf ) just the other day.
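
A small Python check of the point Andy raises: the negated class [^>]
matches newlines, so the n3.n3 pattern accepts an IRI that spans lines.
The tightened pattern below is only an assumption for illustration; the
actual change made in notation3.bnf is not shown in this message.

```python
import re

# Pattern quoted from n3.n3 above.
ORIGINAL = re.compile(r"<[^>]*>")

# Hypothetical tightening that also excludes line breaks.
NO_NEWLINES = re.compile(r"<[^>\r\n]*>")


def accepts(pattern, text):
    """True if the whole text is a single match for the pattern."""
    return pattern.fullmatch(text) is not None
```

For example, accepts(ORIGINAL, "<http://example.org/a\nb>") is true,
while the same call with NO_NEWLINES is false.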

> The qname name token says (removed the \u stuff:)
> 
>    (([A-Z_a-z][\\-0-9A-Z_a-z]*)?:)?[A-Z_a-z][\\-0-9A-Z_a-z]*
> 
> which makes the ":" optional.

Yes, that's closely connected to the @keywords issues. In
N3, you can write terms without the colons:

 @keywords is, of, a.
 @prefix : <#>.

 sky color blue.

I can't remember if this is documented... ah yes... see
section

  Getting rid of the leading ":" with @keywords
of http://www.w3.org/2000/10/swap/doc/Shortcuts.html
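
A minimal sketch, in Python, of how a tokenizer might take @keywords into
account: once the directive has been seen, a bare word is a keyword if it
was declared, and otherwise it is read as a local name in the default
namespace. The function names and the returned tuples are hypothetical,
and the directive parsing is deliberately naive.

```python
def parse_keywords(directive):
    """Extract the declared names from a line like '@keywords is, of, a.'"""
    body = directive.strip()
    if not body.startswith("@keywords") or not body.endswith("."):
        raise ValueError(f"not a @keywords directive: {directive!r}")
    names = body[len("@keywords"):-1].split(",")
    return {name.strip() for name in names if name.strip()}


def classify(word, keywords):
    """Classify a bare word (no colon) given the declared keywords."""
    if word in keywords:
        return ("KEYWORD", word)
    # Not declared: treat it as ':word' in the default namespace.
    return ("QNAME", ":" + word)
```

With the directive above in effect, "sky", "color" and "blue" would all
classify as :sky, :color and :blue, while "a" stays a keyword. This is the
sense in which the token tables change as the input is read.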



-- 
Dan Connolly, W3C http://www.w3.org/People/Connolly/
D3C2 887B 0F92 6005 C541  0875 0F91 96DE 6E52 C29E
Received on Monday, 19 June 2006 16:20:36 UTC