Re: N3 and Turtle grammars

On Mon, 2006-06-19 at 16:12 +0100, Seaborne, Andy wrote:
> Some experiences while trying to write a parser for Turtle: I had hoped
> to have a combined N3/Turtle parser with a switch to restrict to Turtle.
> This is beginning to look hard/impossible because of #1 and #2 (well -
> nothing is impossible, it just means the work has to be moved out of the
> parser into a later processing stage).

I think that this goal is important enough that we should change
the languages to make it feasible...

> My current development Turtle grammar is attached - it passes the Turtle
> test suite but I don't consider it finished.  It's extracted from SPARQL
> so it allows dots inside qnames.
> == #1 : Tokenizing
> In n3.n3, I see:
> langcode	cfg:matches  	"[a-z]+(-[a-z0-9]+)*";
> 		cfg:canStartWith 	"a".

I just hit that issue in the derivative of n3.n3 that I'm working on.

I changed it so that the @ is part of the langcode token:

[33] langcode	::= "@" [a-z]+ ("-" [a-z0-9]+)*
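
To make the effect of that change concrete, here is a rough Python sketch of
a regex tokenizer in which the "@" belongs to the langcode token, so a bare
"a" can only ever be the rdf:type shorthand. The token names, the patterns
other than LANGCODE, and the tokenize() helper are all my own assumptions
for illustration; none of them come from n3.n3 or notation3.bnf, and
directives such as @prefix would need their own rule ahead of LANGCODE.

```python
import re

# Hypothetical token table: only LANGCODE mirrors production [33];
# every other name and pattern is an illustrative assumption.
TOKEN_SPEC = [
    ("STRING",   r'"[^"\\\n]*(?:\\.[^"\\\n]*)*"'),
    ("LANGCODE", r"@[a-z]+(?:-[a-z0-9]+)*"),      # "@" is part of the token
    ("IRI",      r"<[^>\s]*>"),
    ("KW_A",     r"a(?![A-Za-z0-9_\-:])"),        # bare "a" = rdf:type
    ("QNAME",    r"(?:[A-Za-z_][A-Za-z0-9_\-]*)?:[A-Za-z_][A-Za-z0-9_\-]*"),
    ("PUNCT",    r"[.;,]"),
    ("WS",       r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))


def tokenize(text):
    """Return (kind, lexeme) pairs, dropping whitespace tokens."""
    pos, tokens = 0, []
    while pos < len(text):
        m = MASTER.match(text, pos)
        if m is None:
            raise SyntaxError(f"unexpected character at offset {pos}")
        if m.lastgroup != "WS":
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens
```

With this table, the "a" in `:x a :y .` comes out as KW_A, while the "a" in
`"chat"@a` is swallowed by LANGCODE, so no lexer state is needed to tell
them apart.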

I haven't convinced timbl to move away from n3.n3 as the "truth" yet,
but the discussion has started, and I'm getting there.

My current target is a JavaScript parser based on

I've got python code that converts the .bnf to turtle,
then runs some N3 rules on the turtle to simplify the
grammar, then reads the result and simplifies it
to a JSON structure and prints that out.

The Makefile has the details...

ebnf: notation3.rdf ebnf.rdf notation3.json

notation3.json: notation3-bnf.n3
	PYTHONPATH=$(HOME)/lib/python:../.. $P notation3-bnf.n3 >$@


notation3-bnf.n3: notation3.n3 ebnf2bnf.n3 first_follow.n3
	$P $C notation3.n3 ebnf2bnf.n3  --chatty=$(CHATTY) \
		--think --data >$@

notation3.n3: notation3.bnf
	$P notation3.bnf n3 '' >$@

ebnf.rdf: ebnf.n3
notation3.rdf: notation3.n3

> How does the N3 parser in cwm tell "a" (as in "rdf:type") apart from "a"
> the langcode?

With a hand-crafted python lexer/parser that I want to get rid of.

> [[
> Aside: I see, in n3.n3:
> # - @keywords affects tokenizing
> Isn't this the same thing as typedef's in C where the token tables
> change as the language is parsed? I don't know how to handle this in
> javacc nor antlr.
> ]]

I haven't worked all the way thru that issue yet either.

> == #2 : align whitespace between N3 and Turtle
> This is not legal Turtle:
> <a><b><c> .
> by:
> [4]	triples 	::= 	subject ws+ predicateObjectList
> because it has no whitespace between the subject/predicate.  But it is
> N3 and is reasonable RDF.  It also means the parser itself can't be
> whitespace independent, leaving whitespace handling to the lexer to
> merely split terminals as necessary.

Let's please change turtle there. Dave?
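
For what it's worth, here is a minimal Python sketch of the arrangement Andy
describes, where the lexer alone deals with whitespace and the parser only
ever sees terminals. The token set (IRIs and the final dot only) and the
lex() and parse_triple() helpers are hypothetical and far smaller than
either real grammar.

```python
import re

# Hypothetical, deliberately tiny token set: IRIs and the terminating dot.
# Leading whitespace is consumed by the lexer and never reaches the parser.
TOKEN = re.compile(r"\s*(?:(?P<IRI><[^>\s]*>)|(?P<DOT>\.))")


def lex(text):
    """Split text into (kind, lexeme) terminals, with or without spaces."""
    text = text.rstrip()
    pos, out = 0, []
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None:
            raise SyntaxError(f"unexpected character at offset {pos}")
        out.append((m.lastgroup, m.group(m.lastgroup)))
        pos = m.end()
    return out


def parse_triple(tokens):
    """Accept exactly subject, predicate, object, dot; no ws rule needed."""
    if [kind for kind, _ in tokens] != ["IRI", "IRI", "IRI", "DOT"]:
        raise SyntaxError("expected: IRI IRI IRI .")
    return tuple(lexeme for _, lexeme in tokens[:3])
```

Under this split, `<a><b><c> .` and `<a> <b> <c> .` produce the same token
stream, so the grammar has no reason to mention ws+ at all.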

> == #3 \u escapes
> Long form: (do not do it like SPARQL!)

I'm ignoring \u escapes at least for now, sorta hoping they'll go
away ;-)

In that sense, I find it appealing to treat \u escapes in a layer
that's separate from lexical analysis and parsing...
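
Such a layer could be as small as a pre-pass over the character stream.
The Python sketch below is only an illustration under my own assumptions:
the function name unescape_stream() is made up, it handles the 4-digit \u
and 8-digit \U forms, and it does not try to recognise an escaped backslash
in front of the "u" (a real implementation would have to decide what
\\u0041 means).

```python
import re

# Matches \uXXXX (4 hex digits) or \UXXXXXXXX (8 hex digits).
_ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})")


def unescape_stream(text):
    """Replace \\u and \\U escapes before tokenizing or parsing.

    Assumption: escapes are resolved everywhere in the input, so later
    stages never see them. Escaped backslashes are not special-cased here.
    """
    def replace(match):
        hex_digits = match.group(1) or match.group(2)
        return chr(int(hex_digits, 16))

    return _ESCAPE.sub(replace, text)
```

Because this runs before anything else, the tokenizer and grammar need no
rules for \u at all, which is the attraction of keeping it separate.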

> Suggestion:
> Define processing as:
> 1/ apply \u escaping at the lowest level - applies to the input stream
> so by the end of this, the parser does not see \u as an escape sequence.
> \u works everywhere 
> At this point we have a stream of characters or UTF-8 depending on your
> toolkit technology.
> 2/ Tokenizing - to create a stream of tokens (usually done lazily)
> 3/ Parsing - apply the grammar
> then \u processing does not need special text or special cases.
> == Odds and Ends from n3.n3:
>     explicituri 	cfg:matches 	"<[^>]*>";
> That includes newlines inside IRIs

I changed that in notation3.bnf (and hence
notation3.n3 and notation3.rdf ) just the other day.

> The qname name token says (with the \u stuff removed):
>    (([A-Z_a-z][\\-0-9A-Z_a-z]*)?:)?[A-Z_a-][\\-0-9A-Z_a-]*
> which makes the ":" optional.

Yes, that's closely connected to the @keywords issues. In
N3, you can write terms without the colons:

 @keywords is, of, a.
 @prefix : <#>.

 sky color blue.
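
This is also why the token table depends on what has been parsed so far.
As a rough Python sketch of my understanding (the function name and the
token labels are hypothetical): once @keywords has been seen, a bare word
is a keyword if it is in the declared list, and otherwise it is read as if
it had a leading ":".

```python
def classify_barewords(words, keywords):
    """Classify bare words after an @keywords declaration.

    Assumption: any bare word not in `keywords` is treated as a local name
    in the default namespace, i.e. as if it were written with a leading ":".
    """
    classified = []
    for word in words:
        if word in keywords:
            classified.append(("KEYWORD", word))
        elif ":" in word:
            classified.append(("QNAME", word))        # already a qname
        else:
            classified.append(("QNAME", ":" + word))  # implied default prefix
    return classified
```

Fed the statement above with the keyword set {"is", "of", "a"}, the three
words come back as :sky, :color and :blue.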

I can't remember if this is documented... ah yes... see

  Getting rid of the leading ":" with @keywords

Dan Connolly, W3C
D3C2 887B 0F92 6005 C541  0875 0F91 96DE 6E52 C29E

Received on Monday, 19 June 2006 16:20:36 UTC