N3 and Turtle grammars

From: Seaborne, Andy <andy.seaborne@hp.com>
Date: Mon, 19 Jun 2006 16:12:03 +0100
Message-ID: <DF5E364A470421429AE6DC96979A4F6FAFF5E6@sdcexc04.emea.cpqcorp.net>
To: <public-cwm-talk@w3.org>
Some experiences while trying to write a parser for Turtle: I had hoped
to have a combined N3/Turtle parser with a switch to restrict to Turtle.
This is beginning to look hard/impossible because of #1 and #2 (well -
nothing is impossible, it just means the work has to be moved out of the
parser into a later processing stage).

My current development Turtle grammar is attached - it passes the Turtle
test suite but I don't consider it finished.  It's extracted from SPARQL
so it allows dots inside qnames.

== #1 : Tokenizing

I'm using javacc - it generates LL parsers with a separate tokenizer.
So it works by reading the input stream, identifying tokens as the
stream is read.  Like flex/lex - this is the common way.

It means that tokens are identified without regard to where in the
production rules the parser is currently.

Javacc does support context-sensitive tokenizing (it can switch between
different tokenizing sets under parser control).  I don't mind using
the tricky bits but in SPARQL there was a big push not to go there.

Tokens are whitespace sensitive; the parser does not need to be (other
than that whitespace splits terminals).

The tricky case in N3 is language codes.  I did it in Turtle and SPARQL
by using the @ so a language code token includes the @ ; the token for
language code and for "a" are different.
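
A minimal sketch of that idea, in Python rather than javacc.  The token
names and patterns are simplified assumptions for illustration, not the
attached grammar: the point is only that once "@" is part of the
language-code token, a single context-free token table can tell it apart
from the keyword "a".

```python
import re

# Hypothetical, simplified token table (not the real Turtle/SPARQL one).
# Order matters: QNAME is tried before KW_A so "a:b" is a qname.
TOKEN_SPEC = [
    ("LANGTAG", r"@[a-z]+(?:-[a-z0-9]+)*"),   # the "@" belongs to the token
    ("STRING",  r'"[^"\\\n]*"'),               # no escapes, for brevity
    ("IRI",     r"<[^>\s]*>"),
    ("QNAME",   r"(?:[A-Za-z]\w*)?:\w*"),
    ("KW_A",    r"a\b"),                       # the rdf:type shorthand
    ("DOT",     r"\."),
    ("WS",      r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))


def tokenize(text):
    """Split text into (kind, lexeme) pairs with no parser feedback."""
    pos, out = 0, []
    while pos < len(text):
        m = MASTER.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected input at offset {pos}: {text[pos]!r}")
        if m.lastgroup != "WS":          # whitespace only separates tokens
            out.append((m.lastgroup, m.group()))
        pos = m.end()
    return out
```

With this table, `tokenize('<s> a "chat"@a .')` yields a KW_A token for
the first "a" and a LANGTAG token "@a" for the second, without any
tokenizer state.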

In n3.n3, I see:

langcode	cfg:matches  	"[a-z]+(-[a-z0-9]+)*";
		cfg:canStartWith 	"a".

and don't see how to handle this without using context-sensitive
tokenizing (a special state for langcodes).  javacc has such a feature;
flex does as well, but it makes life just plain tricky.  The alternative
is to have some bland token for any sequence of bare letters without a
":" and test elsewhere.
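
A rough sketch of the "bland token" alternative, again in Python and
with made-up token names.  The tokenizer emits one WORD token for any
bare-letter sequence (using the langcode pattern from n3.n3), and a
later step decides what it means from the token before it.  The rule
"a WORD directly after @ is a langcode" is an assumption of this sketch,
not something taken from cwm.

```python
import re

# One context-free token table; WORD covers both "a" and language codes.
TOKEN = re.compile(
    r'(?P<STRING>"[^"\\\n]*")'
    r"|(?P<AT>@)"
    r"|(?P<IRI><[^>\s]*>)"
    r"|(?P<WORD>[a-z]+(?:-[a-z0-9]+)*)"
    r"|(?P<DOT>\.)"
    r"|(?P<WS>\s+)"
)


def classify(text):
    """Tokenize, then resolve each WORD by looking at the previous token."""
    pos, prev, out = 0, None, []
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected input at offset {pos}: {text[pos]!r}")
        kind = m.lastgroup
        if kind == "WORD":
            if prev == "AT":             # directly after "@": a langcode
                kind = "LANGCODE"
            elif m.group() == "a":       # elsewhere only the keyword is legal
                kind = "KW_A"
            else:
                raise ValueError(f"bare word not allowed here: {m.group()!r}")
        if kind != "WS":
            out.append((kind, m.group()))
        prev = m.lastgroup               # raw kind, including WS
        pos = m.end()
    return out
```

The cost is that the check has moved out of the token table into a
second step, which is the "test elsewhere" trade-off described above.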

How does the N3 parser in cwm tell "a" (as in "rdf:type") apart from "a"
the langcode?

Aside: I see, in n3.n3:

# - @keywords affects tokenizing

Isn't this the same thing as typedefs in C, where the token tables
change as the language is parsed? I don't know how to handle this in
javacc or antlr.

== #2 : align whitespace between N3 and Turtle

This is not legal Turtle:

<a><b><c> .

[4]	triples 	::= 	subject ws+ predicateObjectList

because it has no whitespace between the subject/predicate.  But it is
N3 and is reasonable RDF.  It also means the parser itself can't be
whitespace independent, leaving whitespace handling to the lexer to
merely split terminals as necessary.
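
To illustrate the whitespace-independent arrangement, here is a small
Python sketch (hypothetical names, IRI-only triples).  The lexer drops
whitespace and the "parser" only looks at token kinds, so both spellings
produce the same triple.

```python
import re

# Whitespace is a token the lexer throws away; it is never seen by the parser.
TOKEN = re.compile(r"(?P<IRI><[^>\s]*>)|(?P<DOT>\.)|(?P<WS>\s+)")


def tokens(text):
    pos, out = 0, []
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected input at offset {pos}: {text[pos]!r}")
        if m.lastgroup != "WS":
            out.append((m.lastgroup, m.group()))
        pos = m.end()
    return out


def parse_triple(text):
    """Accept exactly: IRI IRI IRI DOT, with no ws+ in the rule itself."""
    toks = tokens(text)
    if [kind for kind, _ in toks] != ["IRI", "IRI", "IRI", "DOT"]:
        raise ValueError("expected: subject predicate object .")
    return tuple(lexeme for _, lexeme in toks[:3])
```

Here `parse_triple("<a><b><c> .")` and `parse_triple("<a> <b> <c> .")`
return the same result; a rule with an explicit ws+ between subject and
predicate would reject the first.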

== #3 : \u escapes

Long form: (do not do it like SPARQL!)

Short form:
The problem is what happens if a significant character is used.

SPARQL example:  ?ac\u0020

The parser has to look at this twice: once to see a variable, then
validate it again after escape processing to check no illegal
characters have appeared.  In this case, a space has been put into the
variable name, which is illegal.
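
The double check can be sketched like this in Python.  The variable-name
patterns are a deliberately simplified assumption (word characters
only), not the SPARQL grammar's VARNAME production.

```python
import re

# Pass 1: the raw token may contain \uXXXX escapes as well as name characters.
RAW_VAR = re.compile(r"\?(?:\w|\\u[0-9A-Fa-f]{4})+")
# Pass 2: after unescaping, only name characters may remain.
CLEAN_VAR = re.compile(r"\?\w+")
U_ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})")


def unescape(text):
    """Replace each \\uXXXX with the character it denotes."""
    return U_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)


def is_valid_variable(raw):
    if RAW_VAR.fullmatch(raw) is None:       # first look: is it a variable?
        return False
    # second look: did escape processing introduce an illegal character?
    return CLEAN_VAR.fullmatch(unescape(raw)) is not None
```

With these patterns, the raw text `?ac\u0020` passes the first check but
fails the second, because the escape turns into a space.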

Turtle avoids this by:
1/ Having several .character rules
2/ Not allowing \u in places like qnames, where the full range of
Unicode is otherwise allowed, which isn't internationalization-friendly.
3/ Having ranges like
[39]	ucharacter 	::= 	( character - #x3E ) | '\>'
which mix a parser rule and a token character.

So \u003E puts a > into a URI because the character rule accepts \u003E.
There is text to further modify \u legality but (and this is the SPARQL
problem) the rules can't be expressed in a formal grammar.

Not sure where N3 is with \u escapes.


Define processing as:

1/ apply \u escaping at the lowest level - applies to the input stream
so by the end of this, the parser does not see \u as an escape sequence.
\u works everywhere 

At this point we have a stream of characters or UTF-8 depending on your
toolkit technology.

2/ Tokenizing - to create a stream of tokens (usually done lazily)

3/ Parsing - apply the grammar

then \u processing does not need special text or special cases.
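
A sketch of the three stages in Python, under some assumptions of mine:
only four-digit \u escapes are handled, a backslash followed by anything
other than "u" is passed through unchanged, and the grammar is the toy
"IRI IRI IRI ." used for illustration.

```python
import re
import string


def unescape_chars(chars):
    """Stage 1: lazily replace \\uXXXX in the raw character stream."""
    it = iter(chars)
    for ch in it:
        if ch != "\\":
            yield ch
            continue
        nxt = next(it, "")
        if nxt != "u":                   # not a \u escape: pass through as-is
            yield ch
            if nxt:
                yield nxt
            continue
        digits = "".join(next(it, "") for _ in range(4))
        if len(digits) != 4 or not all(c in string.hexdigits for c in digits):
            raise ValueError(f"bad \\u escape: \\u{digits}")
        yield chr(int(digits, 16))


TOKEN = re.compile(r"(?P<IRI><[^>\s]*>)|(?P<DOT>\.)|(?P<WS>\s+)|(?P<BAD>.)")


def tokenize(text):
    """Stage 2: lazily produce tokens; it never sees a \\u escape."""
    for m in TOKEN.finditer(text):
        if m.lastgroup == "BAD":
            raise ValueError(f"unexpected character {m.group()!r}")
        if m.lastgroup != "WS":
            yield (m.lastgroup, m.group())


def parse_triple(raw):
    """Stage 3: apply the (toy) grammar to the token stream."""
    toks = list(tokenize("".join(unescape_chars(raw))))
    if [kind for kind, _ in toks] != ["IRI", "IRI", "IRI", "DOT"]:
        raise ValueError("expected: IRI IRI IRI .")
    return tuple(lexeme for _, lexeme in toks[:3])
```

In this arrangement an escaped ">" inside an IRI is simply a ">" by the
time the tokenizer runs, so input such as `<a\u003Eb> <b> <c> .` is
rejected by the ordinary token rules with no \u-specific wording.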

== Odds and Ends from n3.n3:

    explicituri 	cfg:matches 	"<[^>]*>";

That includes newlines inside IRIs
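
This is easy to confirm with any regex engine where a negated character
class matches newlines; a quick check in Python.  The tighter pattern is
just one possible alternative I made up, not a proposal from n3.n3.

```python
import re

EXPLICIT_URI = re.compile(r"<[^>]*>")      # pattern as given in n3.n3
NO_WHITESPACE = re.compile(r"<[^>\s]*>")   # hypothetical tighter variant


def accepts(pattern, text):
    """True if the whole text is matched as a single explicituri token."""
    return pattern.fullmatch(text) is not None


multi_line = "<http://example/a\nb>"
print(accepts(EXPLICIT_URI, multi_line))   # newline inside the IRI is accepted
print(accepts(NO_WHITESPACE, multi_line))  # rejected once whitespace is excluded
```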

The qname name token says (with the \u stuff removed):


which makes the ":" optional.

