- From: Dan Connolly <connolly@w3.org>
- Date: Mon, 19 Jun 2006 11:20:25 -0500
- To: "Seaborne, Andy" <andy.seaborne@hp.com>
- Cc: public-cwm-talk@w3.org, navyrain@navyrain.net
On Mon, 2006-06-19 at 16:12 +0100, Seaborne, Andy wrote:
> Some experiences while trying to write a parser for Turtle: I had hoped
> to have a combined N3/Turtle parser with a switch to restrict to Turtle.
> This is beginning to look hard/impossible because of #1 and #2 (well -
> nothing is impossible, it just means the work has to be moved out of the
> parser into a later processing stage).
I think that this goal is important enough that we should change
the languages to make it feasible...
> My current development Turtle grammar is attached - it passes the Turtle
> test suite but I don't consider it finished. It's extracted from SPARQL
> so it allows dots inside qnames.
>
> == #1 : Tokenizing
[...]
>
> In n3.n3, I see:
>
> langcode cfg:matches "[a-z]+(-[a-z0-9]+)*";
> cfg:canStartWith "a".
I just hit that issue in the derivative of n3.n3 that I'm working on.
http://www.w3.org/2000/10/swap/grammar/notation3.bnf
I changed it so that the @ is part of the langcode token:
[33] langcode ::= "@" [a-z]+ ("-" [a-z0-9]+)*
I haven't convinced timbl to move away from n3.n3 as the "truth" yet,
but the discussion has started, and I'm getting there.
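As a rough illustration of why folding the "@" into the langcode token helps, here is a minimal Python sketch of a regex tokenizer. Only the LANGCODE pattern follows rule [33] above; the other token names and patterns (QNAME, A, STRING, DOT) are simplified assumptions made up for this example, not taken from notation3.bnf. A real grammar would also need to rank directives such as @prefix ahead of LANGCODE, which this sketch leaves out.

```python
import re

# Hypothetical token table. Only LANGCODE mirrors rule [33]; the rest are
# simplified stand-ins so the example is self-contained.
TOKEN_SPEC = [
    ("LANGCODE", r"@[a-z]+(?:-[a-z0-9]+)*"),
    ("STRING", r'"[^"\\\n]*"'),  # no escape handling in this sketch
    ("QNAME", r"(?:[A-Za-z_][A-Za-z0-9_-]*)?:[A-Za-z_][A-Za-z0-9_-]*"),
    ("A", r"a(?![A-Za-z0-9_:-])"),  # bare "a", i.e. rdf:type
    ("DOT", r"\."),
    ("WS", r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))


def tokenize(text):
    """Return a list of (token_name, lexeme) pairs, dropping whitespace."""
    pos = 0
    tokens = []
    while pos < len(text):
        m = MASTER.match(text, pos)
        if m is None:
            raise SyntaxError(f"unexpected character {text[pos]!r} at offset {pos}")
        if m.lastgroup != "WS":
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens
```

Because the "@" is inside the token, a language tag "a" and the keyword "a" never compete for the same input: tokenize('"a"@a a') yields a STRING, a LANGCODE "@a", and then the keyword A.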
My current target is a JavaScript parser based on
http://www.navyrain.net/compilergeneratorinjavascript/
I've got python code that converts the .bnf to turtle,
then runs some N3 rules on the turtle to simplify the
grammar, then reads the result and simplifies it
to a JSON structure and prints that out.
The Makefile has the details...
[[
ebnf: notation3.rdf ebnf.rdf notation3.json

notation3.json: notation3-bnf.n3 gramLL1.py
	PYTHONPATH=$(HOME)/lib/python:../.. $P gramLL1.py notation3-bnf.n3 >$@

CHATTY=0

notation3-bnf.n3: notation3.n3 ebnf2bnf.n3 first_follow.n3
	$P $C notation3.n3 ebnf2bnf.n3 --chatty=$(CHATTY) \
		--think --data >$@

notation3.n3: notation3.bnf ebnf2turtle.py
	$P ebnf2turtle.py notation3.bnf n3 'http://www.w3.org/2000/10/swap/grammar/notation3#' >$@

ebnf.rdf: ebnf.n3

notation3.rdf: notation3.n3
]]
-- http://www.w3.org/2000/10/swap/grammar/Makefile
> How does the N3 parser in cwm tell "a" (as in "rdf:type") apart from "a"
> the langcode?
With a hand-crafted python lexer/parser that I want to get rid of.
http://www.w3.org/2000/10/swap/notation3.py
> [[
> Aside: I see, in n3.n3:
>
> # - @keywords affects tokenizing
>
> Isn't this the same thing as typedef's in C where the token tables
> change as the language is parsed? I don't know how to handle this in
> javacc nor antlr.
> ]]
I haven't worked all the way thru that issue yet either.
> == #2 : align whitespace between N3 and Turtle
>
> This is not legal Turtle:
>
> <a><b><c> .
>
> by:
> [4] triples ::= subject ws+ predicateObjectList
>
> because it has no whitespace between the subject/predicate. But it is
> N3 and is reasonable RDF. It also means the parser itself can't be
> whitespace independent, leaving whitespace handling to the lexer to
> merely split terminals as necessary.
Let's please change turtle there. Dave?
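To make the point about whitespace concrete, here is a small Python sketch in which the lexer alone splits terminals, so the parser never sees whitespace. The token names (IRI, DOT) and the functions lex_triple and parse_triple are hypothetical and cover only the single statement form discussed above; they are not taken from either grammar.

```python
import re

# Each token may be preceded by optional whitespace; the IRI token is
# self-delimiting, so no whitespace is required between adjacent IRIs.
TOKEN = re.compile(r"\s*(?:(?P<IRI><[^<>\s]*>)|(?P<DOT>\.))")


def lex_triple(text):
    """Split text into (kind, lexeme) pairs; whitespace is discarded here."""
    text = text.rstrip()
    pos = 0
    tokens = []
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None:
            raise SyntaxError(f"cannot tokenize at offset {pos}")
        tokens.append((m.lastgroup, m.group(m.lastgroup)))
        pos = m.end()
    return tokens


def parse_triple(tokens):
    """Parse 'subject predicate object .' from whitespace-free tokens."""
    kinds = [kind for kind, _ in tokens]
    if kinds != ["IRI", "IRI", "IRI", "DOT"]:
        raise SyntaxError(f"expected three IRIs and a dot, got {kinds}")
    s, p, o = (lexeme[1:-1] for _, lexeme in tokens[:3])
    return (s, p, o)
```

With this split of responsibilities, "<a><b><c> ." and "<a> <b> <c> ." produce the same token stream, and the grammar needs no ws+ between subject and predicate.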
> == #3 \u escapes
>
> Long form: (do not do it like SPARQL!)
> http://lists.w3.org/Archives/Public/public-rdf-dawg/2006JanMar/0443.html
I'm ignoring \u escapes at least for now, sorta hoping they'll go
away ;-)
In that sense, I find it appealing to treat \u escapes in a layer
that's separate from lexical analysis and parsing...
> Suggestion:
>
> Define processing as:
>
> 1/ apply \u escaping at the lowest level - applies to the input stream
> so by the end of this, the parser does not see \u as an escape sequence.
> \u works everywhere
>
> At this point we have a stream of characters or UTF-8 depending on your
> toolkit technology.
>
> 2/ Tokenizing - to create a stream of tokens (usually done lazily)
>
> 3/ Parsing - apply the grammar
>
> then \u processing does not need special text or special cases.
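The layering suggested above can be sketched in Python as a pass over the raw text that runs before any tokenizer. This is a minimal sketch under stated assumptions: it handles only the four-hex-digit \uXXXX form, the function name unescape_u is made up for this example, and it does not try to treat a doubled backslash before the "u" specially, which a full design would have to decide on.

```python
import re

# Step 1 of the pipeline: replace \uXXXX in the input stream itself.
# Assumption: exactly four hex digits follow the "\u".
_U_ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})")


def unescape_u(text):
    """Return text with every \\uXXXX sequence replaced by its character."""
    return _U_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)
```

Since the replacement happens before tokenizing, an escape can stand for any character, including delimiters: unescape_u(r"\u003ca\u003e") gives "<a>", which the tokenizer then sees as an ordinary IRI reference.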
>
>
> == Odds and Ends from n3.n3:
>
> explicituri cfg:matches "<[^>]*>";
>
> That includes newlines inside IRIs
I changed that in notation3.bnf (and hence
notation3.n3 and notation3.rdf ) just the other day.
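A quick Python check shows the behaviour being discussed. The first pattern is the one quoted from n3.n3; the second is only an assumed tightening written for this example (excluding whitespace as well as ">"), and is not necessarily the pattern now in notation3.bnf.

```python
import re

# Pattern as quoted from n3.n3: any character except ">" is allowed,
# and that includes newlines.
LOOSE_URI = re.compile(r"<[^>]*>")

# Hypothetical stricter variant for illustration only.
STRICT_URI = re.compile(r"<[^>\s]*>")

sample = "<http://example.org/a\nb>"
loose_accepts = LOOSE_URI.fullmatch(sample) is not None
strict_accepts = STRICT_URI.fullmatch(sample) is not None
```

Here loose_accepts is True and strict_accepts is False, while an IRI reference without embedded whitespace matches both patterns.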
> The qname name token says (removed the \u stuff:)
>
> (([A-Z_a-z][\\-0-9A-Z_a-z]*)?:)?[A-Z_a-][\\-0-9A-Z_a-]*
>
> which makes the ":" optional.
Yes, that's closely connected to the @keywords issues. In
N3, you can write terms without the colons:
@keywords is, of a.
@prefix : <#>.
sky color blue.
I can't remember if this is documented... ah yes... see
section
Getting rid of the leading ":" with @keywords
of http://www.w3.org/2000/10/swap/doc/Shortcuts.html
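The effect of @keywords on tokenizing can be sketched in Python as a classification step that depends on the current keyword set. This is a rough reading of the Shortcuts section, not the cwm implementation; the function name classify_word and the ("keyword" / "qname") result shape are assumptions made for this example.

```python
def classify_word(word, keywords):
    """Classify one bare token once an @keywords declaration is in effect.

    keywords is the set of words declared by @keywords. Words outside that
    set are read as local names in the default (":") namespace.
    """
    if word.startswith("@"):
        # An explicit "@" always marks a keyword.
        return ("keyword", word[1:])
    if ":" in word:
        # Already a qname; leave it alone.
        return ("qname", word)
    if word in keywords:
        return ("keyword", word)
    # Bare word that is not a declared keyword: supply the leading colon.
    return ("qname", ":" + word)


declared = {"is", "of", "a"}
statement = [classify_word(w, declared) for w in "sky color blue".split()]
```

For the example above, statement is three qnames (":sky", ":color", ":blue"), while classify_word("a", declared) is still the keyword. The keyword set has to be updated as each @keywords directive is parsed, which is why the token tables change mid-document.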
--
Dan Connolly, W3C http://www.w3.org/People/Connolly/
D3C2 887B 0F92 6005 C541 0875 0F91 96DE 6E52 C29E
Received on Monday, 19 June 2006 16:20:36 UTC