- From: Dan Connolly <connolly@w3.org>
- Date: Mon, 19 Jun 2006 11:20:25 -0500
- To: "Seaborne, Andy" <andy.seaborne@hp.com>
- Cc: public-cwm-talk@w3.org, navyrain@navyrain.net
On Mon, 2006-06-19 at 16:12 +0100, Seaborne, Andy wrote:
> Some experiences while trying to write a parser for Turtle: I had hoped
> to have a combined N3/Turtle parser with a switch to restrict to Turtle.
> This is beginning to look hard/impossible because of #1 and #2 (well -
> nothing is impossible, it just means the work has to be moved out of the
> parser into a late reprocessing stage).

I think that this goal is important enough that we should
change the languages to make it feasible...

> My current development Turtle grammar is attached - it passes the Turtle
> test suite but I don't consider it finished. It's extracted from SPARQL
> so it allows dots inside qnames.
>
> == #1 : Tokenizing
[...]
>
> In n3.n3, I see:
>
> langcode cfg:matches "[a-z]+(-[a-z0-9]+)*";
>     cfg:canStartWith "a".

I just hit that issue in the derivative of n3.n3 that I'm working on.
http://www.w3.org/2000/10/swap/grammar/notation3.bnf

I changed it so that the @ is part of the langcode token:

[33] langcode ::= "@" [a-z]+ ("-" [a-z0-9]+)*

I haven't convinced timbl to move away from n3.n3 as the "truth"
yet, but the discussion has started, and I'm getting there.

My current target is a JavaScript parser based on
http://www.navyrain.net/compilergeneratorinjavascript/

I've got python code that converts the .bnf to turtle, then
runs some N3 rules on the turtle to simplify the grammar,
then reads the result and simplifies it to a JSON structure
and prints that out. The Makefile has the details...

[[
ebnf: notation3.rdf ebnf.rdf notation3.json

notation3.json: notation3-bnf.n3 gramLL1.py
	PYTHONPATH=$(HOME)/lib/python:../.. $P gramLL1.py notation3-bnf.n3 >$@

CHATTY=0
notation3-bnf.n3: notation3.n3 ebnf2bnf.n3 first_follow.n3
	$P $C notation3.n3 ebnf2bnf.n3 --chatty=$(CHATTY) \
		--think --data >$@

notation3.n3: notation3.bnf ebnf2turtle.py
	$P ebnf2turtle.py notation3.bnf n3 'http://www.w3.org/2000/10/swap/grammar/notation3#' >$@

ebnf.rdf: ebnf.n3

notation3.rdf: notation3.n3
]]
 -- http://www.w3.org/2000/10/swap/grammar/Makefile

> How does the N3 parser in cwm tell "a" (as in "rdf:type") apart from "a"
> the langcode?

With a hand-crafted python lexer/parser that I want to get rid of.
http://www.w3.org/2000/10/swap/notation3.py

> [[
> Aside: I see, in n3.n3:
>
> # - @keywords affects tokenizing
>
> Isn't this the same thing as typedef's in C where the token tables
> change as the language is parsed? I don't know how to handle this in
> javacc nor antlr.
> ]]

I haven't worked all the way thru that issue yet either.

> == #2 : align whitespace between N3 and Turtle
>
> This is not legal Turtle:
>
> <a><b><c> .
>
> by:
> [4] triples ::= subject ws+ predicateObjectList
>
> because it has no whitespace between the subject/predicate. But it is
> N3 and is reasonable RDF. It also means the parser itself can't be
> whitespace independent, leaving whitespace handling to the lexer to
> merely split terminals as necessary.

Let's please change turtle there. Dave?

> == #3 \u escapes
>
> Long form: (do not do it like SPARQL!)
> http://lists.w3.org/Archives/Public/public-rdf-dawg/2006JanMar/0443.html

I'm ignoring \u escapes at least for now, sorta hoping they'll
go away ;-)

In that sense, I find it appealing to treat \u escapes in a layer
that's separate from lexical analysis and parsing...

> Suggestion:
>
> Define processing as:
>
> 1/ apply \u escaping at the lowest level - applies to the input stream
> so by the end of this, the parser does not see \u as an escape sequence.
> \u works everywhere
>
> At this point we have a stream of characters or UTF-8 depending on your
> toolkit technology.
>
> 2/ Tokenizing - to create a stream of tokens (usually done lazily)
>
> 3/ Parsing - apply the grammar
>
> then \u processing does not need special text or special cases.
>
>
> == Odd and Ends from n3.n3:
>
> explicituri cfg:matches "<[^>]*>";
>
> That includes newlines inside IRIs

I changed that in notation3.bnf (and hence notation3.n3 and
notation3.rdf) just the other day.

> The qname name token says (removed the \u stuff:)
>
> (([A-Z_a-z][\\-0-9A-Z_a-z]*)?:)?[A-Z_a-][\\-0-9A-Z_a-]*
>
> which makes the ":" optional.

Yes, that's closely connected to the @keywords issues.
In N3, you can write terms without the colons:

  @keywords is, of a.
  @prefix : <#>.
  sky color blue.

I can't remember if this is documented... ah yes... see the section
"Getting rid of the leading ":" with @keywords" of
http://www.w3.org/2000/10/swap/doc/Shortcuts.html

-- 
Dan Connolly, W3C http://www.w3.org/People/Connolly/
D3C2 887B 0F92 6005 C541 0875 0F91 96DE 6E52 C29E
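A rough Python sketch of the layering discussed under #1, #2 and #3 above: decode \u escapes over the raw input first, then tokenize with the "@" folded into the langcode token, so that a bare "a" and the langcode "@a" never compete. The function names, token names, and the cut-down token set are assumptions made for illustration; this is neither the cwm lexer nor the notation3.bnf grammar, and it does not handle directives such as @prefix or @keywords.

```python
import re

# Step 1: decode \uXXXX and \UXXXXXXXX on the raw input, before tokenizing.
# Caveat of this layering: the lexer never sees a \u escape at all, so an
# escaped backslash followed by "u0041" would also be decoded here.
_U_ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})")


def decode_u_escapes(text):
    """Replace every \\u / \\U escape with the character it names."""
    return _U_ESCAPE.sub(
        lambda m: chr(int(m.group(1) or m.group(2), 16)), text)


# Step 2: tokenizing. Hypothetical, simplified token set. Whitespace only
# separates terminals, so the grammar itself never has to mention it.
_TOKEN = re.compile(r"""
    (?P<ws>\s+)
  | (?P<explicituri><[^>\s]*>)
  | (?P<string>"[^"\\\n]*(?:\\.[^"\\\n]*)*")
  | (?P<langcode>@[a-z]+(?:-[a-z0-9]+)*)
  | (?P<qname>(?:[A-Z_a-z][-0-9A-Z_a-z]*)?:[A-Z_a-z][-0-9A-Z_a-z]*)
  | (?P<a>a(?![-0-9A-Z_a-z:]))
  | (?P<punct>[.;,])
""", re.VERBOSE)


def tokenize(text):
    """Yield (token_name, lexeme) pairs, skipping whitespace."""
    pos = 0
    while pos < len(text):
        m = _TOKEN.match(text, pos)
        if m is None:
            raise ValueError(
                f"unexpected character at offset {pos}: {text[pos]!r}")
        pos = m.end()
        if m.lastgroup != "ws":
            yield m.lastgroup, m.group()


def lex(source):
    """Steps 1 and 2 chained; step 3 (parsing) would consume this list."""
    return list(tokenize(decode_u_escapes(source)))
```

With whitespace handled entirely in the lexer, `lex('<a><b><c> .')` splits into three explicituri tokens and a "." even though there is no whitespace between subject and predicate, which is the behaviour asked for under #2.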
Received on Monday, 19 June 2006 16:20:36 UTC