- From: Seaborne, Andy <andy.seaborne@hp.com>
- Date: Mon, 19 Jun 2006 16:12:03 +0100
- To: <public-cwm-talk@w3.org>
- Message-ID: <DF5E364A470421429AE6DC96979A4F6FAFF5E6@sdcexc04.emea.cpqcorp.net>
Some experiences while trying to write a parser for Turtle: I had hoped
to have a combined N3/Turtle parser with a switch to restrict to Turtle.
This is beginning to look hard/impossible because of #1 and #2 (well -
nothing is impossible, it just means the work has to be moved out of the
parser into a later processing stage).
My current development Turtle grammar is attached - it passes the Turtle
test suite but I don't consider it finished. It's extracted from SPARQL
so it allows dots inside qnames.
== #1 : Tokenizing
I'm using javacc - it generates LL parsers with a separate tokenizer.
So it works by reading the input stream, identifying tokens as the
stream is read. Like flex/lex - this is the common way.
It means that tokens are identified without regard to where in the
production rules the parser is currently.
Javacc does support context-sensitive tokenizing (it can switch between
different tokenizing sets under parser control). I don't mind using
the tricky bits but in SPARQL there was a big push not to go there.
Tokens are whitespace sensitive; the parser does not need to be (other
than that whitespace splits terminals).
The tricky case in N3 is language codes. I did it in Turtle and SPARQL
by using the @ so a language code token includes the @ ; the token for
language code and for "a" are different.
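
As a rough illustration of that approach (a sketch only, not the javacc
grammar; the token names and the simplified string/IRI patterns are
assumptions made for this example), a context-free tokenizer can keep the
two apart simply because the language tag token carries its "@":

```python
import re

# Hypothetical token set. LANGTAG includes the leading "@", so it can never
# collide with the bare keyword "a". STRING and IRI are deliberately simplified.
TOKEN_RE = re.compile(
    r'(?P<LANGTAG>@[a-z]+(?:-[a-z0-9]+)*)'
    r'|(?P<IRI><[^<>\s]*>)'
    r'|(?P<STRING>"[^"\\\n]*")'
    r'|(?P<KW_A>a(?![A-Za-z0-9_:-]))'
    r'|(?P<WS>\s+)'
)

def tokenize(text):
    """Split text into (kind, lexeme) pairs without any parser feedback."""
    pos, out = 0, []
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character at offset {pos}")
        if m.lastgroup != "WS":          # whitespace only separates tokens
            out.append((m.lastgroup, m.group()))
        pos = m.end()
    return out
```

Here tokenize('"chat"@fr') yields a STRING followed by a LANGTAG, while
the "a" in tokenize('<doc> a <Doc>') comes out as KW_A, with no lexer
state switching involved.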
In n3.n3, I see:
langcode cfg:matches "[a-z]+(-[a-z0-9]+)*";
cfg:canStartWith "a".
and don't see how to handle this without using context-sensitive
tokenizing (a special state for langcodes). javacc has such a feature;
flex does as well, but it makes life just plain tricky. The alternative
is to have some bland token for any sequence of bare letters without a
":" and test elsewhere.
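
To make the clash and the "bland token" alternative concrete, here is a
minimal sketch. The langcode pattern is the one quoted from n3.n3; the
function name, its flag argument and the returned labels are hypothetical,
standing in for whatever the parser would do once it knows the context:

```python
import re

# Pattern quoted from n3.n3 for langcode.
LANGCODE_RE = re.compile(r"[a-z]+(-[a-z0-9]+)*")

def classify_bare(word, follows_string_at):
    """Decide what a bland bare-word token means, given parser context.

    follows_string_at is a hypothetical flag the parser would set when the
    previous tokens were a string literal and "@".
    """
    if follows_string_at:
        if LANGCODE_RE.fullmatch(word):
            return "LANGCODE"
        raise ValueError(f"not a language code: {word!r}")
    if word == "a":
        return "KW_A"                    # shorthand for rdf:type
    return "BAREWORD"
```

The single letter "a" matches the langcode pattern in full, so the lexer
alone cannot decide; classify_bare("a", True) and classify_bare("a", False)
give different answers, and that decision has moved out of the tokenizer.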
How does the N3 parser in cwm tell "a" (as in "rdf:type") apart from "a"
the langcode?
[[
Aside: I see, in n3.n3:
# - @keywords affects tokenizing
Isn't this the same thing as typedef's in C where the token tables
change as the language is parsed? I don't know how to handle this in
javacc nor antlr.
]]
== #2 : align whitespace between N3 and Turtle
This is not legal Turtle:
<a><b><c> .
by:
[4] triples ::= subject ws+ predicateObjectList
because it has no whitespace between the subject/predicate. But it is
N3 and is reasonable RDF. It also means the parser itself can't be
whitespace independent, leaving whitespace handling to the lexer to
merely split terminals as necessary.
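
A small sketch of the whitespace-independent arrangement, assuming a
cut-down token set (IRIs and "." only) and hypothetical function names:
the lexer discards whitespace, and the triple rule has no ws terms at all.

```python
import re

# Simplified tokens: IRI refs, the statement-ending dot, and whitespace.
TOKEN_RE = re.compile(r"(?P<IRI><[^<>\s]*>)|(?P<DOT>\.)|(?P<WS>\s+)")

def tokenize(text):
    """Return (kind, lexeme) pairs; whitespace only splits terminals."""
    pos, out = 0, []
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character at offset {pos}")
        if m.lastgroup != "WS":
            out.append((m.lastgroup, m.group()))
        pos = m.end()
    return out

def parse_triple(text):
    """Rule with no whitespace terms: triple ::= IRI IRI IRI DOT."""
    toks = tokenize(text)
    kinds = [kind for kind, _ in toks]
    if kinds != ["IRI", "IRI", "IRI", "DOT"]:
        raise ValueError(f"not a simple triple: {kinds}")
    return tuple(lexeme for _, lexeme in toks[:3])
```

With this split, parse_triple("<a><b><c> .") and
parse_triple("<a> <b> <c> .") produce the same triple, which is the N3
behaviour; a grammar with an explicit ws+ between subject and predicate
cannot be handled this way.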
== #3 \u escapes
Long form: (do not do it like SPARQL!)
http://lists.w3.org/Archives/Public/public-rdf-dawg/2006JanMar/0443.html
Short form:
The problem is what happens if a significant character is used.
SPARQL example: ?ac\u0020
The parser has to look at this twice: once to see a variable, then
validate it again after escape processing to check no illegal
characters have appeared. In this case, a space has been put into
the variable name, which is illegal.
Turtle avoids this by:
1/ Having several .character rules
2/ Not allowing \u in places like qnames where the full range of Unicode
is otherwise allowed, which isn't internationalization-friendly.
3/ Having ranges like
[39] ucharacter ::= ( character - #x3E ) | '\>'
which mix a parser rule and a token character.
So \u003E puts a > into a URI because the character rule accepts \u003E
There is text to further modify \u legality but (and this is the SPARQL
problem) the rules can't be expressed in a formal grammar.
Not sure where N3 is with \u escapes.
Suggestion:
Define processing as:
1/ apply \u escaping at the lowest level - applies to the input stream
so by the end of this, the parser does not see \u as an escape sequence.
\u works everywhere
At this point we have a stream of characters or UTF-8 depending on your
toolkit technology.
2/ Tokenizing - to create a stream of tokens (usually done lazily)
3/ Parsing - apply the grammar
then \u processing does not need special text or special cases.
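
A minimal sketch of stages 1 and 2, under stated assumptions: only the
four-digit \uXXXX form is handled, a backslash that is itself escaped is
not treated specially, and the variable token pattern is a simplified
stand-in rather than the real SPARQL rule. All names are hypothetical.

```python
import re

_U_ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})")

def unescape_stream(text):
    """Stage 1: replace every \\uXXXX in the raw input with its character."""
    return _U_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)

# Stage 2 token set (simplified): variables and whitespace only.
TOKEN_RE = re.compile(r"(?P<VAR>\?[A-Za-z_][A-Za-z0-9_]*)|(?P<WS>\s+)")

def tokenize(text):
    """Tokenize the already-unescaped character stream."""
    chars = unescape_stream(text)
    pos, out = 0, []
    while pos < len(chars):
        m = TOKEN_RE.match(chars, pos)
        if m is None:
            raise ValueError(f"unexpected character at offset {pos}")
        if m.lastgroup != "WS":
            out.append((m.lastgroup, m.group()))
        pos = m.end()
    return out
```

Because the tokenizer never sees the escape, the input ?ac\u0020?b is
just the characters "?ac ?b": two variables separated by a space, with no
second validation pass over the variable name.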
== Odds and Ends from n3.n3:
explicituri cfg:matches "<[^>]*>";
That includes newlines inside IRIs
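
A quick check of that claim, using the pattern as quoted (the helper name
is made up for the example): a negated character class such as [^>] also
matches newline characters.

```python
import re

# explicituri pattern as quoted from n3.n3.
EXPLICITURI_RE = re.compile(r"<[^>]*>")

def matches_explicituri(text):
    """True if the whole text is accepted by the explicituri pattern."""
    return EXPLICITURI_RE.fullmatch(text) is not None
```

So matches_explicituri("<http://example.org/\nfoo>") is accepted even
though the IRI spans two lines.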
The qname name token says (with the \u stuff removed):
(([A-Z_a-z][\\-0-9A-Z_a-z]*)?:)?[A-Z_a-z][\\-0-9A-Z_a-z]*
which makes the ":" optional.
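
This can be checked directly. The sketch assumes the ASCII-only form of the
pattern with every letter range read as a-z, and with the doubled backslash
of the N3 string written as a single regex escape; the helper name is
hypothetical.

```python
import re

# ASCII-only reading of the qname pattern (an assumption, see above).
QNAME_RE = re.compile(r"(([A-Z_a-z][\-0-9A-Z_a-z]*)?:)?[A-Z_a-z][\-0-9A-Z_a-z]*")

def is_qname(text):
    """True if the whole text is accepted by the qname pattern."""
    return QNAME_RE.fullmatch(text) is not None
```

Both is_qname("rdf:type") and is_qname("type") hold, and so does
is_qname("a"), which means a bare word with no colon is a legal qname as
far as the token pattern is concerned.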
Attachments
- text/html attachment: turtle.html
Received on Monday, 19 June 2006 15:12:17 UTC