- From: Seaborne, Andy <andy.seaborne@hp.com>
- Date: Mon, 19 Jun 2006 16:12:03 +0100
- To: <public-cwm-talk@w3.org>
- Message-ID: <DF5E364A470421429AE6DC96979A4F6FAFF5E6@sdcexc04.emea.cpqcorp.net>
Some experiences while trying to write a parser for Turtle:

I had hoped to have a combined N3/Turtle parser with a switch to restrict it to Turtle. This is beginning to look hard/impossible because of #1 and #2 (well, nothing is impossible; it just means the work has to be moved out of the parser into a later processing stage).

My current development Turtle grammar is attached. It passes the Turtle test suite, but I don't consider it finished. It's extracted from SPARQL, so it allows dots inside qnames.

== #1 : Tokenizing

I'm using javacc, which generates LL parsers with a separate tokenizer. It works by reading the input stream and identifying tokens as the stream is read. Like flex/lex, this is the common way. It means that tokens are identified without regard to where in the production rules the parser currently is.

Javacc does support context-sensitive tokenizing (it can switch between different token sets under parser control). I don't mind using the tricky bits, but in SPARQL there was a big push not to go there.

Tokens are whitespace-sensitive; the parser does not need to be (other than that whitespace splits terminals).

The tricky case in N3 is language codes. I did it in Turtle and SPARQL by using the @, so a language code token includes the @; the tokens for a language code and for "a" are different. In n3.n3, I see:

  langcode cfg:matches "[a-z]+(-[a-z0-9]+)*";
           cfg:canStartWith "a".

and don't see how to handle this without using context-sensitive tokenizing (a special state for langcodes). javacc has such a feature; flex does as well, but it makes life just plain tricky. The alternative is to have some bland token for any sequence of bare letters without a ":" and test elsewhere.

How does the N3 parser in cwm tell "a" (as in "rdf:type") apart from "a" the langcode?

[[ Aside: I see, in n3.n3:

  # - @keywords affects tokenizing

Isn't this the same thing as typedefs in C, where the token tables change as the language is parsed? I don't know how to handle this in javacc or antlr.
]]

== #2 : align whitespace between N3 and Turtle

This is not legal Turtle:

  <a><b><c> .

by:

  [4] triples ::= subject ws+ predicateObjectList

because it has no whitespace between the subject and the predicate. But it is N3 and is reasonable RDF.

It also means the parser itself can't be whitespace independent, leaving whitespace handling to the lexer to merely split terminals as necessary.

== #3 \u escapes

Long form: (do not do it like SPARQL!)
http://lists.w3.org/Archives/Public/public-rdf-dawg/2006JanMar/0443.html

Short form:

The problem is what happens if a significant character is used. SPARQL example:

  ?ac\u0020

The parser has to look at this twice: once to see a variable, then again after escape processing to check that no illegal characters have appeared. In this case, a space has been put into a variable name, which is illegal.

Turtle avoids this by:

1/ Having several .character rules
2/ Not allowing \u in places like qnames, where the full range of Unicode is otherwise allowed, which isn't internationalization-friendly.
3/ Having ranges like
     [39] ucharacter ::= ( character - #x3E ) | '\>'
   which mix a parser rule and a token character.

So \u003E puts a > into a URI, because the character rule accepts \u003E.

There is text to further modify \u legality but (and this is the SPARQL problem) the rules can't be expressed in a formal grammar.

Not sure where N3 is with \u escapes.

Suggestion: Define processing as:

1/ Apply \u escaping at the lowest level. It applies to the input stream, so by the end of this step the parser does not see \u as an escape sequence, and \u works everywhere. At this point we have a stream of characters or UTF-8, depending on your toolkit technology.
2/ Tokenizing, to create a stream of tokens (usually done lazily).
3/ Parsing: apply the grammar.

Then \u processing does not need special text or special cases.
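The three-step processing suggested above can be sketched roughly as follows. This is only an illustration in Python, not the attached javacc grammar: the token set is a made-up toy subset, and the names unescape, tokenize and _TOKEN are hypothetical. It also assumes the simplest possible \u rule (every \uXXXX or \UXXXXXXXX in the input is decoded, with no special treatment of a preceding escaped backslash). The language tag token includes the @, as described under #1, so it cannot be confused with the keyword "a".

```python
import re

# Stage 1: decode \uXXXX and \UXXXXXXXX over the raw input, before tokenizing.
# Simplifying assumption: no handling of an escaped backslash before the "u".
_ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})")

def unescape(text):
    return _ESCAPE.sub(lambda m: chr(int(m.group(1) or m.group(2), 16)), text)

# Stage 2: a context-free tokenizer over the already-unescaped characters.
# Toy token set for illustration only; it is not the Turtle or N3 grammar.
_TOKEN = re.compile(r"""
    (?P<IRI>     <[^<>\s]*> )
  | (?P<STRING>  "[^"\n]*" )
  | (?P<LANGTAG> @[a-z]+(?:-[a-z0-9]+)* )
  | (?P<VAR>     \?[A-Za-z_][A-Za-z0-9_]* )
  | (?P<A>       a\b )
  | (?P<DOT>     \. )
  | (?P<WS>      \s+ )
""", re.VERBOSE)

def tokenize(text):
    """Yield (kind, lexeme) pairs lazily; whitespace only separates tokens."""
    pos = 0
    while pos < len(text):
        m = _TOKEN.match(text, pos)
        if m is None:
            raise SyntaxError(f"unexpected character {text[pos]!r} at offset {pos}")
        if m.lastgroup != "WS":
            yield m.lastgroup, m.group()
        pos = m.end()

# Stage 3 (parsing) would consume this token stream; here we just print it.
if __name__ == "__main__":
    for line in (r'?ac\u0020<a> "x"@en .', '<s> a <C> .'):
        print(list(tokenize(unescape(line))))
```

With this ordering, the \u0020 in ?ac\u0020 is an ordinary space by the time the tokenizer runs, so it simply ends the variable token and nothing has to be validated a second time. Likewise <a\u003Eb> is seen as <a>b>, so an escaped > cannot slip into an IRI.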
== Odds and Ends

From n3.n3:

  explicituri cfg:matches "<[^>]*>";

That includes newlines inside IRIs.

The qname name token says (with the \u stuff removed):

  (([A-Z_a-z][\\-0-9A-Z_a-z]*)?:)?[A-Z_a-][\\-0-9A-Z_a-]*

which makes the ":" optional.
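Both observations are easy to check with any regex engine. The sketch below uses Python's re module and copies the two patterns exactly as quoted above. One assumption: the doubled backslash is taken to be N3 string-literal escaping, so it is written as a single backslash in the regex itself. The names EXPLICIT_URI and QNAME are just labels for this example.

```python
import re

# Patterns as quoted from n3.n3 (with the \u ranges already removed).
# Assumption: "\\-" in the N3 string literal denotes the regex "\-".
EXPLICIT_URI = re.compile(r"<[^>]*>")
QNAME = re.compile(r"(([A-Z_a-z][\-0-9A-Z_a-z]*)?:)?[A-Z_a-][\-0-9A-Z_a-]*")

# [^>] excludes only ">", so a newline inside the angle brackets still matches.
iri_with_newline = "<http://example.org/\nfoo>"
print(EXPLICIT_URI.fullmatch(iri_with_newline) is not None)

# The whole "prefix:" group is optional, so a bare name with no ":" matches too.
print(QNAME.fullmatch("rdf:a") is not None)
print(QNAME.fullmatch("a") is not None)
```

All three checks report a match. The last one means a bare "a" is accepted by the qname pattern, which is the same kind of overlap as the "a" versus langcode question under #1.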
Attachments
- text/html attachment: turtle.html
Received on Monday, 19 June 2006 15:12:17 UTC