- From: Seaborne, Andy <andy.seaborne@hp.com>
- Date: Fri, 10 Mar 2006 16:36:59 +0000
- To: RDF Data Access Working Group <public-rdf-dawg@w3.org>
An implementation experience: [Eric and I discussed this on IRC yesterday]

This is not about the Unicode comments Eric is addressing.

The way it is now, with \u restricted to specific places in the grammar, it is actually harder to parse strictly, and the automatically produced grammars are less useful because some text rules also have to be applied. The text rules aren't captured by the yacker-produced grammars.

Take variables: ?x\u0020y is an illegal variable. The EBNF passes it because UCHAR is in the token rule NCCHAR1p and there is no parser restriction requiring the result to be a legal character; it's the text in A.6 that says that. So the query engine has to check the variable name after processing \u sequences, but that is after the parser has run and found a variable. It is a nuisance because the parser would have done this check itself had no \u been involved.

[[ARQ gets it wrong for variables and PREFIXes at the moment because I forgot to check again. Full prefixed names and IRIs are OK because IRI checking is applied and the problem is caught there.]]

A simpler design is to move where \u processing occurs and put it before the parser tokenizer, so that all \u sequences have been turned into their respective characters before the EBNF grammar is applied. The sequence of processing is then:

  bytes => chars
  chars => chars, expanding \u and \U
  chars => parser

The parser rule for variables then only generates legal variables, and error messages are likely to be easier to produce and more consistent. I think this is closer to how other systems already do it.

This makes \u escapes legal everywhere, but they are processed before the grammar sees them. ((\u is legal everywhere in Java, by the way, including comments. You can't write in Java "// This is a \u in a comment".))

The changes would be:

 + Remove the \u text from A.5; leave the text about \n etc.
 + Remove A.6.
 + Have a new section (ideally with the Unicode text) saying \u is applied before the EBNF is used.
 + Remove UCHAR and HEX from the grammar.
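
As a rough illustration of the "chars => chars" step above, here is a minimal sketch in Python. It is not ARQ's code and not text from the spec: the names expand_unicode_escapes and variable_tokens are hypothetical, and the variable pattern is a deliberately simplified ASCII-only stand-in for the real token rule.

```python
import re

# Matches \uXXXX (4 hex digits) or \UXXXXXXXX (8 hex digits).
_ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})")


def expand_unicode_escapes(query: str) -> str:
    """Replace every \\u / \\U escape with the character it denotes.

    Runs over the whole query text, before any tokenizing, so escapes are
    expanded everywhere (including strings and comments).
    """
    def _replace(match: re.Match) -> str:
        hex_digits = match.group(1) or match.group(2)
        return chr(int(hex_digits, 16))

    return _ESCAPE.sub(_replace, query)


# Hypothetical, simplified variable token (ASCII only); the real rule
# allows a much wider character range.
_VAR = re.compile(r"\?[A-Za-z_][A-Za-z0-9_]*")


def variable_tokens(query: str) -> list[str]:
    """Find variable tokens after the escape pre-pass has run."""
    return _VAR.findall(expand_unicode_escapes(query))
```

With this ordering, ?x\u0020y reaches the tokenizer as "?x y", so the variable token is simply ?x and no second check of the variable name is needed after parsing. In this sketch a \U value above 10FFFF makes chr() raise ValueError; a real implementation would want to report that as a query syntax error.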
In favour of this change:

 + All legal queries are still legal.
 + All the work of checking that a character is legal at a given point in the query is done by the parser.
 + The grammars generated by yacker do not require the extra checking for legal variables and prefixed names, so the change meets the "automatically generated" comment better.

Against:

 - It's a change. It would make some illegal queries (\u in places currently not allowed) become legal.

I view this as a change based on implementation experience, one that makes all the grammars we have produced more useful to implementers.

 Andy
Received on Friday, 10 March 2006 16:37:18 UTC