\u escapes from Seaborne, Andy on 2006-03-10 (public-rdf-dawg@w3.org from January to March 2006)

From: Seaborne, Andy <andy.seaborne@hp.com>
Date: Fri, 10 Mar 2006 16:36:59 +0000
To: RDF Data Access Working Group <public-rdf-dawg@w3.org>
Message-ID: <4411AB2B.5090409@hp.com>

An implementation experience:
[Eric and I discussed this on IRC yesterday]

This is not about the Unicode comments Eric is addressing.

As things stand, with \u restricted to specific places in the grammar, it is 
actually harder to parse strictly, and the automatically produced grammars are 
less useful because some text rules also have to be applied. The text rules 
aren't captured by the yacker-produced grammars.

Take variables:  ?x\u0020y is an illegal variable.

The EBNF passes it because UCHAR is on the token rule NCCHAR1p and nothing in 
the parser requires the escaped character to be a legal one - only the text 
in A.6 says that.

So the query engine has to check the variable name after processing \u 
sequences, but that is after the parser has run and found a variable.  It is 
a nuisance because the parser would already have done that check had no \u 
been involved.
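
A minimal sketch of that situation, in Python. The character classes are 
simplified ASCII-only stand-ins for the real grammar rules, and the names 
(VAR_TOKEN, LEGAL_NAME, check_variable) are made up for illustration; they 
are not taken from the spec or from ARQ.

```python
import re

# Simplified stand-ins (assumptions, not the SPARQL productions):
# a name character is an ASCII letter, digit or underscore, or a \uXXXX escape.
UCHAR = r"\\u[0-9A-Fa-f]{4}"
VAR_TOKEN = re.compile(r"\?(?:[A-Za-z0-9_]|" + UCHAR + r")+")
LEGAL_NAME = re.compile(r"[A-Za-z0-9_]+")


def expand_u(text):
    """Replace each 4-digit backslash-u escape with the character it names."""
    return re.sub(UCHAR, lambda m: chr(int(m.group(0)[2:], 16)), text)


def check_variable(token):
    """Post-parse check: is the name still legal once escapes are expanded?"""
    name = expand_u(token[1:])  # drop the leading '?'
    return LEGAL_NAME.fullmatch(name) is not None


token = r"?x\u0020y"
# The token rule accepts it, because UCHAR is part of the rule ...
accepted_by_tokenizer = VAR_TOKEN.fullmatch(token) is not None
# ... but the expanded name contains a space, so a second check is needed.
legal_after_expansion = check_variable(token)
```

Here the token rule accepts the variable and only the separate check after 
expansion rejects it, which is the step that is easy to forget.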

[[ARQ gets it wrong for variables and PREFIXes at the moment because I forgot 
to check again -  full prefixed names and IRIs are OK because IRI checking is 
applied and it is caught there.]]

A simpler design is to move where \u processing occurs and put it before the 
parser's tokenizer, so that all \u escapes have been turned into their 
respective characters before the EBNF grammar is applied.

The sequence of processing is then like:
   bytes => chars
   chars => chars expanding \u and \U
   chars => parser

and the parser rule for variables only generates legal variables, and error 
messages are likely to be easier and more consistent.  I think this is closer 
to how other systems already do it.
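
The three-stage sequence above can be sketched as follows, in Python. This is 
an illustration only: UTF-8 input is assumed, the function names are 
hypothetical, the "parser" is just a whitespace split, and special cases such 
as an escaped backslash before the u are not handled.

```python
import re

# \uXXXX (4 hex digits) or \UXXXXXXXX (8 hex digits)
ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})")


def decode(raw):
    """Stage 1: bytes => chars (UTF-8 is an assumption of this sketch)."""
    return raw.decode("utf-8")


def expand_escapes(text):
    """Stage 2: chars => chars, expanding the 4- and 8-digit escapes.

    chr() raises ValueError for code points above 0x10FFFF.
    """
    return ESCAPE.sub(lambda m: chr(int(m.group(1) or m.group(2), 16)), text)


def tokenize(text):
    """Stage 3 stand-in: a real parser would apply the EBNF grammar here."""
    return text.split()


raw = rb"SELECT ?x\u0020y"
tokens = tokenize(expand_escapes(decode(raw)))
```

With the escape expanded first, the tokenizer sees the variable ?x followed by 
a stray y, so the input is rejected by the ordinary grammar and no separate 
check on the variable name is needed.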

This makes \u escapes legal everywhere, but they are processed before the 
grammar sees them.

((
\u is legal everywhere in Java, by the way, including comments. You can't 
write in Java "// This is a \u in a comment" because the \u there is not 
followed by four hex digits.
))

The changes would be:

+ Remove \u text from A.5 - leave the text about \n etc.
+ Remove A.6
+ Have a new section (ideally with the Unicode text) saying \u is applied 
before the EBNF is used.
+ Remove UCHAR and HEX from the grammar.

In favour of this change:

+ All legal queries are still legal.
+ All the work of checking that a character is legal at a given point in the 
query is done by the parser.
+ The grammars generated by yacker do not require the extra checking for legal 
variables and prefixed names.

So it meets the "automatically generated" comment better.

Against:

- It's a change.

It would make some illegal queries (\u in places currently not allowed) become 
legal.
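
As a small illustration of that last point, in Python: this assumes keywords 
and comments are among the places where \u is currently not allowed, and 
expand_u is a hypothetical helper that handles only the 4-digit form.

```python
import re

ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})")


def expand_u(text):
    """Expand every 4-digit backslash-u escape, wherever it appears."""
    return ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)


# An escape inside a keyword and another inside a comment.
query = r"SEL\u0045CT ?x WHERE { ?x ?p ?o } # caf\u00E9"
expanded = expand_u(query)
```

After the pre-pass the parser only ever sees the expanded text, so it cannot 
tell this query apart from one written without escapes.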

I view this as a change based on implementation experience, one that makes 
all the grammars we have produced more useful to implementers.

 Andy

Received on Friday, 10 March 2006 16:37:18 UTC