Re: [Fwd: SPARQL: Backslashes in string literals] from Seaborne, Andy on 2005-08-05 (public-rdf-dawg@w3.org from July to September 2005)

From: Seaborne, Andy <andy.seaborne@hp.com>
Date: Fri, 05 Aug 2005 18:28:13 +0100
To: Dave Beckett <dave.beckett@bristol.ac.uk>, Steve Harris <S.W.Harris@ecs.soton.ac.uk>, RDF Data Access Working Group <public-rdf-dawg@w3.org>
Message-ID: <42F3A1AD.4020108@hp.com>

I'm assuming the character escapes \t \n etc go in the various string literals. 
  This can be coded into the grammar easily enough:

e.g.
ECHAR            ::=  '\' [tbnrf\"']
STRING_LITERAL1  ::=  "'" ( ([^#x27#x5C#xA#xD]) | <ECHAR> )* "'"

[[although for better error messages a processor might wish to allow any one 
char after \ and give a specific error message if it is not one of the 
acceptable ones.]]
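
A minimal sketch of that error-reporting idea, in Python: accept any one character after the backslash while scanning, then reject it with a specific message if it is not in the ECHAR set. The names decode_echars and EscapeError are hypothetical, not from any SPARQL implementation.

```python
# Hypothetical helper: decode ECHAR escapes in the body of a string literal
# (the text between the quotes). The escape set mirrors ECHAR: [tbnrf\"'].
ECHAR_MAP = {
    "t": "\t", "b": "\b", "n": "\n", "r": "\r", "f": "\f",
    "\\": "\\", '"': '"', "'": "'",
}


class EscapeError(ValueError):
    """Raised for a backslash followed by a character outside ECHAR."""


def decode_echars(body: str) -> str:
    out = []
    i = 0
    while i < len(body):
        ch = body[i]
        if ch != "\\":
            out.append(ch)
            i += 1
            continue
        if i + 1 >= len(body):
            raise EscapeError("dangling backslash at offset " + str(i))
        nxt = body[i + 1]
        if nxt not in ECHAR_MAP:
            # Permissive scan, specific complaint: name the bad escape.
            raise EscapeError(
                "unknown escape \\" + nxt + " at offset " + str(i)
            )
        out.append(ECHAR_MAP[nxt])
        i += 2
    return "".join(out)
```

The stricter grammar rule would simply fail to tokenize the literal; this looser scan lets the processor say which escape was wrong and where.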

\u and \U present some choices:

The requirement, as I understand it, is to allow \u in variables, qnames, IRIs 
and strings (non-syntax-constructs) in addition to UTF-8/UTF-16 encoding of 
queries (I guess for input systems that are limited in some way).

The options seem to be:

1/ Allow \u in variables, qnames, IRIs and strings only.
    Place \u rules in the grammar.

This gets a bit messy and isn't perfect anyway because, for example, you can put 
a space into a qname syntactically with \u0020 by writing  :x\u0020z.  In other 
words, there is going to be some text somewhere that says "don't do that" as 
well as the grammar rules.

This is roughly:

ECHAR            ::=  '\' [tbnrf\"']
UCHAR            ::=  '\' ( 'u' HEX HEX HEX HEX | 'U' HEX HEX HEX HEX HEX HEX HEX HEX )

STRING_LITERAL1  ::=  "'" ( ([^#x27#x5C#xA#xD]) | UCHAR | ECHAR )* "'"

NCCHAR1p         ::= ..... | UCHAR

except that tokenizing on "\" is ambiguous, so:

ECHAR            ::=  [tbnrf\"']
UCHAR            ::=  'u' HEX HEX HEX HEX | 'U' HEX HEX HEX HEX HEX HEX HEX HEX

STRING_LITERAL1  ::=  "'" ( ([^#x27#x5C#xA#xD]) | ('\' (UCHAR | ECHAR )) )* "'"

NCCHAR1p         ::= ..... | ('\' UCHAR)

(or some variation on this theme)
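
As a rough check of the factored form, the STRING_LITERAL1 rule with the backslash pulled out can be written as a Python regular expression. This is only a sketch: HEX is not defined in this message, so [0-9A-Fa-f] is an assumption.

```python
import re

# Assumption: HEX means a single hexadecimal digit, either case.
HEX = r"[0-9A-Fa-f]"
# ECHAR and UCHAR without the leading backslash, as in the factored rules.
ECHAR = r"[tbnrf\\\"']"
UCHAR = rf"(?:u{HEX}{{4}}|U{HEX}{{8}})"

# "'" ( [^#x27#x5C#xA#xD] | '\' (UCHAR | ECHAR) )* "'"
STRING_LITERAL1 = re.compile(
    rf"'(?:[^\x27\x5C\x0A\x0D]|\\(?:{UCHAR}|{ECHAR}))*'"
)

# A few literals written as they would appear in query text.
print(bool(STRING_LITERAL1.fullmatch(r"'a\tb'")))
print(bool(STRING_LITERAL1.fullmatch(r"'caf\u00E9'")))
print(bool(STRING_LITERAL1.fullmatch(r"'bad\q'")))
```

Because the backslash is consumed once and the next character then selects UCHAR or ECHAR, the two alternatives no longer share a prefix.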


Having this and only allowing the right characters/escape sequences would make 
the grammar very unreadable (it would need to enumerate both normal and escaped 
forms) so, while technically possible, it fails at communicating with implementers.


2/ State in text that \u is allowed anywhere, assuming the implementation will 
do whatever is best for it to deal with it.  For flex, that means the 
YY_INPUT (for codepoints in 0-127) and some higher level checking (flex only 
deals in narrow chars, I'm informed).

This keeps the query as a character string without regard to \u escaping in 
the language, and expects processors to deal with it.  We would still have \t 
escaping.
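
One way a processor might do option 2 is a pass over the raw query text before tokenizing, sketched here in Python. The function name expand_unicode_escapes is hypothetical, and leaving a doubled backslash untouched (so that \\u0041 is not expanded) is an assumption borrowed from how Java treats Unicode escapes, not something stated above.

```python
import re

# Match an escaped backslash first so that "\\u0041" is left alone,
# then \uXXXX and \UXXXXXXXX anywhere in the text.
_ESCAPES = re.compile(r"\\\\|\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})")


def expand_unicode_escapes(query: str) -> str:
    """Replace \\u and \\U escapes anywhere in the query string."""

    def repl(match):
        digits = match.group(1) or match.group(2)
        if digits is None:
            # Escaped backslash: keep it for the later \t-style pass.
            return match.group(0)
        # chr() raises ValueError above 0x10FFFF, which suits a sketch.
        return chr(int(digits, 16))

    return _ESCAPES.sub(repl, query)


print(expand_unicode_escapes(r"SELECT ?\u0078 WHERE { ?x :p 'caf\u00E9' }"))
```

Other escapes such as \t pass through unchanged, so the string-literal rules still have to handle them afterwards.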


3/ State we allow \u in variables, qnames, IRIs and strings.
    Do not place this in the grammar.

The text about "don't do that" applying to illegal items applies again.
Implementers will have to work out how to do it.  We should realise that some 
systems may also allow \u anywhere - such systems would strictly be wrong but it 
recognizes the fact that doing \u on the input stream is far easier in some setups.
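
For option 3, one possible approach is to decode \u inside an already-recognized token and then check the result, which is where the "don't do that" text would bite. A sketch in Python follows; decode_local_name is a hypothetical name, and the _LOCAL_NAME pattern is a deliberately simplified stand-in for the real name-character rules.

```python
import re

_UCHAR = re.compile(r"\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})")
# Assumption: a toy approximation of the legal local-name characters.
_LOCAL_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_\-]*")


def decode_local_name(raw: str) -> str:
    """Decode \\u escapes in a qname local part, then validate the result."""
    decoded = _UCHAR.sub(
        lambda m: chr(int(m.group(1) or m.group(2), 16)), raw
    )
    if not _LOCAL_NAME.fullmatch(decoded):
        raise ValueError(
            "illegal character in name after decoding: " + repr(decoded)
        )
    return decoded


print(decode_local_name(r"x\u0079z"))
```

With this check, the local part of :x\u0020z decodes to a name containing a space and is rejected, even though the escape itself is well formed.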


Any of these are acceptable to me.  Mild pref for 1 over 3.



 Andy

As I was reading around, this might help some people:

Java 1.5 has string extensions for codepoints to cover 32 bit codepoints, as if 
they were UTF-16 encoded (simplified reason); Java 1.4 does not have codepoint 
operations.

http://java.sun.com/developer/technicalArticles/releases/j2se15/
See "Supplementary Character Support"
and
http://java.sun.com/developer/technicalArticles/Intl/Supplementary/
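
For anyone wondering what "as if they were UTF-16 encoded" means for a \U escape, here is the standard surrogate-pair arithmetic as a small Python sketch (Python rather than Java purely for illustration; to_utf16_code_units is a hypothetical name).

```python
from typing import List


def to_utf16_code_units(code_point: int) -> List[int]:
    """Return the UTF-16 code units (one or two) for a Unicode code point."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("not a Unicode code point: " + hex(code_point))
    if code_point < 0x10000:
        # Basic Multilingual Plane: a single 16-bit unit.
        return [code_point]
    # Supplementary character: split the 20-bit offset into two surrogates.
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)
    low = 0xDC00 + (offset & 0x3FF)
    return [high, low]


# \U0001D11E needs two 16-bit units.
print([hex(u) for u in to_utf16_code_units(0x1D11E)])
```

A processor whose strings are sequences of 16-bit chars would store a \U escape above FFFF as such a pair.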

Received on Friday, 5 August 2005 17:28:24 UTC