- From: Seaborne, Andy <andy.seaborne@hp.com>
- Date: Fri, 05 Aug 2005 18:28:13 +0100
- To: Dave Beckett <dave.beckett@bristol.ac.uk>, Steve Harris <S.W.Harris@ecs.soton.ac.uk>, RDF Data Access Working Group <public-rdf-dawg@w3.org>
I'm assuming the character escapes \t, \n etc. go in the various string literals. This can be coded into the grammar easily enough, e.g.:

    ECHAR ::= '\' [tbnrf\"']
    STRING_LITERAL1 ::= "'" ( ([^#x27#x5C#xA#xD]) | <ECHAR> )* "'"

[[Although for better error messages a processor might wish to allow any one char after \ and give a specific error message if it is not one of the acceptable ones.]]

\u and \U present some choices.

The requirement, as I understand it, is to allow \u in variables, qnames, IRIs and strings (non-syntax constructs) in addition to UTF-8/UTF-16 encoding of queries (I guess for input systems that are limited in some way).

The options seem to be:

1/ Allow \u in variables, qnames, IRIs and strings only. Place \u rules in the grammar.

This gets a bit messy and isn't perfect anyway because, for example, you can put a space into a qname syntactically with \u0020 by writing

    :x\u0020z

In other words, there is going to be some text somewhere that says "don't do that" as well as the grammar rules.

This is roughly:

    ECHAR ::= '\' [tbnrf\"']
    UCHAR ::= '\' ( 'u' HEX HEX HEX HEX | 'U' HEX HEX HEX HEX HEX HEX HEX HEX )
    STRING_LITERAL1 ::= "'" ( ([^#x27#x5C#xA#xD]) | UCHAR | ECHAR )* "'"
    NCCHAR1p ::= ..... | UCHAR

except it's ambiguous tokenizing on "\", so:

    ECHAR ::= [tbnrf\"']
    UCHAR ::= 'u' HEX HEX HEX HEX | 'U' HEX HEX HEX HEX HEX HEX HEX HEX
    STRING_LITERAL1 ::= "'" ( ([^#x27#x5C#xA#xD]) | ('\' (UCHAR | ECHAR )) )* "'"
    NCCHAR1p ::= ..... | ('\' UCHAR)

(or some variation on this theme)

Having this and only allowing the right characters/escape sequences would make the grammar very unreadable (need to enumerate both normal and escape forms), so while technically possible, it fails on communicating with implementers.

2/ State in text that \u is allowed anywhere, assuming the implementation will do whatever is best for it to deal with it. For flex, that means the YY_INPUT (for codepoints in 0-127) and some higher level checking (flex only deals in narrow chars, I'm informed).

This is keeping the query as a character string without regard to \u escaping in the language and expecting processors to cope with it. Still have \t escaping.

3/ State we allow \u in variables, qnames, IRIs and strings. Do not place this in the grammar. The text about "don't do that" applying to illegal items applies again. Implementers will have to work out how to do it. We should realise that some systems may also allow \u anywhere. Such systems would strictly be wrong, but this recognizes the fact that doing \u on the input stream is far easier in some setups.

Any of these are acceptable to me. Mild pref for 1 over 3.

Andy

As I was reading around, this might help some people:

Java 1.5 has string extensions for codepoints to cover 32 bit codepoints, as if they were UTF-16 encoded (simplified reason); Java 1.4 does not have codepoint operations.

http://java.sun.com/developer/technicalArticles/releases/j2se15/
See "Supplementary Character Support"
and
http://java.sun.com/developer/technicalArticles/Intl/Supplementary/
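To make the difference between options 1 and 2 concrete, here is a rough Python sketch. It is not taken from any SPARQL implementation; it only assumes the second grammar variant above (backslash factored out of ECHAR and UCHAR). The function names `unescape` and `expand_uchars` are hypothetical, chosen for illustration.

```python
import re

# Backslash factored out, as in the second grammar variant (assumption:
# HEX is [0-9A-Fa-f]).
ECHAR = r"[tbnrf\\\"']"
UCHAR = r"u[0-9A-Fa-f]{4}|U[0-9A-Fa-f]{8}"

# Option 1: escapes are part of the token rule. Raw characters exclude
# ' (#x27), \ (#x5C), LF (#xA) and CR (#xD).
STRING_LITERAL1 = re.compile(
    r"'(?:[^\x27\x5C\x0A\x0D]|\\(?:" + UCHAR + "|" + ECHAR + r"))*'"
)

_SIMPLE = {"t": "\t", "b": "\b", "n": "\n", "r": "\r", "f": "\f",
           "\\": "\\", '"': '"', "'": "'"}

# Accept any one character after the backslash so that a bad escape can
# be reported specifically instead of as a generic syntax error.
_ESCAPE = re.compile(r"\\(?:u([0-9A-Fa-f]{4})|U([0-9A-Fa-f]{8})|(.))", re.DOTALL)


def unescape(body):
    """Decode ECHAR and UCHAR escapes inside a string literal body."""
    def repl(m):
        hexdigits = m.group(1) or m.group(2)
        if hexdigits:
            return chr(int(hexdigits, 16))
        ch = m.group(3)
        if ch not in _SIMPLE:
            raise ValueError("illegal escape sequence: \\" + ch)
        return _SIMPLE[ch]
    return _ESCAPE.sub(repl, body)


# Option 2: expand \u and \U over the whole query text before tokenizing.
# Simplification: this pre-pass does not treat "\\u0041" (an escaped
# backslash followed by u0041) specially.
_UCHAR_ANYWHERE = re.compile(r"\\(?:u([0-9A-Fa-f]{4})|U([0-9A-Fa-f]{8}))")


def expand_uchars(query):
    """Replace every \\uXXXX or \\UXXXXXXXX in the raw query text."""
    return _UCHAR_ANYWHERE.sub(
        lambda m: chr(int(m.group(1) or m.group(2), 16)), query
    )
```

With the pre-pass of option 2, `expand_uchars(r":x\u0020z")` yields `:x z`, so the space reaches the tokenizer and the qname simply fails to parse. Under option 1 the same input has to be ruled out by the accompanying "don't do that" text, since the grammar alone would accept it.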
Received on Friday, 5 August 2005 17:28:24 UTC