RE: character escaping from Seaborne, Andy on 2006-09-14 (public-sparql-dev@w3.org from July to September 2006)

From: Seaborne, Andy <andy.seaborne@hp.com>
Date: Thu, 14 Sep 2006 09:39:41 +0100
To: "Denis Gaertner" <denis_gaertner@gmx.net>, <public-sparql-dev@w3.org>
Message-ID: <86FE9B2B91ADD04095335314BE6906E868AE38@sdcexc04.emea.cpqcorp.net>
-------- Original Message --------
> From: Seaborne, Andy <>
> Date: 13 September 2006 16:20
> 
> -------- Original Message --------
> > From: Denis Gaertner <>
> > Date: 13 September 2006 12:30
> > 
> > Hi,
> > 
> > I got another question.
> > 
> > It's about character escaping in the regex function. There seem to be
> > two ways for this.
> > 
> > \\x00 for one character which is ASCII and \u0000 for a Unicode
> > codepoint. This goes fine. I tried \\x{..} as well but it doesn't seem
> > to work in my environment. My problem is that I get escaped characters
> > which have a longer hex code in chunks of two, e.g. u-umlaut (U+00FC,
> > UTF-8 bytes C3 BC) as "\C3\BC". If you have only characters like that
> > in a foreign script you get a whole line like this and it is hard to
> > know which is which. So I was wondering if it is somehow possible to
> > simply translate this to \\xc3\\xbc.. in a regular expression without
> > having to use Unicode codepoints.
> > 
> > Thanks again
> > 
> > Denis
> 
> Hi Denis,
> 
> A SPARQL query string is defined to be UTF-8.  If you're working in a
> UTF-8-aware editor you can put in exactly the characters you need
> (being careful if written to disk, etc.).
> 
> SPARQL uses the XPath/XQuery regular expression language.  You have to
> follow a bunch of links for this because the XPath/XQuery regular
> expression language itself refers back to the XML Schema regex
> language.  Links below.
> 
> Now \x isn't a legal escape sequence ... well - I can't find it anyway.
> You should use the \u form.  I guess your engine is relying on some
> underlying regex engine that just happens to have \x built in.
> 
> Consider "\\x20".  That is a SPARQL string with \ (one of them), then
> "x", "2", "0".   It should match that substring and only that substring,
> not a space.  But to, say, just for example, random choice, Java, that
> is an escape sequence meaning ASCII 0x20 (space).
> java.util.regex.Pattern describes the pattern language - it is a
> superset of the XPath/XQuery regex language.
> 
> The bug is that the SPARQL engine should re-escape the literal "\" so
> the string is seen as a regex for Java with the literal "\" not the \x
> escape sequence.  

I was wrong about this bit: "\\x20" is illegal as a regular expression;
it is not a plain string to be matched.  \\x becomes \x in the regular
expression pattern and that's an illegal escape sequence according to
http://www.w3.org/TR/xmlschema-2/#regexs
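
To make the two levels of escaping concrete, here is a minimal Python
sketch. The helper sparql_string_value is hypothetical and only handles
the single-character SPARQL string escapes; Python's re module stands in
for a host regex engine that, like Java's, happens to accept \x.

```python
import re

# SPARQL string-literal escapes handled by this sketch (an assumption:
# only the single-character escapes, not \uXXXX code point escapes).
ECHAR = {"t": "\t", "n": "\n", "r": "\r", "b": "\b", "f": "\f",
         '"': '"', "'": "'", "\\": "\\"}

def sparql_string_value(body):
    """Hypothetical helper: resolve the escapes in the text written
    between the quotes of a SPARQL string literal."""
    def resolve(match):
        ch = match.group(1)
        if ch not in ECHAR:
            raise ValueError("not a SPARQL string escape: \\" + ch)
        return ECHAR[ch]
    return re.sub(r"\\(.)", resolve, body, flags=re.DOTALL)

# The query text "\\x20" yields the four-character regex pattern \x20 ...
pattern = sparql_string_value(r"\\x20")
assert pattern == "\\x20" and len(pattern) == 4

# ... which Python's re (standing in for a host engine) reads as a space.
assert re.search(pattern, "a b") is not None
assert re.search(pattern, "a\\x20b") is None
```

A strict SPARQL processor would reject the pattern \x20 at this point
instead of handing it to the host engine.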

As far as I can determine the situation is:

The relevant rules are:
 [9]   atom          ::= Char | charClass | ( '(' regExp ')' )
 [10]  Char          ::= [^.\?*+{}()|^$#x5B#x5D]
 [11]  charClass     ::= charClassEsc | charClassExpr | WildcardEsc
 [23]  charClassEsc  ::= ( SingleCharEsc | MultiCharEsc | catEsc
                         | complEsc | backReference )
 [23a] backReference ::= "\" [1-9][0-9]*
 [24]  SingleCharEsc ::= '\' [nrt\|.?*+(){}#x2D#x5B#x5D#x5E]
 [25]  catEsc        ::= '\p{' charProp '}'
 [26]  complEsc      ::= '\P{' charProp '}'
 [37]  MultiCharEsc  ::= '\' [sSiIcCdDwW]

(10, 11, 23, and 24 are modified by
http://www.w3.org/TR/xpath-functions/#regex-syntax)
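
As a rough illustration of how a processor could check escapes against
these rules, the Python sketch below scans a pattern and reports any
backslash escape not covered by rules [23a], [24], [25], [26] or [37] as
quoted in this message. The function name illegal_escapes is
hypothetical, and the sketch does not validate charProp or any other
part of the regex grammar.

```python
# Escape letters taken from the rules as quoted in this message.
SINGLE_CHAR_ESC = set("nrt\\|.?*+(){}-[]^")   # rule [24]
MULTI_CHAR_ESC = set("sSiIcCdDwW")            # rule [37]
NONZERO_DIGITS = set("123456789")             # rule [23a]

def illegal_escapes(pattern):
    """Hypothetical helper: list the backslash escapes in a regex pattern
    that none of rules [23a], [24], [25], [26], [37] allow."""
    bad = []
    i = 0
    while i < len(pattern):
        if pattern[i] != "\\":
            i += 1
            continue
        nxt = pattern[i + 1:i + 2]             # "" if the "\" is last
        legal = (
            nxt in SINGLE_CHAR_ESC
            or nxt in MULTI_CHAR_ESC
            or nxt in NONZERO_DIGITS
            or (nxt in ("p", "P") and pattern[i + 2:i + 3] == "{")
        )
        if not legal:
            bad.append("\\" + nxt)
        i += 2                                 # skip "\" and the escaped char
    return bad
```

For the pattern \x20 this reports \x, while \p{Lu}\d+\. passes with no
complaints.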

So the escapes that are in Java but not in SPARQL are:

\e \x \b \B \G \Z \A \Q \E \u

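and \c has a different meaning. \u has a bizarre interaction: \u0020 is
legal because the SPARQL parser turns it into a space before the regex
is seen; \\u0020 reaches the regex engine as \u0020, which is in the
Java escapes but not the SPARQL ones.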
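
Since the \u form is the legal one, byte-chunk input such as the
"\C3\BC" from the original question could be converted to \u escapes
before the query is built. A minimal Python sketch, assuming the chunks
are well-formed UTF-8 bytes written as \HH pairs (the function name is
hypothetical; malformed runs raise UnicodeDecodeError):

```python
import re

def byte_escapes_to_unicode_escapes(text):
    r"""Hypothetical helper: turn runs like \C3\BC (UTF-8 bytes in hex)
    into SPARQL code point escapes like \u00FC."""
    def convert(match):
        # Drop the backslashes, parse the hex pairs, decode as UTF-8.
        raw = bytes.fromhex(match.group(0).replace("\\", ""))
        decoded = raw.decode("utf-8")
        return "".join(
            "\\u%04X" % ord(ch) if ord(ch) <= 0xFFFF else "\\U%08X" % ord(ch)
            for ch in decoded
        )
    # One or more consecutive \HH chunks form a single run to decode.
    return re.sub(r"(?:\\[0-9A-Fa-f]{2})+", convert, text)

converted = byte_escapes_to_unicode_escapes(r"M\C3\BCnchen")
assert converted == r"M\u00FCnchen"
```

The result is meant to be placed in the query text with a single
backslash, so that the SPARQL parser resolves it.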

POSIX and java.lang.Character classes are illegal in \p{} and \P{}.

More details:
http://xerces.apache.org/xerces2-j/javadocs/xerces2/org/apache/xerces/impl/xpath/regex/RegularExpression.html

	Andy
> 
> But other things are legal, so it isn't just a matter of (re)escaping
> the "\" automatically: \p{Lu} for example (Unicode uppercase letters)
> is legal.  That would be written "\\p{Lu}": \\ is the SPARQL string
> escape that gets the \ into the string, and then \p is meaningful to
> the regex engine.
> 
> And oddly "\\n" and "\n" match the same thing.
> 
> I think that, strictly, the SPARQL processor should parse the regex to
> find the escapes that are and are not legal. 
> 
> The "\\n" form passes "\" and "n" to the regex engine, which treats
> them as the escape for a newline.  But "\n" has a raw newline put
> there by the SPARQL parser.
> 
> 	Andy
> 
> 
> http://www.w3.org/TR/xmlschema-2/#regexs
> as modified by:
> http://www.w3.org/TR/xpath-functions/#regex-syntax
Received on Thursday, 14 September 2006 08:39:54 UTC