- From: Seaborne, Andy <andy.seaborne@hp.com>
- Date: Thu, 14 Sep 2006 09:39:41 +0100
- To: "Denis Gaertner" <denis_gaertner@gmx.net>, <public-sparql-dev@w3.org>
-------- Original Message --------
> From: Seaborne, Andy <>
> Date: 13 September 2006 16:20
>
> -------- Original Message --------
> > From: Denis Gaertner <>
> > Date: 13 September 2006 12:30
> >
> > Hi,
> >
> > I got another question.
> >
> > It's about character escaping in the regex function. There seem to be
> > two ways for this.
> >
> > \\x00 for one character which is ASCII and \u0000 for a unicode
> > codepoint. This goes fine. I tried \\x{..} as well but it doesn't seem to
> > work in my environment. My problem is that I get escaped characters
> > which have a longer hexcode in chunks of two, i.e. u + Umlaut U+00FC
> > / C3BC as "\C3\BC". If you have only characters like that in a foreign
> > script you get a whole line like this and it is a problem to
> > know which is which. So I was wondering if it is somehow possible to
> > simply transfer this to \\xc3\\xbc.. in a regular expression without
> > having to use unicode codepoints.
> >
> > Thanks again
> >
> > Denis
>
> Hi Denis,
>
> A SPARQL query string is defined to be UTF-8. If you're working in a
> UTF-8 aware editor you can put in exactly the characters you need
> (being careful if written to disk etc etc).
>
> SPARQL uses the Xpath/Xquery regular expression language. You have to
> follow a bunch of links for this because the Xpath/Xquery regular
> expression language itself refers back to the XML schema regex
> language.
> Links below.
>
> Now \x isn't a legal escape sequence ... well - I can't find it anyway.
> You should use the \u form. I guess your engine is relying on some
> underlying regex engine that just happens to have \x built in.
>
> Consider "\\x20". That is a SPARQL string with \ (one of them), then
> "x", "2", "0". It should match that substring and only that substring,
> not a space. But to, say, just for example, random choice, Java, that
> is another escape sequence meaning ASCII hex 20 (space).
> java.util.regex.Pattern describes the pattern language - it is a
> superset of the Xpath/Xquery regex language.
>
> The bug is that the SPARQL engine should re-escape the literal "\" so
> the string is seen as a regex for Java with the literal "\" not the \x
> escape sequence.

I was wrong about this bit: "\\x20" is illegal as a regular expression; it
is not a plain string to be matched. \\x becomes \x in the regular
expression pattern and that's an illegal escape sequence by
http://www.w3.org/TR/xmlschema-2/#regexs

As far as I can determine the situation is:

The relevant rules are:

[9]   atom ::= Char | charClass | ( '(' regExp ')' )
[10]  Char ::= [^.\?*+{}()|^$#x5B#x5D]
[11]  charClass ::= charClassEsc | charClassExpr | WildcardEsc
[23]  charClassEsc ::= ( SingleCharEsc | MultiCharEsc | catEsc | complEsc | backReference )
[23a] backReference ::= "\" [1-9][0-9]*
[24]  SingleCharEsc ::= '\' [nrt\|.?*+(){}#x2D#x5B#x5D#x5E]
[25]  catEsc ::= '\p{' charProp '}'
[26]  complEsc ::= '\P{' charProp '}'
[37]  MultiCharEsc ::= '\' [sSiIcCdDwW]

(10, 11, 23, and 24 are modified by
http://www.w3.org/TR/xpath-functions/#regex-syntax)

so the escapes in Java, but not in SPARQL are:

\e \x \b \B \G \Z \A \Q \E \u

and \c has a different meaning.

\u has a bizarre interaction: \u0020 is legal, \\u0020 is in the Java
escapes but not the SPARQL ones.

POSIX and java.lang.Character classes are illegal in \p{}, \P{}

More details:
http://xerces.apache.org/xerces2-j/javadocs/xerces2/org/apache/xerces/impl/xpath/regex/RegularExpression.html

 Andy

>
> But. Other things are legal so it isn't just a matter of (re)escaping
> the "\" automatically: \p{Lu} for example (Unicode uppercase letters)
> is legal. That would be written "\\p{Lu}": the \\ is the (SPARQL)
> string escape that gets the \ into the string. Then \p is meaningful.
>
> And oddly "\\n" and "\n" match the same thing.
>
> I think that, strictly, the SPARQL processor should parse the regex to
> find the escapes that are and are not legal.
>
> The "\\n" form passes "\" and "n" to the regex engine, which treats
> them as the escape for a newline. But "\n" has a raw newline put there
> by the SPARQL parser.
>
>  Andy
>
>
> http://www.w3.org/TR/xmlschema-2/#regexs
> as modified by:
> http://www.w3.org/TR/xpath-functions/#regex-syntax
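
For illustration only, the escape rules quoted in the correction above
([23a], [24], [25], [26], [37]) could be checked along these lines. This is
a rough sketch in Python, not how any particular SPARQL engine does it. The
function name illegal_escapes is made up for this example, the input is
assumed to be the pattern after SPARQL string unescaping (so "\\x20" in the
query arrives as \x20), and the contents of \p{...} are not validated.

```python
# Characters allowed after "\" by rule [24] SingleCharEsc, copied from the
# rule as quoted in the message (assumption: no further characters added).
SINGLE_CHAR_ESC = set("nrt\\|.?*+(){}-[]^")
# Characters allowed after "\" by rule [37] MultiCharEsc.
MULTI_CHAR_ESC = set("sSiIcCdDwW")
# First digit of a back-reference, rule [23a].
BACKREF_START = set("123456789")


def illegal_escapes(pattern):
    """Return the backslash escapes in `pattern` not covered by the quoted rules.

    Hypothetical helper: `pattern` is the regex text as the regex engine
    would receive it, i.e. after SPARQL string escapes were processed.
    """
    bad = []
    i = 0
    while i < len(pattern):
        if pattern[i] != "\\":
            i += 1
            continue
        # Character following the backslash ("" if the pattern ends here).
        nxt = pattern[i + 1] if i + 1 < len(pattern) else ""
        if nxt in SINGLE_CHAR_ESC or nxt in MULTI_CHAR_ESC:
            pass  # rules [24] and [37]
        elif nxt in BACKREF_START:
            pass  # rule [23a]
        elif nxt in ("p", "P") and pattern[i + 2:i + 3] == "{":
            pass  # rules [25] and [26]; charProp itself is not checked
        else:
            bad.append("\\" + nxt)
        # Skip the backslash and the escaped character together, so that
        # an escaped backslash is not read as the start of a new escape.
        i += 2
    return bad
```

With this sketch, illegal_escapes(r"\x20") reports \x, while r"\p{Lu}" and
r"\n" come back clean. A raw newline character in the pattern (the "\n"
case in the query) contains no backslash at this level, so it is not
reported either.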
Received on Thursday, 14 September 2006 08:39:54 UTC