- From: Seaborne, Andy <andy.seaborne@hp.com>
- Date: Thu, 14 Sep 2006 09:39:41 +0100
- To: "Denis Gaertner" <denis_gaertner@gmx.net>, <public-sparql-dev@w3.org>
-------- Original Message --------
> From: Seaborne, Andy <>
> Date: 13 September 2006 16:20
>
> -------- Original Message --------
> > From: Denis Gaertner <>
> > Date: 13 September 2006 12:30
> >
> > Hi,
> >
> > I got another question.
> >
> > It's about character escaping in the regex function. There seem to be
> > two ways for this.
> >
> > \\x00 for one character which is ASCII and \u0000 for a unicode
> > codepoint. This goes fine. I tried \\x{..} as well but doesn't seem to
> > work in my environment. My problem is that I get escaped characters
> > which have a longer hexcode in chunks of two, i.e. u + Umlaut U+00FC
> > / C3BC as "\C3\BC". If you have only characters like that in a foreign
> > script you get a whole line like this and it is a problem on how to
> > know which is which. So I was wondering if it is somehow possible to
> > simply transfer this to \\xc3\\xbc.. in a regular expression without
> > having to use unicode codepoints.
> >
> > Thanks again
> >
> > Denis
>
> Hi Denis,
>
> A SPARQL query string is defined to be UTF-8. If you're working in a
> UTF-8 aware editor you can put in exactly the characters you need
> (being careful that the encoding survives when written to disk, etc.).
>
> SPARQL uses the XPath/XQuery regular expression language. You have to
> follow a bunch of links for this because the XPath/XQuery regular
> expression language itself refers back to the XML Schema regex
> language.
> Links below.
>
> Now \x isn't a legal escape sequence ... well - I can't find it anyway.
> You should use the \u form. I guess your engine is relying on some
> underlying regex engine that just happens to have \x built in.
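
For the "\C3\BC" case in the original question, one option is to decode
the byte pairs once, up front, and then write the result either as the
characters themselves or in the \u form. A minimal Python sketch,
assuming the input really is UTF-8 bytes written as a backslash plus two
hex digits; decode_byte_escapes and to_u_escapes are names made up for
this illustration, not part of any SPARQL library:

```python
import re

# One or more consecutive "\HH" chunks, e.g. \C3\BC
BYTE_RUN = re.compile(r"(?:\\[0-9A-Fa-f]{2})+")


def decode_byte_escapes(text):
    """Turn runs like \\C3\\BC (UTF-8 bytes in hex) into the characters they encode."""
    def replace_run(match):
        hex_pairs = re.findall(r"\\([0-9A-Fa-f]{2})", match.group(0))
        raw = bytes(int(pair, 16) for pair in hex_pairs)
        # Raises UnicodeDecodeError if the run is not valid UTF-8.
        return raw.decode("utf-8")
    return BYTE_RUN.sub(replace_run, text)


def to_u_escapes(text):
    """Write every non-ASCII character as \\uXXXX (or \\UXXXXXXXX above U+FFFF)."""
    out = []
    for ch in text:
        cp = ord(ch)
        if cp < 0x80:
            out.append(ch)
        elif cp <= 0xFFFF:
            out.append("\\u%04X" % cp)
        else:
            out.append("\\U%08X" % cp)
    return "".join(out)


decoded = decode_byte_escapes(r"M\C3\BCller")   # u-umlaut as a real character
escaped = to_u_escapes(decoded)                 # same text with a \u escape
```

Note the sketch cannot tell a genuine backslash followed by two
hex-looking letters from an escaped byte, so it is only safe when you
know how the input was produced.
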
>
> Consider "\\x20". That is a SPARQL string with \ (one of them), then
> "x", "2", "0". It should match that substring and only that substring,
> not a space. But to, say, just for example, random choice, Java, that
> is another escape sequence meaning ASCII hex 20 (space).
> java.util.regex.Pattern describes the pattern language - it is a
> superset of the XPath/XQuery regex language.
>
> The bug is that the SPARQL engine should re-escape the literal "\" so
> the string is seen as a regex for Java with the literal "\" not the \x
> escape sequence.

I was wrong about this bit: "\\x20" is illegal as a regular expression;
it is not a plain string to be matched. \\x becomes \x in the regular
expression pattern and that's an illegal escape sequence by
http://www.w3.org/TR/xmlschema-2/#regexs

As far as I can determine, the situation is as follows. The relevant
rules are:

[9]   atom          ::= Char | charClass | ( '(' regExp ')' )
[10]  Char          ::= [^.\?*+{}()|^$#x5B#x5D]
[11]  charClass     ::= charClassEsc | charClassExpr | WildcardEsc
[23]  charClassEsc  ::= ( SingleCharEsc | MultiCharEsc | catEsc | complEsc | backReference )
[23a] backReference ::= "\" [1-9][0-9]*
[24]  SingleCharEsc ::= '\' [nrt\|.?*+(){}#x2D#x5B#x5D#x5E]
[25]  catEsc        ::= '\p{' charProp '}'
[26]  complEsc      ::= '\P{' charProp '}'
[37]  MultiCharEsc  ::= '\' [sSiIcCdDwW]

(10, 11, 23, and 24 are modified by
http://www.w3.org/TR/xpath-functions/#regex-syntax)
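
So the escapes that are in Java but not in SPARQL are:

\e \x \b \B \G \Z \A \Q \E \u

and \c has a different meaning. \u has a bizarre interaction: "\u0020"
is legal, because the SPARQL parser turns it into a space before the
regex engine sees it, whereas "\\u0020" reaches the regex engine as
\u0020, which is in the Java escapes but not the SPARQL ones.

POSIX and java.lang.Character classes are illegal in \p{} and \P{}.
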
More details:
http://xerces.apache.org/xerces2-j/javadocs/xerces2/org/apache/xerces/impl/xpath/regex/RegularExpression.html
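
To make the escape rules concrete, here is a rough Python sketch of a
checker that scans a pattern (after SPARQL string unescaping) and
reports any backslash escape not covered by rules [23] to [37] as quoted
in this message. The function name illegal_escapes is invented for the
illustration, the character sets are transcribed from this message and
should be checked against the current specifications before relying on
them, and the sketch looks only at escapes; it is not a full regex
parser.

```python
# Characters allowed after "\" by rule [24] (#x2D is -, #x5B [, #x5D ], #x5E ^)
SINGLE_CHAR_ESC = set("nrt\\|.?*+(){}-[]^")
# Characters allowed after "\" by rule [37]
MULTI_CHAR_ESC = set("sSiIcCdDwW")
# First digit of a back-reference, rule [23a]
BACKREF_START = set("123456789")


def illegal_escapes(pattern):
    """Return the backslash escapes in `pattern` that the quoted rules do not allow."""
    bad = []
    i = 0
    while i < len(pattern):
        if pattern[i] != "\\":
            i += 1
            continue
        # Character after the backslash; empty if the pattern ends with "\"
        nxt = pattern[i + 1] if i + 1 < len(pattern) else ""
        if nxt in SINGLE_CHAR_ESC or nxt in MULTI_CHAR_ESC or nxt in BACKREF_START:
            pass  # SingleCharEsc, MultiCharEsc or backReference
        elif nxt in ("p", "P") and pattern[i + 2:i + 3] == "{":
            pass  # catEsc / complEsc; the charProp name itself is not checked
        else:
            bad.append("\\" + nxt)
        i += 2  # skip the backslash and the character it escapes
    return bad


print(illegal_escapes(r"\x20"))          # the \x escape is reported
print(illegal_escapes(r"\p{Lu}\d+\."))   # nothing reported
```

A processor could run a check like this before handing the pattern to a
more permissive engine such as java.util.regex, and raise an error when
the list is not empty.
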
Andy
>
> But other things are legal, so it isn't just a matter of (re)escaping
> the "\" automatically: \p{Lu} for example (Unicode uppercase letters)
> is legal. That would be written "\\p{Lu}": the \\ is the SPARQL string
> escape that gets a single \ into the string, and then \p is meaningful
> to the regex engine.
>
> And oddly "\\n" and "\n" match the same thing.
>
> I think that, strictly, the SPARQL processor should parse the regex to
> find the escapes that are and are not legal.
>
> The "\\n" form passes "\" and "n" to the regex engine, which reads
> them as the escape for a newline. But "\n" has a raw newline put there
> by the SPARQL parser.
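
The two layers can be seen in a small Python sketch. It assumes a
simplified SPARQL string unescaper (unescape_sparql_string is a made-up
name, and real processors handle \u escapes over the whole query rather
than per string) and uses Python's re module purely as a stand-in regex
engine that, like Java's, happens to accept \x:

```python
import re

# SPARQL string escapes: \t \b \n \r \f \\ \" \'
SPARQL_STRING_ESCAPES = {
    "t": "\t", "b": "\b", "n": "\n", "r": "\r", "f": "\f",
    "\\": "\\", '"': '"', "'": "'",
}


def unescape_sparql_string(body):
    """Apply SPARQL string escapes to the text between the quotes (simplified)."""
    out = []
    i = 0
    while i < len(body):
        ch = body[i]
        if ch != "\\":
            out.append(ch)
            i += 1
            continue
        nxt = body[i + 1:i + 2]
        if nxt == "u":                      # \uXXXX codepoint escape
            out.append(chr(int(body[i + 2:i + 6], 16)))
            i += 6
        elif nxt == "U":                    # \UXXXXXXXX codepoint escape
            out.append(chr(int(body[i + 2:i + 10], 16)))
            i += 10
        elif nxt in SPARQL_STRING_ESCAPES:
            out.append(SPARQL_STRING_ESCAPES[nxt])
            i += 2
        else:
            raise ValueError("illegal SPARQL string escape: \\" + nxt)
    return "".join(out)


escaped_form = unescape_sparql_string(r"\\n")   # two characters: \ and n
raw_form = unescape_sparql_string(r"\n")        # one character: a newline

# Both patterns find the newline in the subject string.
print(bool(re.search(escaped_form, "x\ny")), bool(re.search(raw_form, "x\ny")))
```

The same function shows why "\\x20" is a problem: after unescaping, the
pattern is the four characters \x20, which is then up to the regex
engine to accept or reject.
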
>
> Andy
>
>
> http://www.w3.org/TR/xmlschema-2/#regexs
> as modified by:
> http://www.w3.org/TR/xpath-functions/#regex-syntax
Received on Thursday, 14 September 2006 08:39:54 UTC