RE: character escaping from Seaborne, Andy on 2006-09-13 (public-sparql-dev@w3.org from July to September 2006)

From: Seaborne, Andy <andy.seaborne@hp.com>
Date: Wed, 13 Sep 2006 16:19:50 +0100
To: "Denis Gaertner" <denis_gaertner@gmx.net>, <public-sparql-dev@w3.org>
Message-ID: <86FE9B2B91ADD04095335314BE6906E868AD6C@sdcexc04.emea.cpqcorp.net>

-------- Original Message --------
> From: Denis Gaertner <>
> Date: 13 September 2006 12:30
> 
> Hi,
> 
> I got another question.
> 
> It's about character escaping in the regex function. There seem to be
> two ways for this. 
> 
> \\x00 for one character which is ASCII and \u0000 for a unicode
> codepoint. This goes fine. I tried \\x{..} as well but doesn't seem to
> work in my environment. My problem is that I get escaped characters
> which have a longer hexcode in chunks of two, i.e.  u + Umlaut U+00FC /
> C3BC as "\C3\BC". If you have only characters like that in a foreign
> script you get a whole line like this and it is a problem on how to
> know which is which. So I was wondering if it is somehow possible to
> simply transfer this to \\xc3\\xbc.. in a regular expression without
> having to use unicode codepoints.        
> 
> Thanks again
> 
> Denis

Hi Denis,

A SPARQL query string is defined to be UTF-8.  If you're working in a
UTF-8-aware editor you can put in exactly the characters you need (taking
care that the encoding is preserved if the query is written to disk, etc.).

SPARQL uses the XPath/XQuery regular expression language.  You have to
follow a bunch of links for this because the XPath/XQuery regular
expression language itself refers back to the XML Schema regex language.
Links below.

Now \x isn't a legal escape sequence ... well - I can't find it anyway.
You should use the \u form.  I guess your engine is relying on some
underlying regex engine that just happens to have \x built in.
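
For the "\C3\BC" case in the question, one way to get to the \u form is
to decode the byte pairs as UTF-8 and print each resulting code point as
a codepoint escape.  The following is only a rough sketch: the class and
method names are made up for illustration, and it assumes the input
really is well-formed UTF-8 written as two hex digits per byte.

```java
import java.nio.charset.StandardCharsets;

public class Utf8ToSparqlEscape {
    // Hypothetical helper: takes hex-encoded UTF-8 bytes, with or without
    // a backslash in front of each pair, and returns one SPARQL codepoint
    // escape per decoded character (4 hex digits, or 8 above U+FFFF).
    public static String toCodepointEscapes(String hexBytes) {
        String hex = hexBytes.replace("\\", "");
        byte[] bytes = new byte[hex.length() / 2];
        for (int i = 0; i < bytes.length; i++) {
            bytes[i] = (byte) Integer.parseInt(hex.substring(2 * i, 2 * i + 2), 16);
        }
        // Assumption: the byte sequence is valid UTF-8.
        String decoded = new String(bytes, StandardCharsets.UTF_8);
        StringBuilder out = new StringBuilder();
        decoded.codePoints().forEach(cp -> {
            if (cp <= 0xFFFF) {
                out.append(String.format("\\u%04X", cp));
            } else {
                out.append(String.format("\\U%08X", cp));
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        // The two bytes C3 BC decode to the single code point U+00FC.
        System.out.println(toCodepointEscapes("\\C3\\BC"));
    }
}
```

The point is that the two-byte chunks are grouped by the UTF-8 decoder,
so you don't have to work out by hand which bytes belong together.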

Consider "\\x20".  That is a SPARQL string with \ (one of them), then
"x", "2", "0".   It should match that substring and only that substring,
not a space.  But to, say, just for example, random choice, Java, that
is another escape sequence meaning ASCII 0x20 (space).
java.util.regex.Pattern describes the pattern language - it is a
superset of the XPath/XQuery regex language.

The bug is that the SPARQL engine does not re-escape the literal "\".
It should, so that Java sees a regex containing the literal "\" and not
the \x escape sequence.
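
To make that concrete, here is a small sketch of what java.util.regex
does with those four characters, and what a naive re-escape of the "\"
would do instead.  The class and method names are just for illustration
and are not taken from any SPARQL engine.

```java
import java.util.regex.Pattern;

public class HexEscapeDemo {
    // The four characters \ x 2 0, i.e. what the regex engine receives
    // once the SPARQL string "\\x20" has been unescaped.
    static final String REGEX = "\\x20";

    // Java reads \x20 as a hex escape, so the pattern finds a space.
    public static boolean javaTreatsAsSpace() {
        return Pattern.compile(REGEX).matcher("a b").find();
    }

    // Naive re-escape: double every backslash before compiling, so the
    // pattern matches the literal text \x20 and no longer a space.
    public static boolean reEscapedMatchesLiteralOnly() {
        String reEscaped = REGEX.replace("\\", "\\\\");
        Pattern p = Pattern.compile(reEscaped);
        return p.matcher("a\\x20b").find() && !p.matcher("a b").find();
    }

    public static void main(String[] args) {
        System.out.println(javaTreatsAsSpace());
        System.out.println(reEscapedMatchesLiteralOnly());
    }
}
```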

But other things are legal, so it isn't just a matter of (re)escaping
the "\" automatically: \p{Lu}, for example (Unicode uppercase letters), is
legal.  In a SPARQL string that is written "\\p{Lu}": the \\ is the
SPARQL string escape that gets a single \ into the string, and then \p
is meaningful to the regex engine.
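
A quick sketch of the \p{Lu} case in Java (names are illustrative only).
The Java string literal happens to use the same doubling as the SPARQL
string, so the regex engine sees a single "\" followed by "p{Lu}".

```java
import java.util.regex.Pattern;

public class CategoryEscapeDemo {
    // Backslash, p, {Lu}: the category escape for uppercase letters.
    static final Pattern UPPER = Pattern.compile("\\p{Lu}");

    // True if the input contains at least one uppercase letter.
    public static boolean hasUppercase(String s) {
        return UPPER.matcher(s).find();
    }

    public static void main(String[] args) {
        System.out.println(hasUppercase("sparql Query"));
        System.out.println(hasUppercase("sparql query"));
    }
}
```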

And oddly "\\n" and "\n" match the same thing.

I think that, strictly, the SPARQL processor should parse the regex to
find the escapes that are and are not legal.

The "\\n" form passes "\" and "n" to the regex engine, which treats them
as the escape for a newline.  But "\n" has a raw newline put there by the
SPARQL parser.
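
Here is the same thing as a sketch in Java, with the two strings as the
regex engine would receive them (again, names are only illustrative):

```java
import java.util.regex.Pattern;

public class NewlineDemo {
    // From SPARQL "\\n": the two characters backslash and n.
    static final String ESCAPED = "\\n";
    // From SPARQL "\n": a single raw newline character.
    static final String RAW = "\n";

    // True if the given regex finds a match in a two-line string.
    public static boolean matchesNewline(String regex) {
        return Pattern.compile(regex).matcher("line1\nline2").find();
    }

    public static void main(String[] args) {
        System.out.println(ESCAPED.length() + " " + matchesNewline(ESCAPED));
        System.out.println(RAW.length() + " " + matchesNewline(RAW));
    }
}
```

The two patterns are different strings (two characters versus one) but
both end up matching the newline.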

	Andy


http://www.w3.org/TR/xmlschema-2/#regexs
as modified by:
http://www.w3.org/TR/xpath-functions/#regex-syntax

Received on Wednesday, 13 September 2006 15:20:21 UTC