- From: Seaborne, Andy <andy.seaborne@hp.com>
- Date: Wed, 13 Sep 2006 16:19:50 +0100
- To: "Denis Gaertner" <denis_gaertner@gmx.net>, <public-sparql-dev@w3.org>
-------- Original Message --------
> From: Denis Gaertner <>
> Date: 13 September 2006 12:30
>
> Hi,
>
> I got another question.
>
> It's about character escaping in the regex function. There seem to be
> two ways for this.
>
> \\x00 for one character which is ASCII and \u0000 for a unicode
> codepoint. This goes fine. I tried \\x{..} as well but doesn't seem to
> work in my environment. My problem is that I get escaped characters
> which have a longer hexcode in chunks of two, i.e. u + Umlaut U+00FC /
> C3BC as "\C3\BC". If you have only characters like that in a foreign
> script you get a whole line like this and it is a problem on how to
> know which is which. So I was wondering if it is somehow possible to
> simply transfer this to \\xc3\\xbc.. in a regular expression without
> having to use unicode codepoints.
>
> Thanks again
>
> Denis

Hi Denis,

A SPARQL query string is defined to be UTF-8. If you're working in a UTF-8 aware editor you can put in exactly the characters you need (being careful that the encoding is preserved when the query is written to disk, etc.).

SPARQL uses the XPath/XQuery regular expression language. You have to follow a bunch of links for this because the XPath/XQuery regular expression language itself refers back to the XML Schema regex language. Links below.

Now \x isn't a legal escape sequence ... well - I can't find it anyway. You should use the \u form.

I guess your engine is relying on some underlying regex engine that just happens to have \x built in. Consider "\\x20". That is a SPARQL string with \ (one of them), then "x", "2", "0". It should match that substring and only that substring, not a space. But to, say, just for example, random choice, Java, that is an escape sequence meaning ASCII 0x20 (space). java.util.regex.Pattern describes the pattern language - it is a superset of the XPath/XQuery regex language.

The bug is that the SPARQL engine should re-escape the literal "\" so that Java sees the string as a regex containing a literal "\", not the \x escape sequence.
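The "\\x20" case can be reproduced directly against java.util.regex. The following is only a minimal sketch: the class and method names are made up for illustration, and Pattern.quote stands in for "treat the backslash literally". It quotes the whole pattern, so it is not something a real SPARQL engine could apply blindly to every regex.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, not part of any SPARQL engine.
class XEscapeDemo {
    // The four characters \ x 2 0: what the SPARQL literal "\\x20" denotes
    // once the SPARQL parser has processed the string escape.
    static final String DECODED = "\\x20";

    // Hand the decoded string straight to java.util.regex:
    // Java reads \x20 as a hex escape for the space character.
    static boolean findsPassedThrough(String input) {
        return Pattern.compile(DECODED).matcher(input).find();
    }

    // Quote the string first so every character, including the
    // backslash, is matched literally.
    static boolean findsQuoted(String input) {
        return Pattern.compile(Pattern.quote(DECODED)).matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(findsPassedThrough("a b"));     // true: \x20 matched the space
        System.out.println(findsPassedThrough("a\\x20b")); // false: input has no space
        System.out.println(findsQuoted("a b"));            // false: no literal \x20 present
        System.out.println(findsQuoted("a\\x20b"));        // true: literal \x20 found
    }
}
```

The first pair of calls shows the behaviour described above: the underlying engine's own \x escape leaks through. The second pair shows the literal reading once the backslash is no longer treated as an escape character.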
But other things are legal, so it isn't just a matter of (re)escaping the "\" automatically: \p{Lu}, for example (Unicode uppercase letters), is legal. That would be written "\\p{Lu}": the \\ is the SPARQL string escape that gets the \ into the string, and then \p is meaningful to the regex engine.

And, oddly, "\\n" and "\n" match the same thing. I think that, strictly, the SPARQL processor should parse the regex to find the escapes that are and are not legal. The "\\n" form passes "\" and "n" to the regex engine, which reads them as the escape for a newline. But "\n" has a raw newline put there by the SPARQL parser.

 Andy

http://www.w3.org/TR/xmlschema-2/#regexs
as modified by:
http://www.w3.org/TR/xpath-functions/#regex-syntax
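The two legal cases in the reply (\p{Lu}, and "\\n" versus "\n") can also be checked against java.util.regex. This is a sketch under the assumption that the decoded SPARQL string is handed to Java unchanged; the class and method names are hypothetical. Note that the Java source needs its own layer of escaping, so the two-character regex \n is written "\\n" here, while "\n" in Java source is a single raw newline character.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, for illustration only.
class LegalEscapeDemo {
    // SPARQL "\\p{Lu}" decodes to \p{Lu}, a category escape for
    // Unicode uppercase letters.
    static boolean hasUppercase(String input) {
        return Pattern.compile("\\p{Lu}").matcher(input).find();
    }

    // SPARQL "\\n": the regex engine receives backslash + n and
    // interprets the pair as a newline escape.
    static boolean escapedNewlineMatches(String input) {
        return Pattern.compile("\\n").matcher(input).find();
    }

    // SPARQL "\n": the SPARQL parser has already put a raw newline
    // character into the pattern, which matches itself.
    static boolean rawNewlineMatches(String input) {
        return Pattern.compile("\n").matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(hasUppercase("abc"));            // false
        System.out.println(hasUppercase("\u00DCber"));      // true: U+00DC is an uppercase letter
        System.out.println(escapedNewlineMatches("a\nb"));  // true
        System.out.println(rawNewlineMatches("a\nb"));      // true
    }
}
```

Both newline forms find the same newline in the input, although they reach the regex engine as different character sequences.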
Received on Wednesday, 13 September 2006 15:20:21 UTC