Escaping unicode literals in URIs in programming language parsers (not Java, C or C++)

Hi,

I'm using the W3C Turtle testsuite to fix parsing bugs in rdf4h, a
Haskell library for handling RDF.
https://github.com/robstewart57/rdf4h
http://hackage.haskell.org/package/rdf4h

Here are the Turtle test cases I'm using:
http://www.w3.org/2013/TurtleTests/

There are a number of failing cases in the library, because Unicode
escape sequences are not being decoded. For example, parsing
http://www.w3.org/2013/TurtleTests/IRI_with_four_digit_numeric_escape.ttl
should produce the triples in http://www.w3.org/2013/TurtleTests/IRI_spo.nt
That is,

<http://a.example/\u0073> <http://a.example/p> <http://a.example/o> .

becomes

<http://a.example/s> <http://a.example/p> <http://a.example/o> .

This test case is saying: if you find '\u0073', replace it with 's'.

If you look at the Unicode character page for the Latin small letter
's', it says that the Java and C source code for this character is
"\u0073".
http://www.fileformat.info/info/unicode/char/0073/index.htm

The escape syntax in Haskell is not the same as in Java, C or C++, so
rather than "\uXXXX" it is "\xXXXX". For example:

ghci > "\x0073"
"s"

In Haskell, "\u" is not a valid escape sequence. The "\" in "\u"
therefore needs escaping with another "\":

ghci > "http://a.example/\x0073"
"http://a.example/s"
ghci > "http://a.example/\\u0073"
"http://a.example/\\u0073"

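To make the behaviour I'm asking about concrete, here is a minimal
sketch of the decoding step the test case seems to expect, assuming the
parser has already read the text between '<' and '>' into a String.
The names unescapeIRI and decode are hypothetical and are not part of
rdf4h. It only handles \uXXXX and \UXXXXXXXX and does not check whether
the decoded character is allowed in an IRI.

```haskell
import Data.Char (chr, isHexDigit)
import Numeric (readHex)

-- | Replace numeric escapes (\uXXXX and \UXXXXXXXX) with the character
-- they denote. Everything else is passed through unchanged.
-- Hypothetical helper for illustration, not an rdf4h function.
unescapeIRI :: String -> String
unescapeIRI ('\\' : 'u' : rest) = decode 4 'u' rest
unescapeIRI ('\\' : 'U' : rest) = decode 8 'U' rest
unescapeIRI (c : cs)            = c : unescapeIRI cs
unescapeIRI []                  = []

-- | Decode exactly n hex digits following the marker. If the digits are
-- missing or malformed, the backslash and marker are kept as they were.
decode :: Int -> Char -> String -> String
decode n marker rest =
  case splitAt n rest of
    (hex, rest')
      | length hex == n
      , all isHexDigit hex
      , [(code, "")] <- readHex hex
      , code <= 0x10FFFF
      -> chr code : unescapeIRI rest'
    _ -> '\\' : marker : unescapeIRI rest
```

With this loaded, the escaped string from above decodes as the test
case expects:

ghci > unescapeIRI "http://a.example/\\u0073"
"http://a.example/s"
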
My question is this: Should http://a.example/\u0073 always be
translated to the URI http://a.example/s by every RDF parser, in any
programming language? Or are the W3C Turtle test cases about escaping
\uXXXX in URIs specific only to RDF parsers for Java, C and C++?

I've asked a related question on Stack Overflow which provides more detail:
http://stackoverflow.com/questions/33250184/unescaping-unicode-literals-found-in-haskell-strings

Thanks,

--
Rob Stewart

Received on Friday, 23 October 2015 18:12:01 UTC