Re: Should a Turtle parser handle UTF-16 surrogate pairs when processing numeric escapes in string literals and IRIs?

* Giovanni Mels <giovanni.mels@agfa.com> [2016-05-23 12:40+0200]
> E.g. consider a string literal "\uD864\uDD54".
> 
> Is this allowed or not? Section 6.4 of the Turtle recommendation (
> https://www.w3.org/TR/turtle/) is not clear on this.

I think this is a Unicode question, because Turtle (and SPARQL)
documents are only ever encoded in UTF-8. If you want to write down the
character U+29154, you can write "\U00029154" (or just "𩅔", which is
encoded as 0xF0 0xA9 0x85 0x94). I believe the UTF-16 byte sequence
0xD8 0x64 0xDD 0x54 can't be expressed directly in UTF-8, as surrogate
code points are excluded from UTF-8.
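
To check the byte values above, here is a minimal Python sketch. The
helper names utf8_hex and is_utf8_encodable are made up for this
illustration and are not part of any Turtle library. It relies on
Python's strict UTF-8 codec refusing lone surrogate code points.

```python
def utf8_hex(code_point):
    """Return the UTF-8 encoding of a code point as a lowercase hex string."""
    return chr(code_point).encode("utf-8").hex()


def is_utf8_encodable(code_point):
    """True if the code point can be encoded by Python's strict UTF-8 codec."""
    try:
        chr(code_point).encode("utf-8")
        return True
    except UnicodeEncodeError:
        # Surrogate code points (U+D800..U+DFFF) land here.
        return False


print(utf8_hex(0x29154))          # f0a98594
print(is_utf8_encodable(0xD864))  # False: a lone high surrogate
print(is_utf8_encodable(0xDD54))  # False: a lone low surrogate
```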

As in XML, if you really need the UTF-16 bytes themselves, you can carry
them in an xsd:hexBinary or xsd:base64Binary literal. That's kind of a
pain because you can't directly use string functions on it, but that's
probably reasonable. Using regexps on UTF-8 encodings of UTF-16 byte
sequences is kind of like grepping for byte sequences in a directory of
PNGs.
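
As a rough illustration of the binary-literal route, the sketch below
produces the lexical forms for the four UTF-16 bytes in question. The
function names are hypothetical, and big-endian byte order is an
assumption made here; a consumer would have to know it out of band.

```python
import base64

# The UTF-16 (big-endian) bytes of the surrogate pair \uD864\uDD54.
utf16_bytes = bytes.fromhex("D864DD54")


def as_hex_binary(data):
    """Lexical form suitable for an xsd:hexBinary literal."""
    return data.hex().upper()


def as_base64_binary(data):
    """Lexical form suitable for an xsd:base64Binary literal."""
    return base64.b64encode(data).decode("ascii")


print(as_hex_binary(utf16_bytes))     # D864DD54
print(as_base64_binary(utf16_bytes))  # 2GTdVA==

# Getting the character back needs an explicit decode step, which is
# why string functions cannot be applied to the literal directly.
print(utf16_bytes.decode("utf-16-be") == "\U00029154")  # True
```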


> "A Unicode character in the range U+0000 to U+FFFF inclusive corresponding 
> to the value encoded by the four hexadecimal digits interpreted from most 
> significant to least significant digit."
> 
> The surrogate values fall into the range U+0000 to U+FFFF, but are not 
> characters. A Turtle parser should either reject this, or parse it as 
> "\U00029154".
> 
> Both are valid approaches: In Java 'String s = "\uD864\uDD54";' compiles, 
> in C++ 'std::string str = u8"\uD864\uDD54";' gives a compile error.

I think Java is being a bit generous in treating "\uD864\uDD54" as a
synonym for U+29154. It can do that pretty easily because its strings
are UTF-16 internally, so it would represent the latter as the former
anyway. Come to think of it, can you even write \u.... sequences for
characters outside the BMP without splitting them into UTF-16 surrogates
yourself? I recall working around an issue like that in ECMAScript.
Ahh, good old UCS-2.
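
For concreteness, here is a sketch of the surrogate arithmetic and of
the two parser behaviours discussed in this thread (reject, or fold the
pair into one code point). All names, including the combine_pairs flag,
are invented for this example; it is not what any particular Turtle
parser does. It also ignores escaped backslashes and the other Turtle
string escapes.

```python
import re

# An adjacent \uHHHH high surrogate followed by a \uHHHH low surrogate.
_PAIR = re.compile(
    r"\\u([Dd][89ABab][0-9A-Fa-f]{2})\\u([Dd][C-Fc-f][0-9A-Fa-f]{2})")
# Any single \uHHHH or \UHHHHHHHH numeric escape.
_ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})")


def combine_surrogates(high, low):
    """Map a UTF-16 surrogate pair to the code point it represents."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a high/low surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


def split_into_surrogates(code_point):
    """Map a code point outside the BMP to its UTF-16 surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not outside the BMP")
    offset = code_point - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


def decode_numeric_escapes(text, combine_pairs=False):
    """Decode \\u and \\U escapes; surrogates are rejected unless paired
    and combine_pairs is set."""
    if combine_pairs:
        text = _PAIR.sub(
            lambda m: chr(combine_surrogates(int(m.group(1), 16),
                                             int(m.group(2), 16))),
            text)

    def replace(match):
        code_point = int(match.group(1) or match.group(2), 16)
        if 0xD800 <= code_point <= 0xDFFF:
            raise ValueError("surrogate U+%04X is not a character" % code_point)
        if code_point > 0x10FFFF:
            raise ValueError("escape is beyond U+10FFFF")
        return chr(code_point)

    return _ESCAPE.sub(replace, text)


print(hex(combine_surrogates(0xD864, 0xDD54)))             # 0x29154
print([hex(u) for u in split_into_surrogates(0x29154)])    # ['0xd864', '0xdd54']
print(decode_numeric_escapes(r"\uD864\uDD54", combine_pairs=True) == "\U00029154")  # True
```

With combine_pairs left at its default of False, the same input raises
ValueError, which corresponds to the "reject" option.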


> Kind Regards,
> 
> Giovanni Mels | Agfa HealthCare
> 
> http://www.agfahealthcare.com
> http://blog.agfahealthcare.com

-- 
-ericP

office: +1.617.599.3509
mobile: +33.6.80.80.35.59

(eric@w3.org)
Feel free to forward this message to any list for any purpose other than
email address distribution.

There are subtle nuances encoded in font variation and clever layout
which can only be seen by printing this message on high-clay paper.

Received on Monday, 23 May 2016 14:06:07 UTC