Should a Turtle parser handle UTF-16 surrogate pairs when processing numeric escapes in string literals and IRIs? from Giovanni Mels on 2016-05-23 (public-rdf-comments@w3.org from May 2016)

From: Giovanni Mels <giovanni.mels@agfa.com>
Date: Mon, 23 May 2016 12:40:38 +0200
To: public-rdf-comments@w3.org
Message-ID: <OFA8823E62.CB549217-ONC1257FBC.00393B0F-C1257FBC.003AA720@agfa.com>

For example, consider the string literal "\uD864\uDD54".

Is this allowed or not? Section 6.4 of the Turtle recommendation 
(https://www.w3.org/TR/turtle/) is not clear on this.

"A Unicode character in the range U+0000 to U+FFFF inclusive corresponding 
to the value encoded by the four hexadecimal digits interpreted from most 
significant to least significant digit."

The surrogate values (U+D800 to U+DFFF) fall into the range U+0000 to 
U+FFFF, but are not characters. A Turtle parser should either reject this, 
or parse it as "\U00029154".

Both approaches have precedent: in Java, 'String s = "\uD864\uDD54";' 
compiles, while in C++, 'std::string str = u8"\uD864\uDD54";' gives a 
compile error.
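
To make the two options concrete, here is a rough Python sketch of both 
behaviours. The function name decode_numeric_escapes and the 
combine_surrogates flag are made up for illustration and do not come from 
the Turtle recommendation or any existing parser. It only looks at \uXXXX 
and \UXXXXXXXX escapes and ignores the other escape forms (including 
escaped backslashes), so it is a simplification rather than a complete 
unescaping routine.

```python
import re

# Hypothetical helper, not part of any Turtle library.
# Matches \uXXXX (group 1) or \UXXXXXXXX (group 2).
_ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})")


def decode_numeric_escapes(text, combine_surrogates=True):
    """Decode numeric escapes in `text`.

    combine_surrogates=True  : a \\uHHHH high surrogate immediately followed
                               by a \\uLLLL low surrogate becomes one
                               supplementary character (the Java-like option).
    combine_surrogates=False : any surrogate escape is rejected
                               (the C++-like option).
    """
    out = []
    pos = 0
    pending_high = None  # high surrogate still waiting for its low partner
    for match in _ESCAPE.finditer(text):
        # Literal text between two escapes breaks a surrogate pair.
        if match.start() > pos and pending_high is not None:
            raise ValueError(f"unpaired high surrogate U+{pending_high:04X}")
        out.append(text[pos:match.start()])
        pos = match.end()

        is_short = match.group(1) is not None
        cp = int(match.group(1) or match.group(2), 16)

        if pending_high is not None:
            if is_short and 0xDC00 <= cp <= 0xDFFF:
                # Standard UTF-16 pair arithmetic.
                combined = (0x10000
                            + ((pending_high - 0xD800) << 10)
                            + (cp - 0xDC00))
                out.append(chr(combined))
                pending_high = None
                continue
            raise ValueError(f"unpaired high surrogate U+{pending_high:04X}")

        if 0xD800 <= cp <= 0xDFFF:
            if combine_surrogates and is_short and cp <= 0xDBFF:
                pending_high = cp
                continue
            raise ValueError(f"surrogate U+{cp:04X} is not a character")

        if cp > 0x10FFFF:
            raise ValueError(f"U+{cp:X} is outside the Unicode range")
        out.append(chr(cp))

    if pending_high is not None:
        raise ValueError(f"unpaired high surrogate U+{pending_high:04X}")
    out.append(text[pos:])
    return "".join(out)
```

With combine_surrogates=True the input \uD864\uDD54 comes out as the 
single character U+29154; with combine_surrogates=False the same input 
raises a ValueError. Which of the two a conforming parser should do is 
exactly the open question.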


Kind Regards,

Giovanni Mels | Agfa HealthCare

http://www.agfahealthcare.com
http://blog.agfahealthcare.com

Received on Monday, 23 May 2016 11:09:28 UTC