Re: Allowing \u escaped surrogate pairs

On 28 April 2026 18:03:04 GMT+01:00, "Peter F. Patel-Schneider" <pfpschneider@gmail.com> wrote:
>I'm for not allowing surrogates at all, keeping the situation unchanged.

I slightly prefer this option as well, although I would not object to supporting them.

If I understand i18n's argument correctly: careless implementations may produce Turtle with surrogate pairs, so we would be more robust in accepting them rather than rejecting them (Postel's law).

I would counter-argue that RDF 1.1 has been around for more than a decade and AFAIK this has never been a problem. But again, maybe that's because some careless parser implementations do decode them, despite what the spec says :)
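
To make the two behaviours concrete, here is a rough Python sketch of what a parser's numeric-escape decoding might look like under each option. It is only an illustration, not taken from any spec text or existing parser: the function name and the allow_surrogate_pairs flag are made up for this example, it handles only \uXXXX and \UXXXXXXXX (no other escapes such as \\ or \n), and it assumes a pair must be written as two adjacent \u escapes.

```python
import re

# Matches \uXXXX (group 1) or \UXXXXXXXX (group 2). Hypothetical helper, not spec text.
_ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})")


def decode_numeric_escapes(text, allow_surrogate_pairs=False):
    """Decode \\u and \\U escapes in `text`.

    allow_surrogate_pairs=False: any surrogate escape is an error (option 1).
    allow_surrogate_pairs=True: a high surrogate immediately followed by a
    low surrogate, both written with \\u, is combined into one scalar value;
    lone, reversed, or interrupted surrogates are still errors (option 2).
    """
    out = []
    pos = 0
    pending_high = None  # high surrogate waiting for its low half
    for m in _ESCAPE.finditer(text):
        if pending_high is not None and m.start() != pos:
            raise ValueError("lone high surrogate")
        out.append(text[pos:m.start()])
        pos = m.end()
        cp = int(m.group(1) or m.group(2), 16)
        if pending_high is not None:
            if m.group(1) is None or not (0xDC00 <= cp <= 0xDFFF):
                raise ValueError("high surrogate not followed by a low surrogate")
            # Standard UTF-16 pair-to-scalar arithmetic.
            cp = 0x10000 + ((pending_high - 0xD800) << 10) + (cp - 0xDC00)
            pending_high = None
        elif allow_surrogate_pairs and m.group(1) and 0xD800 <= cp <= 0xDBFF:
            pending_high = cp
            continue
        elif 0xD800 <= cp <= 0xDFFF:
            raise ValueError("surrogate code point U+%04X not allowed" % cp)
        elif cp > 0x10FFFF:
            raise ValueError("code point out of Unicode range")
        out.append(chr(cp))
    if pending_high is not None:
        raise ValueError("lone high surrogate")
    out.append(text[pos:])
    return "".join(out)
```

With the flag set, r"\uD83C\uDCA1" decodes to the single scalar value U+1F0A1; with the default, the same input is rejected. In both modes a lone \uD83C or a reversed \uDCA1\uD83C is an error.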

>My view is that software that allowed surrogates was non-compliant and should remain non-compliant.
>
>Adding a test for "correct" surrogate pairs is optional.
>
>peter
>
>
>On 4/28/26 9:58 AM, ddooss@wp.pl wrote:
>> Hi all,
>> 
>> 
>> It seems to preserve the RDF 1.2 model - strings still denote Unicode scalar values - while allowing the common UTF-16-style escape form for non-BMP characters, e.g. \uD83C\uDCA1, when the surrogate pair is well-formed.
>> 
>> So my mild preference would be:
>> 
>> accept a valid high-surrogate + low-surrogate pair and interpret it as the corresponding scalar value;
>> 
>> reject lone surrogates, reversed pairs, or malformed surrogate sequences.
>> 
>> That said, I would also be fine with option 1, since it is simpler, stricter, and seems closer to the conservative reading of the current text. Option 2 only seems preferable to me if we want to avoid rejecting data that is probably intended to represent a valid Unicode character.
>> 
>> 
>> Best,
>> 
>> Dominik
>> 
>>     *On 28 April 2026 14:16* Peter F. Patel-Schneider
>>     <pfpschneider@gmail.com> wrote:
>> 
>>     [I'm deliberately not putting this in the issue, because I want the issue to
>>     look clean.]
>> 
>>     As far as I can tell, surrogates are not allowed at all in RDF 1.1 Turtle.
>>     The reason is that numeric escape sequences represent Unicode code points
>>     that are Unicode characters.  This appears to be stated only in Section 6.4.
>> 
>>     So "\uD83C\uDCA1" is not valid in RDF 1.1 Turtle.
>> 
>>     Again as far as I can tell, RDF 1.2 Turtle liberalizes RDF 1.1 Turtle because
>>     it allows any non-surrogate Unicode code point for numeric escape sequences,
>>     not just Unicode characters.
>> 
>>     So "\uFFFE" is valid in RDF 1.2 Turtle, but not valid in RDF 1.1 Turtle.
>> 
>>     Does anyone disagree with my conclusions?
>> 
>>     peter
>> 
>> 
>> 
>> 
>>     On 4/28/26 4:26 AM, Andy Seaborne wrote:
>> 
>>         As promised at the last telecon, I put together a position for
>>         responding to the i18n wide review comment [1]
>> 
>>         https://github.com/w3c/rdf-turtle/issues/138
>> 
>>         Summary: support valid surrogate pairs written as \u escape sequences.
>> 
>>              Andy
>> 
>>         [1] https://github.com/w3c/rdf-turtle/issues/131
>>         https://github.com/w3c/rdf-trig/issues/60
>> 
>> 
>
>

Received on Tuesday, 28 April 2026 17:55:14 UTC