Re: agenda+ Escape sequence for a surrogate in rdf-turtle

Hi Addison,

On 2026-04-08 23:42, Addison Phillips wrote:
> On 4/7/2026 10:12 PM, Fuqiao Xue wrote:
>> https://github.com/w3c/rdf-turtle/issues/131
>> 
>> We discussed this issue last week, but I would still like to discuss 
>> the specific reasons behind this.
>> 
>> In addition, an escape sequence for a surrogate was actually already 
>> prohibited in the RDF 1.1 tests. It is not a new restriction 
>> introduced in RDF 1.2, even though the 1.1 standard itself does not 
>> appear to explicitly mention it.
> 
> I'm not sure whether they are "prohibited in the RDF 1.1 tests" (I 
> can't find the specific test this morning). What is probably prohibited 
> are *isolated* surrogates. Prohibiting *paired* surrogates breaks the 
> \u syntax (if you expect it to encode supplementary characters). This 
> syntax is widely used (Java, JavaScript, etc.) and those 
> implementations tend to be relaxed, so this is a potential tripping 
> hazard.
> 
> In any case, the 1.2 specification only allows the \u syntax to encode 
> "Unicode code points" (we like "Unicode code points") in the BMP. It 
> disallows encoding supplementary characters (by using a surrogate 
> pair). 1.1 was unclear because it said "character" instead of "code 
> point". Surrogate pairs might be encoded if "character" were 
> interpreted (wrongly) as "UTF-16 code unit". The 1.2 change is a 
> distinct improvement, in terms of specificity/clarity, but brings us to 
> the problem of potentially breaking 1.1 content that is otherwise fully 
> functional.
> 
> Some implementations of the escape encoder from UTF-16 code units don't 
> check if a surrogate is isolated or not. Decoders properly should turn 
> isolated surrogates into U+FFFD (although some do not). Mixing u/U in a 
> single string isn't usually done (e.g. \u0067\u00c0\U0001F436\u00c7), 
> but is what RDF Turtle is trying to require.
> 
> The new text uses "Unicode code point" in the way I18N would recommend. 
> But it does not call out why this is special, so that implementers are 
> careful. And there should be some care exercised to ensure we don't 
> break things as we improve them.

Indeed.

Based on their tests at 
https://w3c.github.io/rdf-tests/rdf/rdf11/rdf-turtle/#turtle-syntax-bad-numeric-escape-01 
for RDF 1.1, it appears they only tested isolated surrogates, not 
paired surrogates. It remains unclear to me how paired surrogates 
behave in Turtle, at least for RDF 1.1.
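
To make the isolated/paired distinction concrete, here is a minimal 
Python sketch of my own. It is not taken from the Turtle specification, 
the test suite, or any implementation, and the names (classify_escapes 
and the kind labels) are hypothetical. It only classifies numeric 
escapes; whether a parser should accept, reject, or replace each kind 
is exactly the open question. It also assumes the input contains no 
escaped backslashes (e.g. \\u0041), which a real tokenizer would have 
to handle first.

```python
import re

# Matches \uXXXX (4 hex digits) or \UXXXXXXXX (8 hex digits).
NUMERIC_ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})")


def is_high_surrogate(cp):
    return 0xD800 <= cp <= 0xDBFF


def is_low_surrogate(cp):
    return 0xDC00 <= cp <= 0xDFFF


def classify_escapes(text):
    """Return a list of (kind, code_point) pairs, one per numeric escape.

    kind is one of:
      "scalar"             - an ordinary code point (not a surrogate)
      "paired-surrogate"   - a high surrogate escape immediately followed
                             by a low surrogate escape; code_point is the
                             supplementary code point the pair would
                             denote under UTF-16 rules
      "isolated-surrogate" - a surrogate escape without a partner
      "out-of-range"       - a value above U+10FFFF
    """
    matches = list(NUMERIC_ESCAPE.finditer(text))
    values = [int(m.group(1) or m.group(2), 16) for m in matches]
    result = []
    i = 0
    while i < len(matches):
        cp = values[i]
        # The next escape only counts as a partner if it is directly adjacent.
        has_adjacent_next = (
            i + 1 < len(matches)
            and matches[i + 1].start() == matches[i].end()
        )
        if (is_high_surrogate(cp) and has_adjacent_next
                and is_low_surrogate(values[i + 1])):
            combined = (0x10000 + ((cp - 0xD800) << 10)
                        + (values[i + 1] - 0xDC00))
            result.append(("paired-surrogate", combined))
            i += 2
        elif is_high_surrogate(cp) or is_low_surrogate(cp):
            result.append(("isolated-surrogate", cp))
            i += 1
        elif cp > 0x10FFFF:
            result.append(("out-of-range", cp))
            i += 1
        else:
            result.append(("scalar", cp))
            i += 1
    return result
```

With this sketch, \uD83D\uDC36 is reported as a paired surrogate that 
would denote U+1F436 (the same code point as \U0001F436 in Addison's 
example), while a lone \uD800 is reported as an isolated surrogate.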

Fuqiao

> Addison

Received on Thursday, 9 April 2026 01:09:56 UTC