Re: Surrogate Code Points in Tests? from Richard Cyganiak on 2013-05-17 (public-rdf-comments@w3.org from May 2013)

From: Richard Cyganiak <richard@cyganiak.de>
Date: Fri, 17 May 2013 20:45:16 +0100
To: Alex Milowski <alex@milowski.com>
Cc: Eric Prud'hommeaux <eric@w3.org>, "public-rdf-comments@w3.org" <public-rdf-comments@w3.org>
Message-Id: <74AA8676-9E43-492B-9AA4-70513F5CDE64@cyganiak.de>
I understand only half of what's being discussed in this thread. Maybe this is relevant: RDF Concepts says that IRIs and literals SHOULD be in Unicode normal form C.

Apologies if that's not relevant here.

Best,
Richard


On 17 May 2013, at 19:17, Alex Milowski <alex@milowski.com> wrote:

> The attached file (UTF-8 version) is essentially the same as the one I got from the test suite with just slightly different line endings.  
> 
> The problem here is that U+10000 and U+EFFFF are not representable in UCS-2, and it appears that most browsers use UCS-2 at some level for strings in JavaScript.  As such, these code points are turned into surrogate pairs (U+D800 + U+DC00 and U+DB7F + U+DFFF), which aren't legal code points in UCS-2.  The surrogate characters are also not allowed by the productions for prefixes and other such names.
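
For reference, the mapping from those two code points to the surrogate pairs quoted above can be sketched in JavaScript. This is only an illustration: `toSurrogatePair` and `asHex` are made-up helper names, and `String.fromCodePoint` / `codePointAt` assume an ES2015-capable engine, which not every 2013 browser would have been.

```javascript
// Split a supplementary code point (above U+FFFF) into its UTF-16 surrogate pair.
// toSurrogatePair is a hypothetical helper name, not part of any library.
function toSurrogatePair(codePoint) {
  if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
    throw new RangeError("not a supplementary code point");
  }
  const offset = codePoint - 0x10000;     // 20-bit value
  const high = 0xD800 + (offset >> 10);   // top 10 bits -> high surrogate
  const low = 0xDC00 + (offset & 0x3FF);  // bottom 10 bits -> low surrogate
  return [high, low];
}

const asHex = (n) => n.toString(16).toUpperCase();

console.log(toSurrogatePair(0x10000).map(asHex).join(" ")); // D800 DC00
console.log(toSurrogatePair(0xEFFFF).map(asHex).join(" ")); // DB7F DFFF

// A JavaScript string holds the same two 16-bit code units:
const s = String.fromCodePoint(0xEFFFF);
console.log(s.length);                 // 2
console.log(asHex(s.charCodeAt(0)));   // DB7F
console.log(asHex(s.codePointAt(0)));  // EFFFF
```

The last three lines show why a parser that walks the string one `charCodeAt` unit at a time sees two surrogates rather than one character in [#x10000-#xEFFFF].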
> 
> The test is unusable in environments, like most current browsers, that produce surrogate pairs from the UTF-8 encoding.  That is, it is unusable without some enhanced level of support for surrogates.
> 
> There are a number of options to resolve this:
> 
>    1. The test will just fail for such environments without special support for surrogates.
>    2. We mark this test as requiring surrogate / UTF-16 handling.  Passing the test is a quality-of-implementation question.
>    3. Such environments are expected to check for pairs of surrogate values in place of the code points [#x10000-#xEFFFF] even though the resulting identifier is unlikely to be generally usable.
> 
> Option (1) means I just fail the test and move on. I'm not particularly partial to this.
> 
> Option (2) allows greater diversity of processor environments but certainly runs counter to promoting proper Unicode support.  It would be good to have a variant that tests only those in the Basic Multilingual Plane.
> 
> Option (3) allows a processor to pass and punts the use of the code point back to the user.  In the case of these particular code points and identifiers, the user of the result will have similar issues when inspecting or using values from the processor.  This is no different for data received from other places that contains the same code points, as JavaScript will treat them all the same and produce surrogate pairs.
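
A rough sketch of what option (3) could look like in JavaScript: the range [#x10000-#xEFFFF] corresponds exactly to a high surrogate in U+D800..U+DB7F followed by any low surrogate in U+DC00..U+DFFF, so a pair-based character class can stand in for it. The names `PAIR_FORM`, `CODE_POINT_FORM` and `isSupplementaryNameChar` are invented for this example, and the `u` flag on the second pattern assumes an ES2015 engine.

```javascript
// Pair-based check for one character in [#x10000-#xEFFFF], usable on engines
// that only see 16-bit code units. All names here are hypothetical.
const PAIR_FORM = /^[\uD800-\uDB7F][\uDC00-\uDFFF]$/;

// The same range written against code points (needs the ES2015 "u" flag).
const CODE_POINT_FORM = /^[\u{10000}-\u{EFFFF}]$/u;

function isSupplementaryNameChar(ch) {
  return PAIR_FORM.test(ch);
}

console.log(isSupplementaryNameChar("\uD800\uDC00")); // true  (U+10000)
console.log(isSupplementaryNameChar("\uDB7F\uDFFF")); // true  (U+EFFFF)
console.log(isSupplementaryNameChar("\uDB80\uDC00")); // false (U+F0000, outside the range)
console.log(isSupplementaryNameChar("\uD800"));       // false (lone surrogate)
```

Requiring a well-formed pair, rather than accepting any code unit in U+D800..U+DFFF, keeps lone surrogates rejected, which is consistent with the productions not allowing surrogate characters on their own.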
> 
> There absolutely needs to be much more documentation attached to this test.  The identifiers (e.g. prefix) are constructed by taking the first and last character of each range in the PN_CHARS_BASE production.  I have to admit I really didn't recognize that until much later in my research into why this test didn't pass.  In addition, the general issue of UCS-2 vs. UTF-16 representations and surrogates needs to be at least referenced in the description.
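
The construction described above can be reproduced with a short script. This is a sketch, not the script that generated the test: the ranges are the PN_CHARS_BASE ranges from the Turtle grammar, while `PN_CHARS_BASE_RANGES`, `boundaryCodePoints` and `boundaryPrefix` are names made up here, and `Array.prototype.flat` assumes a fairly recent engine.

```javascript
// [first, last] code point of each range in the PN_CHARS_BASE production.
const PN_CHARS_BASE_RANGES = [
  [0x41, 0x5A], [0x61, 0x7A], [0xC0, 0xD6], [0xD8, 0xF6],
  [0xF8, 0x2FF], [0x370, 0x37D], [0x37F, 0x1FFF], [0x200C, 0x200D],
  [0x2070, 0x218F], [0x2C00, 0x2FEF], [0x3001, 0xD7FF], [0xF900, 0xFDCF],
  [0xFDF0, 0xFFFD], [0x10000, 0xEFFFF],
];

// Take the first and last character of each range, in order.
const boundaryCodePoints = PN_CHARS_BASE_RANGES.flat();
const boundaryPrefix = String.fromCodePoint(...boundaryCodePoints);

console.log(boundaryCodePoints.length);         // 28 code points
console.log(Array.from(boundaryPrefix).length); // 28 (Array.from iterates by code point)
console.log(boundaryPrefix.length);             // 30 (the last two code points take two units each)
```

The gap between 28 and 30 is the whole issue in miniature: only the final range contributes surrogate pairs, so the rest of the identifier is unaffected.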
> 
> It would be good for everyone else to avoid a lot of the research I just did, with the help of a well-written test description.
> 
> BTW, there's a good write-up of JavaScript's issues at [1]
> 
> [1] http://mathiasbynens.be/notes/javascript-encoding
> 
> 
> On Fri, May 17, 2013 at 4:37 AM, Eric Prud'hommeaux <eric@w3.org> wrote:
>> * Alex Milowski <alex@milowski.com> [2013-05-16 23:44-0700]
>> > In looking at test:
>> >
>> >    prefix_with_PN_CHARS_BASE_character_boundaries.ttl  [1]
>> >
>> > There are the code points u+dc00, u+db7f, and u+dfff in the last part of
>> > the prefix.  The code points u+d800-u+dfff are not valid unicode characters.
>> >
>> > Why are these in a positive test?
>> >
>> > [1]
>> > https://dvcs.w3.org/hg/rdf/raw-file/default/rdf-turtle/tests-ttl/prefix_with_PN_CHARS_BASE_character_boundaries.ttl
>> 
>> In the prefix, I see these codepoints: u+41 u+5a u+61 u+7a u+c0 u+d6
>> u+d8 u+f6 u+f8 u+2ff u+370 u+37d u+37f u+1fff u+200c u+200d u+2070
>> u+218f u+2c00 u+2fef u+3001 u+d7ff u+f900 u+fdcf u+fdf0 u+fffd u+10000
>> u+effff .
>> 
>> I've attached two variants of the test, one encoded in UTF-8, which is
>> legal turtle and has only the codepoints above, and one in UTF-16, which
>> uses surrogates to encode the codepoints u+10000 and u+effff . Is it
>> possible that your buffer got re-encoded as UTF-16 at some point?
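
To make the difference between the two attached variants concrete, here is a small sketch of how the two code points in question look in each encoding. `utf8Hex` and `utf16Hex` are hypothetical helper names; `TextEncoder` (which always produces UTF-8) is assumed to be available as a global, as it is in current browsers and Node.js.

```javascript
// Show the UTF-8 bytes and the UTF-16 code units of a string as hex.
const utf8Encoder = new TextEncoder(); // TextEncoder always encodes to UTF-8

function utf8Hex(str) {
  return Array.from(utf8Encoder.encode(str), (b) =>
    b.toString(16).toUpperCase().padStart(2, "0")
  ).join(" ");
}

function utf16Hex(str) {
  const units = [];
  for (let i = 0; i < str.length; i++) {
    units.push(str.charCodeAt(i).toString(16).toUpperCase().padStart(4, "0"));
  }
  return units.join(" ");
}

const first = String.fromCodePoint(0x10000);
const last = String.fromCodePoint(0xEFFFF);

console.log(utf8Hex(first));  // F0 90 80 80
console.log(utf16Hex(first)); // D800 DC00
console.log(utf8Hex(last));   // F3 AF BF BF
console.log(utf16Hex(last));  // DB7F DFFF
```

In UTF-8 each of the two code points is a single four-byte sequence with no surrogates involved; the surrogate values only appear once the text is held as UTF-16 code units.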
>> 
>> If this message resolves your comment, please reply with "[RESOLVED]"
>> in the subject.
>> 
>> > --
>> > --Alex Milowski
>> > "The excellence of grammar as a guide is proportional to the paucity of the
>> > inflexions, i.e. to the degree of analysis effected by the language
>> > considered."
>> >
>> > Bertrand Russell in a footnote of Principles of Mathematics
>> 
>> --
>> -ericP
> 
> 
> 
> -- 
> --Alex Milowski
> "The excellence of grammar as a guide is proportional to the paucity of the
> inflexions, i.e. to the degree of analysis effected by the language
> considered."
> 
> Bertrand Russell in a footnote of Principles of Mathematics
Received on Friday, 17 May 2013 19:45:44 UTC