Re: Surrogate Code Points in Tests? from Alex Milowski on 2013-05-17 (public-rdf-comments@w3.org from May 2013)

From: Alex Milowski <alex@milowski.com>
Date: Fri, 17 May 2013 14:16:23 -0700
To: Richard Cyganiak <richard@cyganiak.de>
Cc: "Eric Prud'hommeaux" <eric@w3.org>, "public-rdf-comments@w3.org" <public-rdf-comments@w3.org>
Message-ID: <CABp3FN+yGzdwMjFCMV+Rw8PowUJAGLoNQGf6GT1ehwyHqRk_-Q@mail.gmail.com>
Just to clarify, the problem isn't really with the test case.  It properly
encodes U+10000 and U+EFFFF as UTF-8.

The problem is that receiving systems often don't support code points
outside of the BMP.  In this case, the Javascript engines in common browsers
(Chrome, Safari, Firefox) are very good examples: they fail, probably due to
the way strings are implemented.
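
To make the string behaviour concrete, here is a minimal TypeScript sketch
of the standard UTF-16 arithmetic that splits a code point above the BMP
into two surrogate code units.  The helper name toSurrogatePair is made up
for illustration and is not part of any test suite or library.

```typescript
// Hypothetical helper (illustration only): splits a supplementary code
// point (U+10000..U+10FFFF) into its UTF-16 high and low surrogates.
function toSurrogatePair(codePoint: number): [number, number] {
  if (codePoint < 0x10000 || codePoint > 0x10ffff) {
    throw new RangeError("not a supplementary code point");
  }
  const offset = codePoint - 0x10000;     // 20-bit value
  const high = 0xd800 + (offset >> 10);   // top 10 bits
  const low = 0xdc00 + (offset & 0x3ff);  // bottom 10 bits
  return [high, low];
}

const hex = (n: number): string => n.toString(16).toUpperCase();

console.log(toSurrogatePair(0x10000).map(hex)); // the pair D800, DC00
console.log(toSurrogatePair(0xeffff).map(hex)); // the pair DB7F, DFFF

// A string built from that pair holds U+10000 but reports two code units.
const [hi, lo] = toSurrogatePair(0x10000);
const s = String.fromCharCode(hi, lo);
console.log(s.length); // 2
```

A parser that compares individual code units against [#x10000-#xEFFFF]
never sees a value above 0xFFFF in such a string, which would explain the
failure.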

I think my option (3) is the best solution in these kinds of environments.
I just implemented checking for surrogate pairs and it seems to work fine.
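
A check along these lines could look roughly like the TypeScript sketch
below.  This is an illustration under stated assumptions, not the actual
implementation: the function name isAstralNameCharAt is hypothetical, and
it assumes the input is a string of UTF-16 code units.

```typescript
// Hypothetical helper (illustration only): true when the two UTF-16 code
// units starting at `index` form a surrogate pair that encodes a code
// point in the range [#x10000-#xEFFFF].
function isAstralNameCharAt(text: string, index: number): boolean {
  const high = text.charCodeAt(index);    // NaN when index is out of range
  const low = text.charCodeAt(index + 1);
  // High surrogates D800..DB7F combined with any low surrogate DC00..DFFF
  // cover exactly the code points U+10000..U+EFFFF.
  return high >= 0xd800 && high <= 0xdb7f && low >= 0xdc00 && low <= 0xdfff;
}

const first = "\ud800\udc00";   // U+10000
const last = "\udb7f\udfff";    // U+EFFFF
const outside = "\udb80\udc00"; // U+F0000, just past the range

console.log(isAstralNameCharAt(first, 0));   // true
console.log(isAstralNameCharAt(last, 0));    // true
console.log(isAstralNameCharAt(outside, 0)); // false
console.log(isAstralNameCharAt("A", 0));     // false
```

When the check succeeds, the caller would advance by two code units rather
than one.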


On Fri, May 17, 2013 at 12:45 PM, Richard Cyganiak <richard@cyganiak.de> wrote:

> I understand only half of what's being discussed in this thread. Maybe
> this is relevant: RDF Concepts says that IRIs and literals SHOULD be in
> Unicode normal form C.
>
> Apologies if that's not relevant here.
>
> Best,
> Richard
>
>
> On 17 May 2013, at 19:17, Alex Milowski <alex@milowski.com> wrote:
>
> The attached file (UTF-8 version) is essentially the same as the one I got
> from the test suite with just slightly different line endings.
>
> The problem here is that U+10000 and U+EFFFF are not representable in
> UCS-2 and it appears that most browsers are using UCS-2 at some level for
> the strings in Javascript.  As such, these code points are turned into
> surrogate pairs (U+D800 + U+DC00 and U+DB7F + U+DFFF) which aren't actually
> part of UCS-2 as legal code points.  The surrogate characters are also not
> allowed by the productions in prefixes and other such names.
>
> The test is unusable in an environment, like most current browsers, that
> produces surrogate pairs from the UTF-8 encoding.  That is, unusable without
> some enhanced level of support for surrogates.
>
> There are a number of options to resolve this:
>
>    1. The test will just fail for such environments without special
> support for surrogates.
>    2. We mark this test as requiring surrogate / UTF-16 handling.  Passing
> the test is a quality-of-implementation question.
>    3. Such environments are expected to check for pairs of surrogate
> values in place of the code points [#x10000-#xEFFFF] even though the
> resulting identifier is unlikely to be generally usable.
>
> Option (1) means I just fail the test and move on. I'm not particularly
> partial to this.
>
> Option (2) allows greater diversity of processor environments but
> certainly runs counter to promoting proper Unicode support.  It would be
> good to have a variant that tests only those in the Basic Multilingual
> Plane.
>
> Option (3) allows a processor to pass and punts the use of the code point
> back to the user.  In the case of these particular code points and
> identifiers, the user of the result will have similar issues when
> inspecting or using values from the processor.  This will be no different
> for data received from other places that has the same code points as
> Javascript will treat them all the same and produce surrogate pairs.
>
> There absolutely needs to be much more documentation attached to this
> test.  The identifiers (e.g. prefix) are constructed by taking the first
> and last character of each range in the PN_CHARS_BASE production.  I have
> to admit I really didn't recognize that until much later in my research
> into why this test didn't pass.  In addition, the general issue of UCS-2
> vs UTF-16 representations and surrogates needs to be minimally referenced
> in the description.
>
> It would be good for everyone else to avoid a lot of the research I just did
> with a well-written test description.
>
> BTW, there's a good write up of Javascript's issues at [1]
>
> [1] http://mathiasbynens.be/notes/javascript-encoding
>
>
> On Fri, May 17, 2013 at 4:37 AM, Eric Prud'hommeaux <eric@w3.org> wrote:
>
>> * Alex Milowski <alex@milowski.com> [2013-05-16 23:44-0700]
>> > In looking at test:
>> >
>> >    prefix_with_PN_CHARS_BASE_character_boundaries.ttl  [1]
>> >
>> > There are the code points u+dc00, u+db7f, and u+dfff in the last part of
>> > the prefix.  The code points u+d800-u+dfff are not valid unicode
>> > characters.
>> >
>> > Why are these in a positive test?
>> >
>> > [1] https://dvcs.w3.org/hg/rdf/raw-file/default/rdf-turtle/tests-ttl/prefix_with_PN_CHARS_BASE_character_boundaries.ttl
>>
>> In the prefix, I see these codepoints: u+41 u+5a u+61 u+7a u+c0 u+d6
>> u+d8 u+f6 u+f8 u+2ff u+370 u+37d u+37f u+1fff u+200c u+200d u+2070
>> u+218f u+2c00 u+2fef u+3001 u+d7ff u+f900 u+fdcf u+fdf0 u+fffd u+10000
>> u+effff .
>>
>> I've attached two variants of the test, one encoded in UTF-8, which is
>> legal turtle and has only the codepoints above, and UTF-16, which uses
>> surrogates to encode the codepoints u+10000 and u+effff . Is it
>> possible that your buffer got re-encoded as UTF-16 at some point?
>>
>> If this message resolves your comment, please reply with "[RESOLVED]"
>> in the subject.
>>
>> > --
>> > --Alex Milowski
>> > "The excellence of grammar as a guide is proportional to the paucity of
>> > the inflexions, i.e. to the degree of analysis effected by the language
>> > considered."
>> >
>> > Bertrand Russell in a footnote of Principles of Mathematics
>>
>> --
>> -ericP
>>
>
>
>
> --
> --Alex Milowski
> "The excellence of grammar as a guide is proportional to the paucity of the
> inflexions, i.e. to the degree of analysis effected by the language
> considered."
>
> Bertrand Russell in a footnote of Principles of Mathematics
>
>


-- 
--Alex Milowski
"The excellence of grammar as a guide is proportional to the paucity of the
inflexions, i.e. to the degree of analysis effected by the language
considered."

Bertrand Russell in a footnote of Principles of Mathematics
Received on Friday, 17 May 2013 21:16:57 UTC