[RESOLVED] Re: Surrogate Code Points in Tests? from Alex Milowski on 2013-11-10 (public-rdf-comments@w3.org from November 2013)

From: Alex Milowski <alex@milowski.com>
Date: Sat, 9 Nov 2013 19:05:12 -0800
To: "Eric Prud'hommeaux" <eric@w3.org>
Cc: "public-rdf-comments@w3.org" <public-rdf-comments@w3.org>
Message-ID: <CABp3FNKvhsiYFHM4M_TtGHxP_-n3F+g5dRjz3gRjacRAQ7bZkA@mail.gmail.com>
This seems quite sufficient.  I'd like to re-check my implementation
against whatever changes were made to the test suite, but I doubt I
can make that happen until sometime after mid-December.

On Sat, Nov 2, 2013 at 5:02 PM, Eric Prud'hommeaux <eric@w3.org> wrote:
> * Alex Milowski <alex@milowski.com> [2013-05-17 11:17-0700]
>> The attached file (UTF-8 version) is essentially the same as the one I got
>> from the test suite with just slightly different line endings.
>>
>> The problem here is that U+10000 and U+EFFFF are not representable in UCS-2
>> and it appears that most browsers are using UCS-2 at some level for the
>> strings in Javascript.  As such, these code points are turned into surrogate
>> pairs (U+D800 + U+DC00 and U+DB7F + U+DFFF) which aren't actually part of
>> UCS-2 as legal code points.  The surrogate characters are also not allowed
>> by the productions in prefixes and other such names.
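
For anyone following along, the surrogate arithmetic behind those two pairs can be checked with a minimal sketch like the one below. The helper name to_surrogate_pair is hypothetical (it is not part of the test suite or any implementation discussed here); it just applies the standard UTF-16 split of a supplementary code point into a high and a low surrogate.

```python
from typing import Tuple


def to_surrogate_pair(code_point: int) -> Tuple[int, int]:
    """Split a supplementary code point into its UTF-16 (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need a surrogate pair")
    offset = code_point - 0x10000      # 20-bit value
    high = 0xD800 + (offset >> 10)     # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits
    return high, low


for cp in (0x10000, 0xEFFFF):
    high, low = to_surrogate_pair(cp)
    print(f"U+{cp:X} -> U+{high:04X} U+{low:04X}")
# U+10000 -> U+D800 U+DC00
# U+EFFFF -> U+DB7F U+DFFF
```
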
>>
>> The test is unusable in an environment, like most current browsers, that
>> produces surrogate pairs from the UTF-8 encoding.  That is, unusable without
>> some enhanced level of support for surrogates.
>>
>> There are a number of options to resolve this:
>>
>>    1. The test will just fail for such environments without special support
>> for surrogates.
>>    2. We mark this test as requiring surrogate / UTF-16 handling.  Passing
>> the test is a quality-of-implementation question.
>>    3. Such environments are expected to check for pairs of surrogate values
>> in place of the code points [#x10000-#xEFFFF] even though the resulting
>> identifier is unlikely to be generally usable.
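
As a rough illustration of what option (3) would ask of a processor, here is a sketch, under my own assumptions, of matching a surrogate pair in place of a code point in [#x10000-#xEFFFF]. The function names utf16_units and match_supplementary are hypothetical, and the list of integers stands in for the 16-bit string units a UCS-2 style environment exposes.

```python
from typing import List


def utf16_units(text: str) -> List[int]:
    """Return the UTF-16 code units of text, as a 16-bit string type would hold them."""
    data = text.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


def match_supplementary(units: List[int], i: int) -> bool:
    """True if units[i], units[i + 1] are a surrogate pair for [#x10000-#xEFFFF]."""
    if i + 1 >= len(units):
        return False
    high, low = units[i], units[i + 1]
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        return False
    # Recombine the pair and apply the upper bound of the production's range.
    code_point = 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
    return code_point <= 0xEFFFF


units = utf16_units("\U00010000\U000EFFFF")
print([f"{u:04X}" for u in units])      # ['D800', 'DC00', 'DB7F', 'DFFF']
print(match_supplementary(units, 0))    # True
print(match_supplementary(units, 1))    # False (starts on a low surrogate)
```

A real tokenizer would also have to advance by two units after a match; that bookkeeping is left out here.
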
>>
>> Option (1) means I just fail the test and move on. I'm not particularly
>> partial to this.
>>
>> Option (2) allows greater diversity of processor environments but certainly
>> runs counter to promoting proper Unicode support.  It would be good to have
>> a variant that tests only those in the Basic Multilingual Plane.
>>
>> Option (3) allows a processor to pass and punts the use of the code point
>> back to the user.  In the case of these particular code points and
>> identifiers, the user of the result will have similar issues when
>> inspecting or using values from the processor.  This will be no different
>> for data received from other places that has the same code points as
>> Javascript will treat them all the same and produce surrogate pairs.
>>
>> There absolutely needs to be much more documentation attached to this test.
>>  The identifiers (e.g. prefix) are constructed by taking the first and last
>> character of each range in the PN_CHARS_BASE production.  I have to admit I
>> really didn't recognize that until much later in my research into why this
>> test didn't pass.  In addition, the general issue of UCS-2  vs UTF-16
>> representations and surrogates needs to be minimally referenced in the
>> description.
>>
>> It would be good for everyone else to avoid a lot of the research I just did
>> with a well-written test description.
>>
>> BTW, there's a good write up of Javascript's issues at [1]
>>
>> [1] http://mathiasbynens.be/notes/javascript-encoding
>
> Dear Alex,
>
> On 17 May, in
> <http://www.w3.org/mid/CABp3FNKKtN8vac7=7gjzL2Y_KfH9wq0p56zAA5KAaO8Rv-SmEg@mail.gmail.com>,
> you proposed 3 alternatives for addressing characters outside of the
> range of #0000-#FFFD. Your third choice got support from others on rdf-comments:
> [[
>  3. Such environments are expected to check for pairs of surrogate values
> in place of the code points [#x10000-#xEFFFF] even though the resulting
> identifier is unlikely to be generally usable.
> ]]
>
> In response to this, we added the following text to the README in the
> turtle test suite:
> [[
> CHARACTER ENCODING:
>
> The Turtle language uses UTF-8 encoding. The following tests include
> non-ascii characters:
>   localName_with_assigned_nfc_bmp_PN_CHARS_BASE_character_boundaries
>   localName_with_assigned_nfc_PN_CHARS_BASE_character_boundaries *
>   localName_with_nfc_PN_CHARS_BASE_character_boundaries *
>   labeled_blank_node_with_PN_CHARS_BASE_character_boundaries *
>   LITERAL1_with_UTF8_boundaries *
>   LITERAL_LONG1_with_UTF8_boundaries *
>   LITERAL2_with_UTF8_boundaries *
>   LITERAL_LONG2_with_UTF8_boundaries *
>
> Those marked with a * include characters with codepoints greater than
> U+FFFD and are thus expressed as a pair of surrogate characters when
> represented in UCS2.
> ]]
>
> This did annotate the tests with non-BMP characters but did neither of:
>
>   Define comparison in terms of surrogate pairs (instead of code points).
>   Indicate that those were not required for conformance.
>
> Noting that none of the test submitters failed those tests:
>   <http://www.w3.org/2011/rdf-wg/wiki/Turtle_Candidate_Recommendation_Comments#Implementations>
> including your own implementation report:
>   <http://www.w3.org/mid/CABp3FNLOOHZHwVSUdUK09Kmp_yujfXZmKOeNBC4TC5V8iUp1Nw@mail.gmail.com>
> I propose to close this comment as satisfactorily addressed by the
> additional comments in README. If you disagree, please indicate what
> would satisfy your comment. If you agree, please respond with the
> subject prefixed by "[RESOLVED]".
>
>
>> On Fri, May 17, 2013 at 4:37 AM, Eric Prud'hommeaux <eric@w3.org> wrote:
>>
>> > * Alex Milowski <alex@milowski.com> [2013-05-16 23:44-0700]
>> > > In looking at test:
>> > >
>> > >    prefix_with_PN_CHARS_BASE_character_boundaries.ttl  [1]
>> > >
>> > > There are the code points u+dc00, u+db7f, and u+dfff in the last part of
>> > > the prefix.  The code points u+d800-u+dfff are not valid unicode
>> > > characters.
>> > >
>> > > Why are these in a positive test?
>> > >
>> > > [1]
>> > > https://dvcs.w3.org/hg/rdf/raw-file/default/rdf-turtle/tests-ttl/prefix_with_PN_CHARS_BASE_character_boundaries.ttl
>> >
>> > In the prefix, I see these codepoints: u+41 u+5a u+61 u+7a u+c0 u+d6
>> > u+d8 u+f6 u+f8 u+2ff u+370 u+37d u+37f u+1fff u+200c u+200d u+2070
>> > u+218f u+2c00 u+2fef u+3001 u+d7ff u+f900 u+fdcf u+fdf0 u+fffd u+10000
>> > u+effff .
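
A quick way to sanity-check that list is sketched below. It uses only the code points quoted above; the assumption (mine, following the description elsewhere in this thread) is that consecutive entries are the first and last character of one PN_CHARS_BASE range. It confirms that none of the listed code points is itself a surrogate, and that only the last two need surrogate pairs once the string is held as UTF-16.

```python
# Code points listed for the prefix, in the order given above.
CODE_POINTS = [
    0x41, 0x5A, 0x61, 0x7A, 0xC0, 0xD6, 0xD8, 0xF6, 0xF8, 0x2FF,
    0x370, 0x37D, 0x37F, 0x1FFF, 0x200C, 0x200D, 0x2070, 0x218F,
    0x2C00, 0x2FEF, 0x3001, 0xD7FF, 0xF900, 0xFDCF, 0xFDF0, 0xFFFD,
    0x10000, 0xEFFFF,
]

# Assumption: each consecutive pair is the (first, last) of one range.
ranges = list(zip(CODE_POINTS[0::2], CODE_POINTS[1::2]))

# None of the listed code points falls in the surrogate block U+D800-U+DFFF.
has_surrogates = any(0xD800 <= cp <= 0xDFFF for cp in CODE_POINTS)

# Only code points above U+FFFF become surrogate pairs in UTF-16.
supplementary = [cp for cp in CODE_POINTS if cp > 0xFFFF]

prefix = "".join(chr(cp) for cp in CODE_POINTS)
utf16_unit_count = len(prefix.encode("utf-16-be")) // 2

print(len(CODE_POINTS), len(ranges))        # 28 14
print(has_surrogates)                       # False
print([f"U+{cp:X}" for cp in supplementary])  # ['U+10000', 'U+EFFFF']
print(utf16_unit_count)                     # 30
```

The 30 UTF-16 units for 28 code points account for the extra surrogates a 16-bit string environment would report.
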
>> >
>> > I've attached two variants of the test, one encoded in UTF-8, which is
>> > legal turtle and has only the codepoints above, and UTF-16, which uses
>> > surrogates to encode the codepoints u+10000 and u+effff . Is it
>> > possible that your buffer got re-encoded as UTF-16 at some point?
>> >
>> > If this message resolves your comment, please reply with "[RESOLVED]"
>> > in the subject.
>> >
>> > > --
>> > > --Alex Milowski
>> > > "The excellence of grammar as a guide is proportional to the paucity of the
>> > > inflexions, i.e. to the degree of analysis effected by the language
>> > > considered."
>> > >
>> > > Bertrand Russell in a footnote of Principles of Mathematics
>> >
>> > --
>> > -ericP
>> >
>>
>>
>>
>> --
>> --Alex Milowski
>> "The excellence of grammar as a guide is proportional to the paucity of the
>> inflexions, i.e. to the degree of analysis effected by the language
>> considered."
>>
>> Bertrand Russell in a footnote of Principles of Mathematics
>
> --
> -ericP
>
> office: +1.617.599.3509
> mobile: +33.6.80.80.35.59
>
> (eric@w3.org)
> Feel free to forward this message to any list for any purpose other than
> email address distribution.
>
> There are subtle nuances encoded in font variation and clever layout
> which can only be seen by printing this message on high-clay paper.



-- 
--Alex Milowski
"The excellence of grammar as a guide is proportional to the paucity of the
inflexions, i.e. to the degree of analysis effected by the language
considered."

Bertrand Russell in a footnote of Principles of Mathematics
Received on Sunday, 10 November 2013 03:05:40 UTC