Re: Surrogate Code Points in Tests? from Alex Milowski on 2013-05-17 (public-rdf-comments@w3.org from May 2013)

From: Alex Milowski <alex@milowski.com>
Date: Fri, 17 May 2013 14:17:53 -0700
To: Peter Occil <poccil14@gmail.com>
Cc: Richard Cyganiak <richard@cyganiak.de>, "Eric Prud'hommeaux" <eric@w3.org>, "public-rdf-comments@w3.org" <public-rdf-comments@w3.org>
Message-ID: <CABp3FNLy6vZUwTNDmEYuT52oHRmwgosGx1uwfaFiZ991tdcCbg@mail.gmail.com>
It would be great to mark these tests with something to indicate these
concerns.  Option (3) is really a solution you can implement.
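
A minimal sketch of what option (3) could look like in a JavaScript tokenizer, assuming strings are exposed as UTF-16 code units. The helper name isAstralNameChar is hypothetical and not part of any Turtle parser; it only tests whether two adjacent code units form a surrogate pair that decodes into [#x10000-#xEFFFF].

```javascript
// Hypothetical helper (not from any existing parser): returns true when the
// two UTF-16 code units starting at `index` form a surrogate pair that
// decodes to a code point in the PN_CHARS_BASE range [#x10000-#xEFFFF].
function isAstralNameChar(str, index) {
  const high = str.charCodeAt(index);     // NaN if index is out of range
  const low = str.charCodeAt(index + 1);  // NaN if there is no next unit
  // High surrogates D800..DB7F combined with any low surrogate DC00..DFFF
  // cover exactly U+10000..U+EFFFF.
  return high >= 0xD800 && high <= 0xDB7F &&
         low >= 0xDC00 && low <= 0xDFFF;
}

console.log(isAstralNameChar("\uD800\uDC00", 0)); // true  (U+10000)
console.log(isAstralNameChar("\uDB7F\uDFFF", 0)); // true  (U+EFFFF)
console.log(isAstralNameChar("\uDB80\uDC00", 0)); // false (U+F0000)
```

A tokenizer using a check like this would advance by two code units on a match and by one otherwise.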


On Fri, May 17, 2013 at 2:13 PM, Peter Occil <poccil14@gmail.com> wrote:

>   I believe the point of what Alex is saying is that an implementation in
> JavaScript that represents Turtle documents as strings ought to take care
> to interpret Unicode characters in strings correctly, since Unicode
> supports many more characters than the 16-bit code units in JavaScript
> strings can represent. For example, the charCodeAt method in JavaScript
> returns only the value of the 16-bit code unit at the given index of the
> string, not necessarily the Unicode character found there, and can
> therefore return a surrogate code point.
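
The charCodeAt behaviour described here can be illustrated with a short sketch. It assumes an ES2015 or later runtime, since codePointAt and the \u{...} escape were not generally available when this thread was written.

```javascript
// U+10000 is the first code point outside the Basic Multilingual Plane.
const s = "\u{10000}";

// charCodeAt reports individual 16-bit code units, so it sees two surrogates.
const units = [];
for (let i = 0; i < s.length; i++) {
  units.push(s.charCodeAt(i).toString(16));
}

// Iterating the string (here through Array.from) walks whole code points.
const codePoints = Array.from(s, (ch) => ch.codePointAt(0).toString(16));

console.log(s.length);   // 2
console.log(units);      // [ 'd800', 'dc00' ]
console.log(codePoints); // [ '10000' ]
```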
>
> This applies not only to JavaScript, but also to Java, .NET, and any other
> environment whose strings consist of 16-bit Unicode code units.
>
> Accordingly, I believe this issue can be resolved by option 2.
>
> --Peter
>
>  *From:* Richard Cyganiak <richard@cyganiak.de>
> *Sent:* Friday, May 17, 2013 3:45 PM
> *To:* Alex Milowski <alex@milowski.com>
> *Cc:* Eric Prud'hommeaux <eric@w3.org> ; public-rdf-comments@w3.org
> *Subject:* Re: Surrogate Code Points in Tests?
>
>  I understand only half of what's being discussed in this thread. Maybe
> this is relevant: RDF Concepts says that IRIs and literals SHOULD be in
> Unicode normal form C.
>
> Apologies if that's not relevant here.
>
> Best,
> Richard
>
>
> On 17 May 2013, at 19:17, Alex Milowski <alex@milowski.com> wrote:
>
>  The attached file (UTF-8 version) is essentially the same as the one I
> got from the test suite with just slightly different line endings.
>
> The problem here is that U+10000 and U+EFFFF are not representable in
> UCS-2 and it appears that most browsers are using UCS-2 at some level for
> the strings in JavaScript.  As such, these code points are turned into
> surrogate pairs (U+D800 + U+DC00 and U+DB7F + U+DFFF) which aren't actually
> part of UCS-2 as legal code points.  The surrogate characters are also not
> allowed by the productions in prefixes and other such names.
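
The two surrogate pairs quoted above follow from the standard UTF-16 encoding arithmetic. The sketch below is illustrative only; the function name toSurrogatePair is hypothetical.

```javascript
// Hypothetical helper: split a supplementary code point (U+10000..U+10FFFF)
// into its UTF-16 high and low surrogate code units.
function toSurrogatePair(codePoint) {
  if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
    throw new RangeError("not a supplementary code point");
  }
  const offset = codePoint - 0x10000;     // 20-bit value
  const high = 0xD800 + (offset >> 10);   // top 10 bits
  const low = 0xDC00 + (offset & 0x3FF);  // bottom 10 bits
  return [high, low];
}

const hex = (pair) => pair.map((u) => u.toString(16).toUpperCase());

console.log(hex(toSurrogatePair(0x10000))); // [ 'D800', 'DC00' ]
console.log(hex(toSurrogatePair(0xEFFFF))); // [ 'DB7F', 'DFFF' ]
```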
>
> The test is unusable in an environment, like most current browsers, that
> produces surrogate pairs from the UTF-8 encoding.  That is, unusable without
> some enhanced level of support for surrogates.
>
> There are a number of options to resolve this:
>
>    1. The test will just fail for such environments without special
> support for surrogates.
>    2. We mark this test as requiring surrogate / UTF-16 handling.  Passing
> the test is a quality-of-implementation question.
>    3. Such environments are expected to check for pairs of surrogate
> values in place of the code points [#x10000-#xEFFFF] even though the
> resulting identifier is unlikely to be generally usable.
>
> Option (1) means I just fail the test and move on. I'm not particularly
> partial to this.
>
> Option (2) allows greater diversity of processor environments but
> certainly runs counter to promoting proper Unicode support.  It would be
> good to have a variant that tests only those in the Basic Multilingual
> Plane.
>
> Option (3) allows a processor to pass and punts the use of the code point
> back to the user.  In the case of these particular code points and
> identifiers, the user of the result will have similar issues when
> inspecting or using values from the processor.  This will be no different
> for data received from other places that has the same code points, as
> JavaScript will treat them all the same and produce surrogate pairs.
>
> There absolutely needs to be much more documentation attached to this
> test.  The identifiers (e.g. prefix) are constructed by taking the first
> and last character of each range in the PN_CHARS_BASE production.  I have
> to admit I really didn't recognize that until much later in my research
> into why this test didn't pass.  In addition, the general issue of UCS-2
> vs UTF-16 representations and surrogates needs to be minimally referenced
> in the description.
>
> It would be good for everyone else to avoid a lot of the research I just
> did with a well-written test description.
>
> BTW, there's a good write-up of JavaScript's issues at [1]
>
> [1] http://mathiasbynens.be/notes/javascript-encoding
>
>
> On Fri, May 17, 2013 at 4:37 AM, Eric Prud'hommeaux <eric@w3.org> wrote:
>
>> * Alex Milowski <alex@milowski.com> [2013-05-16 23:44-0700]
>> > In looking at test:
>> >
>> >    prefix_with_PN_CHARS_BASE_character_boundaries.ttl  [1]
>> >
>> > There are the code points u+dc00, u+db7f, and u+dfff in the last part of
>> > the prefix.  The code points u+d800-u+dfff are not valid Unicode
>> > characters.
>> >
>> > Why are these in a positive test?
>> >
>> > [1]
>> >
>> https://dvcs.w3.org/hg/rdf/raw-file/default/rdf-turtle/tests-ttl/prefix_with_PN_CHARS_BASE_character_boundaries.ttl
>>
>> In the prefix, I see these codepoints: u+41 u+5a u+61 u+7a u+c0 u+d6
>> u+d8 u+f6 u+f8 u+2ff u+370 u+37d u+37f u+1fff u+200c u+200d u+2070
>> u+218f u+2c00 u+2fef u+3001 u+d7ff u+f900 u+fdcf u+fdf0 u+fffd u+10000
>> u+effff .
>>
>> I've attached two variants of the test, one encoded in UTF-8, which is
>> legal turtle and has only the codepoints above, and UTF-16, which uses
>> surrogates to encode the codepoints u+10000 and u+effff . Is it
>> possible that your buffer got re-encoded as UTF-16 at some point?
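
As a rough illustration of the difference between the two attached variants: the UTF-8 form of u+10000 is four bytes with no surrogates involved, while the same character held in a JavaScript string is two surrogate code units. The sketch assumes a runtime with a global TextEncoder (current browsers and recent Node.js).

```javascript
const ch = "\u{10000}";

// UTF-8 bytes as they would appear in the .ttl file.
const utf8 = Array.from(new TextEncoder().encode(ch), (b) =>
  b.toString(16).toUpperCase()
);

// UTF-16 code units as seen through the JavaScript string API.
const utf16 = ch.split("").map((u) => u.charCodeAt(0).toString(16).toUpperCase());

console.log(utf8);  // [ 'F0', '90', '80', '80' ]
console.log(utf16); // [ 'D800', 'DC00' ]
```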
>>
>> If this message resolves your comment, please reply with "[RESOLVED]"
>> in the subject.
>>
>> > --
>> > --Alex Milowski
>> > "The excellence of grammar as a guide is proportional to the paucity of
>> the
>> > inflexions, i.e. to the degree of analysis effected by the language
>> > considered."
>> >
>> > Bertrand Russell in a footnote of Principles of Mathematics
>>
>> --
>> -ericP
>>
>
>
>
> --
> --Alex Milowski
> "The excellence of grammar as a guide is proportional to the paucity of the
> inflexions, i.e. to the degree of analysis effected by the language
> considered."
>
> Bertrand Russell in a footnote of Principles of Mathematics
>
>


-- 
--Alex Milowski
"The excellence of grammar as a guide is proportional to the paucity of the
inflexions, i.e. to the degree of analysis effected by the language
considered."

Bertrand Russell in a footnote of Principles of Mathematics
Received on Friday, 17 May 2013 21:18:24 UTC