Re: Surrogate Code Points in Tests? from Peter Occil on 2013-05-17 (public-rdf-comments@w3.org from May 2013)

From: Peter Occil <poccil14@gmail.com>
Date: Fri, 17 May 2013 17:13:40 -0400
To: "Richard Cyganiak" <richard@cyganiak.de>, "Alex Milowski" <alex@milowski.com>
Cc: "Eric Prud'hommeaux" <eric@w3.org>, <public-rdf-comments@w3.org>
Message-ID: <84A3F1167F2E4ED4B9A997B8AC8F1864@PeterPC>
I believe Alex's point is that a JavaScript implementation that represents
Turtle documents as strings must take care to interpret Unicode characters
in those strings correctly, since Unicode defines many more characters than
a single 16-bit code unit in a JavaScript string can represent.
For example, the charCodeAt method in JavaScript returns only the value
of the 16-bit code unit at the given index of the string, not necessarily
the Unicode character found there, and can therefore return a surrogate code point.
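
As a rough sketch of the difference (my own illustration; the helper name
codePointAtIndex is hypothetical and not taken from any test suite or spec),
the following TypeScript combines a high and a low surrogate into one code
point where charCodeAt alone would return only one half:

```typescript
// Hypothetical helper: returns the Unicode code point that starts at index i,
// combining a surrogate pair when one is present.
function codePointAtIndex(s: string, i: number): number {
  const hi = s.charCodeAt(i);
  // High (lead) surrogates occupy 0xD800..0xDBFF.
  if (hi >= 0xd800 && hi <= 0xdbff && i + 1 < s.length) {
    const lo = s.charCodeAt(i + 1);
    // Low (trail) surrogates occupy 0xDC00..0xDFFF.
    if (lo >= 0xdc00 && lo <= 0xdfff) {
      return 0x10000 + ((hi - 0xd800) << 10) + (lo - 0xdc00);
    }
  }
  // A BMP character, or an unpaired surrogate returned as-is.
  return hi;
}

// U+10000 written as its two UTF-16 code units.
const sample = "\ud800\udc00";
console.log(sample.length);                            // 2
console.log(sample.charCodeAt(0).toString(16));        // d800
console.log(codePointAtIndex(sample, 0).toString(16)); // 10000
```

A caller that walks the string with such a helper has to advance by two
code units whenever the returned value is 0x10000 or greater.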

This applies not only to JavaScript, but also to Java, .NET, and any other
environment whose strings consist of 16-bit Unicode code units.

Accordingly, I believe this issue can be resolved by option 2.

--Peter

From: Richard Cyganiak 
Sent: Friday, May 17, 2013 3:45 PM
To: Alex Milowski 
Cc: Eric Prud'hommeaux ; public-rdf-comments@w3.org 
Subject: Re: Surrogate Code Points in Tests?

I understand only half of what's being discussed in this thread. Maybe this is relevant: RDF Concepts says that IRIs and literals SHOULD be in Unicode normal form C.

Apologies if that's not relevant here.

Best,
Richard


On 17 May 2013, at 19:17, Alex Milowski <alex@milowski.com> wrote:


  The attached file (UTF-8 version) is essentially the same as the one I got from the test suite with just slightly different line endings.   

  The problem here is that U+10000 and U+EFFFF are not representable in UCS-2, and it appears that most browsers use UCS-2 at some level for strings in JavaScript.  As such, these code points are turned into surrogate pairs (U+D800 + U+DC00 and U+DB7F + U+DFFF), and surrogates are not legal code points in UCS-2.  The surrogate characters are also not allowed by the productions in prefixes and other such names.
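
  The arithmetic behind those two pairs can be sketched as follows. This is only an illustration of the standard UTF-16 encoding rule; the names toSurrogatePair and hex are made up for the example.

```typescript
// Illustrative helper: splits a supplementary code point (U+10000..U+10FFFF)
// into its UTF-16 high and low surrogates.
function toSurrogatePair(cp: number): [number, number] {
  const offset = cp - 0x10000;          // 20-bit value
  const hi = 0xd800 + (offset >> 10);   // top 10 bits
  const lo = 0xdc00 + (offset & 0x3ff); // bottom 10 bits
  return [hi, lo];
}

// Formats a number as upper-case hexadecimal.
function hex(n: number): string {
  return n.toString(16).toUpperCase();
}

console.log(toSurrogatePair(0x10000).map(hex).join(" ")); // D800 DC00
console.log(toSurrogatePair(0xeffff).map(hex).join(" ")); // DB7F DFFF
```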

  The test is unusable in an environment that, like most current browsers, produces surrogate pairs from the UTF-8 encoding.  That is, it is unusable without some enhanced level of support for surrogates.

  There are a number of options to resolve this:

     1. The test will just fail for such environments without special support for surrogates.
     2. We mark this test as requiring surrogate / UTF-16 handling.  Passing the test is a quality-of-implementation question.
     3. Such environments are expected to check for pairs of surrogate values in place of the code points [#x10000-#xEFFFF] even though the resulting identifier is unlikely to be generally usable.

  Option (1) means I just fail the test and move on. I'm not particularly partial to this.

  Option (2) allows greater diversity of processor environments but certainly runs counter to promoting proper Unicode support.  It would be good to have a variant that tests only those in the Basic Multilingual Plane.

  Option (3) allows a processor to pass and punts the use of the code point back to the user.  In the case of these particular code points and identifiers, the user of the result will have similar issues when inspecting or using values from the processor.  This will be no different for data received from other places that has the same code points, as JavaScript will treat them all the same and produce surrogate pairs.
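
  A minimal sketch of what option (3) might look like, assuming the processor scans a string of 16-bit code units. The function name matchPnCharsBase and its return convention are hypothetical; the ranges are those of the PN_CHARS_BASE production.

```typescript
// PN_CHARS_BASE ranges from the Turtle grammar, as inclusive code point pairs.
const PN_CHARS_BASE: number[][] = [
  [0x41, 0x5a], [0x61, 0x7a], [0xc0, 0xd6], [0xd8, 0xf6], [0xf8, 0x2ff],
  [0x370, 0x37d], [0x37f, 0x1fff], [0x200c, 0x200d], [0x2070, 0x218f],
  [0x2c00, 0x2fef], [0x3001, 0xd7ff], [0xf900, 0xfdcf], [0xfdf0, 0xfffd],
  [0x10000, 0xeffff],
];

// Hypothetical matcher: returns how many UTF-16 code units at index i form
// one PN_CHARS_BASE character (1 or 2), or 0 if there is no match.
function matchPnCharsBase(s: string, i: number): number {
  if (i >= s.length) return 0;
  let cp = s.charCodeAt(i);
  let units = 1;
  // A high surrogate followed by a low surrogate stands for one code point.
  if (cp >= 0xd800 && cp <= 0xdbff && i + 1 < s.length) {
    const lo = s.charCodeAt(i + 1);
    if (lo >= 0xdc00 && lo <= 0xdfff) {
      cp = 0x10000 + ((cp - 0xd800) << 10) + (lo - 0xdc00);
      units = 2;
    }
  }
  for (let k = 0; k < PN_CHARS_BASE.length; k++) {
    if (cp >= PN_CHARS_BASE[k][0] && cp <= PN_CHARS_BASE[k][1]) return units;
  }
  return 0; // includes unpaired surrogates, which fall in no range
}

console.log(matchPnCharsBase("\ud800\udc00", 0)); // 2
console.log(matchPnCharsBase("\ud800", 0));       // 0
```

  With this convention the caller advances by the returned count, so a surrogate pair is consumed as a single character and a lone surrogate is still rejected.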

  There absolutely needs to be much more documentation attached to this test.  The identifiers (e.g. prefix) are constructed by taking the first and last character of each range in the PN_CHARS_BASE production.  I have to admit I really didn't recognize that until much later in my research into why this test didn't pass.  In addition, the general issue of UCS-2 vs UTF-16 representations and surrogates needs, at a minimum, to be referenced in the description.
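
  To make that construction concrete, here is a small reconstruction of such an identifier. It is a sketch under my own assumptions, not the script used to generate the test, and the names BOUNDARY_RANGES, encodeCodePoint and boundaryIdentifier are invented for it.

```typescript
// Inclusive [first, last] code point ranges of PN_CHARS_BASE in the Turtle grammar.
const BOUNDARY_RANGES: number[][] = [
  [0x41, 0x5a], [0x61, 0x7a], [0xc0, 0xd6], [0xd8, 0xf6], [0xf8, 0x2ff],
  [0x370, 0x37d], [0x37f, 0x1fff], [0x200c, 0x200d], [0x2070, 0x218f],
  [0x2c00, 0x2fef], [0x3001, 0xd7ff], [0xf900, 0xfdcf], [0xfdf0, 0xfffd],
  [0x10000, 0xeffff],
];

// Encodes one code point as a JavaScript string of one or two code units.
function encodeCodePoint(cp: number): string {
  if (cp < 0x10000) return String.fromCharCode(cp);
  const offset = cp - 0x10000;
  return String.fromCharCode(0xd800 + (offset >> 10), 0xdc00 + (offset & 0x3ff));
}

// Concatenates the first and last character of every range.
function boundaryIdentifier(): string {
  let out = "";
  for (let k = 0; k < BOUNDARY_RANGES.length; k++) {
    out += encodeCodePoint(BOUNDARY_RANGES[k][0]);
    out += encodeCodePoint(BOUNDARY_RANGES[k][1]);
  }
  return out;
}

const identifier = boundaryIdentifier();
// 28 code points, but 30 code units: the last two need surrogate pairs.
console.log(identifier.length); // 30
```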

  A well-written test description would let everyone else avoid a lot of the research I just did.

  BTW, there's a good write-up of JavaScript's issues at [1]

  [1] http://mathiasbynens.be/notes/javascript-encoding



  On Fri, May 17, 2013 at 4:37 AM, Eric Prud'hommeaux <eric@w3.org> wrote:

    * Alex Milowski <alex@milowski.com> [2013-05-16 23:44-0700]

    > In looking at test:
    >
    >    prefix_with_PN_CHARS_BASE_character_boundaries.ttl  [1]
    >
    > There are the code points u+dc00, u+db7f, and u+dfff in the last part of
    > the prefix.  The code points u+d800-u+dfff are not valid Unicode characters.
    >
    > Why are these in a positive test?
    >
    > [1]
    > https://dvcs.w3.org/hg/rdf/raw-file/default/rdf-turtle/tests-ttl/prefix_with_PN_CHARS_BASE_character_boundaries.ttl


    In the prefix, I see these codepoints: u+41 u+5a u+61 u+7a u+c0 u+d6
    u+d8 u+f6 u+f8 u+2ff u+370 u+37d u+37f u+1fff u+200c u+200d u+2070
    u+218f u+2c00 u+2fef u+3001 u+d7ff u+f900 u+fdcf u+fdf0 u+fffd u+10000
    u+effff .

    I've attached two variants of the test, one encoded in UTF-8, which is
    legal Turtle and has only the codepoints above, and one in UTF-16, which
    uses surrogates to encode the codepoints u+10000 and u+effff . Is it
    possible that your buffer got re-encoded as UTF-16 at some point?

    If this message resolves your comment, please reply with "[RESOLVED]"
    in the subject.


    > --
    > --Alex Milowski
    > "The excellence of grammar as a guide is proportional to the paucity of the
    > inflexions, i.e. to the degree of analysis effected by the language
    > considered."
    >
    > Bertrand Russell in a footnote of Principles of Mathematics


    --
    -ericP





Received on Friday, 17 May 2013 21:14:23 UTC