Re: Surrogate Code Points in Tests? from Alex Milowski on 2013-05-17 (public-rdf-comments@w3.org from May 2013)

From: Alex Milowski <alex@milowski.com>
Date: Fri, 17 May 2013 11:17:44 -0700
To: "Eric Prud'hommeaux" <eric@w3.org>
Cc: "public-rdf-comments@w3.org" <public-rdf-comments@w3.org>
Message-ID: <CABp3FNKKtN8vac7=7gjzL2Y_KfH9wq0p56zAA5KAaO8Rv-SmEg@mail.gmail.com>
The attached file (UTF-8 version) is essentially the same as the one I got
from the test suite with just slightly different line endings.

The problem here is that U+10000 and U+EFFFF are not representable in UCS-2,
and it appears that most browsers use UCS-2 at some level for strings in
JavaScript.  As such, these code points are turned into surrogate pairs
(U+D800 + U+DC00 and U+DB7F + U+DFFF), which aren't legal code points in
UCS-2.  The surrogate characters are also not allowed by the productions
for prefixes and other such names.
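
For anyone who wants to check the arithmetic behind those two pairs, here
is a rough sketch in Python (used only for illustration; the function name
to_surrogate_pair is my own and not from any spec or library):

```python
from typing import Tuple


def to_surrogate_pair(code_point: int) -> Tuple[int, int]:
    """Split a code point above U+FFFF into UTF-16 high and low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need a surrogate pair")
    offset = code_point - 0x10000      # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)     # top 10 bits select the high surrogate
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits select the low surrogate
    return high, low


for cp in (0x10000, 0xEFFFF):
    high, low = to_surrogate_pair(cp)
    print("U+%X -> U+%04X U+%04X" % (cp, high, low))
# prints:
# U+10000 -> U+D800 U+DC00
# U+EFFFF -> U+DB7F U+DFFF
```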

The test is unusable in environments, like most current browsers, that
produce surrogate pairs from the UTF-8 encoding.  That is, it is unusable
without some enhanced level of support for surrogates.

There are a number of options to resolve this:

   1. The test will just fail for such environments without special support
for surrogates.
   2. We mark this test as requiring surrogate / UTF-16 handling.  Passing
the test is a quality-of-implementation question.
   3. Such environments are expected to check for pairs of surrogate values
in place of the code points [#x10000-#xEFFFF] even though the resulting
identifier is unlikely to be generally usable.

Option (1) means I just fail the test and move on. I'm not particularly
partial to this.

Option (2) allows greater diversity of processor environments but certainly
runs counter to promoting proper Unicode support.  It would be good to have
a variant that tests only those in the Basic Multilingual Plane.

Option (3) allows a processor to pass and punts the use of the code point
back to the user.  In the case of these particular code points and
identifiers, the user of the result will have similar issues when
inspecting or using values from the processor.  This will be no different
for data received from other places that has the same code points, as
JavaScript will treat them all the same and produce surrogate pairs.
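
To make option (3) a bit more concrete, here is a minimal sketch of what
"checking for pairs of surrogate values" could look like.  It is Python
over a plain list of 16-bit code units, standing in for what a JavaScript
string exposes; both function names are hypothetical, and a real Turtle
tokenizer would fold this into its own character handling:

```python
from typing import List


def code_points_from_utf16_units(units: List[int]) -> List[int]:
    """Combine surrogate pairs in a list of 16-bit code units.

    Unpaired surrogates are passed through unchanged so that the caller
    can still reject them against the grammar.
    """
    result = []
    i = 0
    while i < len(units):
        unit = units[i]
        is_high = 0xD800 <= unit <= 0xDBFF
        has_low = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        if is_high and has_low:
            # Recombine the two 10-bit halves into one code point.
            result.append(0x10000 + ((unit - 0xD800) << 10) + (units[i + 1] - 0xDC00))
            i += 2
        else:
            result.append(unit)
            i += 1
    return result


def in_supplementary_name_range(code_point: int) -> bool:
    """True for the [#x10000-#xEFFFF] range discussed above."""
    return 0x10000 <= code_point <= 0xEFFFF


units = [0x41, 0xD800, 0xDC00, 0xDB7F, 0xDFFF]
decoded = code_points_from_utf16_units(units)
print([hex(cp) for cp in decoded])
# prints: ['0x41', '0x10000', '0xeffff']
```

A lone surrogate (say, a U+DC00 with nothing before it) comes back as-is
and fails the range check, so the processor can still treat it as an error.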

There absolutely needs to be much more documentation attached to this test.
The identifiers (e.g. prefix) are constructed by taking the first and last
character of each range in the PN_CHARS_BASE production.  I have to admit I
really didn't recognize that until much later in my research into why this
test didn't pass.  In addition, the general issue of UCS-2 vs UTF-16
representations and surrogates needs at least a reference in the
description.
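
As an illustration of that construction, here is a sketch in Python.  The
range table is transcribed from the code point list quoted further down in
this message, so treat it as my reading of the test rather than an
authoritative copy of the grammar; the variable names are mine:

```python
# First and last code point of each PN_CHARS_BASE range, as listed
# in the quoted message (assumed to match the production).
PN_CHARS_BASE_RANGES = [
    (0x41, 0x5A), (0x61, 0x7A), (0xC0, 0xD6), (0xD8, 0xF6),
    (0xF8, 0x2FF), (0x370, 0x37D), (0x37F, 0x1FFF), (0x200C, 0x200D),
    (0x2070, 0x218F), (0x2C00, 0x2FEF), (0x3001, 0xD7FF),
    (0xF900, 0xFDCF), (0xFDF0, 0xFFFD), (0x10000, 0xEFFFF),
]

# The prefix is just every boundary, in order.
boundary_code_points = [cp for first_last in PN_CHARS_BASE_RANGES for cp in first_last]
prefix = "".join(chr(cp) for cp in boundary_code_points)

# Boundaries a UCS-2-only environment can hold as single units.
bmp_only = [cp for cp in boundary_code_points if cp <= 0xFFFF]

# Number of 16-bit units the same prefix occupies in UTF-16.
utf16_units = len(prefix.encode("utf-16-be")) // 2

print(len(boundary_code_points), len(bmp_only), utf16_units)
# prints: 28 26 30
```

The two extra UTF-16 units are exactly the surrogate pairs for U+10000 and
U+EFFFF; the 26 remaining boundaries are what a BMP-only variant of the
test would cover.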

It would be good for everyone else to avoid a lot of the research I just
did, with the help of a well-written test description.

BTW, there's a good write-up of JavaScript's issues at [1]

[1] http://mathiasbynens.be/notes/javascript-encoding


On Fri, May 17, 2013 at 4:37 AM, Eric Prud'hommeaux <eric@w3.org> wrote:

> * Alex Milowski <alex@milowski.com> [2013-05-16 23:44-0700]
> > In looking at test:
> >
> >    prefix_with_PN_CHARS_BASE_character_boundaries.ttl  [1]
> >
> > There are the code points u+dc00, u+db7f, and u+dfff in the last part of
> > the prefix.  The code points u+d800-u+dfff are not valid unicode
> > characters.
> >
> > Why are these in a positive test?
> >
> > [1]
> >
> https://dvcs.w3.org/hg/rdf/raw-file/default/rdf-turtle/tests-ttl/prefix_with_PN_CHARS_BASE_character_boundaries.ttl
>
> In the prefix, I see these codepoints: u+41 u+5a u+61 u+7a u+c0 u+d6
> u+d8 u+f6 u+f8 u+2ff u+370 u+37d u+37f u+1fff u+200c u+200d u+2070
> u+218f u+2c00 u+2fef u+3001 u+d7ff u+f900 u+fdcf u+fdf0 u+fffd u+10000
> u+effff .
>
> I've attached two variants of the test, one encoded in UTF-8, which is
> legal turtle and has only the codepoints above, and UTF-16, which uses
> surrogates to encode the codepoints u+10000 and u+effff . Is it
> possible that your buffer got re-encoded as UTF-16 at some point?
>
> If this message resolves your comment, please reply with "[RESOLVED]"
> in the subject.
>
> > --
> > --Alex Milowski
> > "The excellence of grammar as a guide is proportional to the paucity of
> > the inflexions, i.e. to the degree of analysis effected by the language
> > considered."
> >
> > Bertrand Russell in a footnote of Principles of Mathematics
>
> --
> -ericP
>



-- 
--Alex Milowski
"The excellence of grammar as a guide is proportional to the paucity of the
inflexions, i.e. to the degree of analysis effected by the language
considered."

Bertrand Russell in a footnote of Principles of Mathematics
Received on Friday, 17 May 2013 18:18:15 UTC