Re: Surrogate Code Points in Tests? from Eric Prud'hommeaux on 2013-05-17 (public-rdf-comments@w3.org from May 2013)

From: Eric Prud'hommeaux <eric@w3.org>
Date: Fri, 17 May 2013 07:37:30 -0400
To: Alex Milowski <alex@milowski.com>
Cc: "public-rdf-comments@w3.org" <public-rdf-comments@w3.org>
Message-ID: <20130517113728.GC13487@w3.org>

* Alex Milowski <alex@milowski.com> [2013-05-16 23:44-0700]
> In looking at test:
> 
>    prefix_with_PN_CHARS_BASE_character_boundaries.ttl  [1]
> 
> There are the code points u+dc00, u+db7f, and u+dfff in the last part of
> the prefix.  The code points u+d800-u+dfff are not valid unicode characters.
> 
> Why are these in a positive test?
> 
> [1]
> https://dvcs.w3.org/hg/rdf/raw-file/default/rdf-turtle/tests-ttl/prefix_with_PN_CHARS_BASE_character_boundaries.ttl

In the prefix, I see these codepoints: u+41 u+5a u+61 u+7a u+c0 u+d6
u+d8 u+f6 u+f8 u+2ff u+370 u+37d u+37f u+1fff u+200c u+200d u+2070
u+218f u+2c00 u+2fef u+3001 u+d7ff u+f900 u+fdcf u+fdf0 u+fffd u+10000
u+effff .

I've attached two variants of the test: one encoded in UTF-8, which is
legal Turtle and has only the codepoints above, and one encoded in
UTF-16, which uses surrogate pairs to encode the codepoints u+10000
and u+effff. Is it possible that your buffer got re-encoded as UTF-16
at some point?
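
To make the re-encoding hypothesis concrete, here is a minimal sketch
(not part of the test suite; the helper name utf16_surrogates is made
up for illustration) of the standard UTF-16 arithmetic. Applied to
u+10000 and u+effff it should yield surrogate values that include the
u+dc00, u+db7f and u+dfff reported above:

```python
def utf16_surrogates(code_point: int) -> tuple:
    """Split a supplementary code point (U+10000..U+10FFFF) into the
    (high, low) surrogate pair that UTF-16 uses to encode it."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only supplementary code points use surrogates")
    offset = code_point - 0x10000          # 20-bit value
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low


# The two supplementary code points at the end of the prefix.
for cp in (0x10000, 0xEFFFF):
    high, low = utf16_surrogates(cp)
    # Cross-check the arithmetic against Python's own UTF-16 encoder.
    expected = high.to_bytes(2, "big") + low.to_bytes(2, "big")
    assert chr(cp).encode("utf-16-be") == expected
    print(f"U+{cp:04X} -> U+{high:04X} U+{low:04X}")
```

A tool that reports UTF-16 code units rather than decoded code points
would show those surrogate values even though the file itself contains
only the codepoints listed earlier.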

If this message resolves your comment, please reply with "[RESOLVED]"
in the subject.

> -- 
> --Alex Milowski
> "The excellence of grammar as a guide is proportional to the paucity of the
> inflexions, i.e. to the degree of analysis effected by the language
> considered."
> 
> Bertrand Russell in a footnote of Principles of Mathematics

-- 
-ericP

Attachments

text/turtle attachment: prefix_with_PN_CHARS_BASE_character_boundaries.ttl
application/octet-stream attachment: prefix_with_PN_CHARS_BASE_character_boundaries.utf-16

Received on Friday, 17 May 2013 11:37:59 UTC