Re: [web-annotation] TextPositionSelector, thoughts about Unicode code *point* vs. UTF16 code *unit* from Addison Phillips via GitHub on 2016-09-30 (public-annotation@w3.org from September 2016)

From: Addison Phillips via GitHub <sysbot+gh@w3.org>
Date: Fri, 30 Sep 2016 19:20:57 +0000
To: public-annotation@w3.org
Message-ID: <issue_comment.created-250830391-1475263255-sysbot+gh@w3.org>

@azaroth42: +1

While working in code points is *awesome*, the reality of the Web is
often UTF-16 code units, because of DOMString. APIs and data
structures based on UTF-16 code units do not directly insulate users
from problems with surrogate pairs (and neither surrogate handling
nor code point counting deals at all with grapheme clustering), but
proper character handling can and should still be provided by
higher-level implementations and protocols.
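
To make the distinction concrete, here is a minimal sketch of the two
counts for the same string. The function names are hypothetical and
chosen for illustration only; they are not part of any specification
or API discussed in this thread.

```typescript
// True if the UTF-16 code unit is a high (lead) surrogate.
function isHighSurrogate(unit: number): boolean {
  return unit >= 0xd800 && unit <= 0xdbff;
}

// True if the UTF-16 code unit is a low (trail) surrogate.
function isLowSurrogate(unit: number): boolean {
  return unit >= 0xdc00 && unit <= 0xdfff;
}

// Number of UTF-16 code units: this is what String.prototype.length reports.
function codeUnitLength(s: string): number {
  return s.length;
}

// Number of Unicode code points: a well-formed surrogate pair counts once.
// An unpaired surrogate is counted as one item rather than rejected.
function codePointLength(s: string): number {
  let count = 0;
  for (let i = 0; i < s.length; i++) {
    count++;
    const pairFollows =
      isHighSurrogate(s.charCodeAt(i)) &&
      i + 1 < s.length &&
      isLowSurrogate(s.charCodeAt(i + 1));
    if (pairFollows) {
      i++; // skip the low surrogate of the pair
    }
  }
  return count;
}

// "a", U+1F600 (written as a surrogate pair), "b"
const sample = "a\uD83D\uDE00b";
const units = codeUnitLength(sample); // 4 code units
const points = codePointLength(sample); // 3 code points
```

Neither count corresponds to what a user perceives as a character:
"e" followed by U+0301 (combining acute accent) is two code units and
two code points, but a single grapheme cluster.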

No process needs to deal with surrogate code *points* (that is,
character values in the range U+D800 to U+DFFF). However, there is no
reason to state that, just because offsets are defined in UTF-16 code
units, a process cannot handle supplementary characters (i.e.
characters represented by a surrogate pair of code *units*).
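
As a sketch of how a process might keep code-unit offsets while still
treating supplementary characters as whole, the following checks
whether an offset falls between the two halves of a surrogate pair and
widens a selection so it never splits one. The function names and the
"expand outward" policy are assumptions made for illustration; the
thread does not prescribe either.

```typescript
// True if `offset` (in UTF-16 code units) falls between the high and low
// halves of a surrogate pair in `text`.
function splitsSurrogatePair(text: string, offset: number): boolean {
  if (offset <= 0 || offset >= text.length) {
    return false;
  }
  const before = text.charCodeAt(offset - 1);
  const after = text.charCodeAt(offset);
  const beforeIsHigh = before >= 0xd800 && before <= 0xdbff;
  const afterIsLow = after >= 0xdc00 && after <= 0xdfff;
  return beforeIsHigh && afterIsLow;
}

// Select text[start, end) by code-unit offsets, moving either boundary
// outward if it would otherwise cut a supplementary character in half.
// (Assumed policy: expand rather than shrink or reject.)
function selectByCodeUnits(text: string, start: number, end: number): string {
  const safeStart = splitsSurrogatePair(text, start) ? start - 1 : start;
  const safeEnd = splitsSurrogatePair(text, end) ? end + 1 : end;
  return text.slice(safeStart, safeEnd);
}

// "a", U+1F600 (written as a surrogate pair), "b"
const annotated = "a\uD83D\uDE00b";
const whole = selectByCodeUnits(annotated, 1, 3); // the full pair
const widened = selectByCodeUnits(annotated, 2, 3); // start moved back to 1
```

Here the offsets stay in UTF-16 code units throughout; only the
boundary check needs to know about surrogate pairs.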

I18N WG commented about an identical issue at TPAC, but I'm at a loss 
to put my finger on it just now.

-- 
GitHub Notification of comment by aphillips
Please view or discuss this issue at
https://github.com/w3c/web-annotation/issues/350#issuecomment-250830391
using your GitHub account

Received on Friday, 30 September 2016 19:21:12 UTC