Re: [web-annotation] (model) vague definition of charactor position for text position selector from aphillips via GitHub on 2016-05-18 (public-annotation@w3.org from May 2016)

From: aphillips via GitHub <sysbot+gh@w3.org>
Date: Wed, 18 May 2016 17:55:40 +0000
To: public-annotation@w3.org
Message-ID: <issue_comment.created-220107371-1463594139-sysbot+gh@w3.org>

(chair hat off)

Working in UTF-16 code units has certain advantages, particularly for 
JavaScript programmers.

Some downsides of defining things in UTF-16 code units should be kept 
in mind:

- The actual wire format for content is usually UTF-8, and 
implementations in some programming languages use UTF-8 rather than 
UTF-16 internally. Defining offsets in terms of one specific encoding 
form creates counting artifacts when working in the other encoding.

- Files often contain escapes such as NCRs (numeric character 
references), which occupy a different number of code units in the 
file than the code point count of the characters they represent. 
Escape expansion must be taken into account when specifying an offset.

- Splitting a multi-code-unit sequence in the middle (in UTF-8 or 
UTF-16) produces U+FFFD replacement characters in the output and is 
experienced as a bug. With the rapid and wide adoption of emoji, 
supplementary characters (surrogate pairs in UTF-16) can no longer be 
considered a rare oddity or quirk.

- The points about grapheme boundary selection are, as @duerst 
suggests, not about the low-level definition of the annotation format.
 However, there should be recommended language related to text 
boundary processing so that implementations consider the needs of 
customers whose languages use combining marks or other complex 
combining sequences that are possible in Unicode. 

- Past shortfalls in JS are slowly being fixed. The "problems" that 
JavaScript experienced mostly had to do with how regex interacted with 
text. Code point boundary detection is relatively simple (though 
still a couple of lines of code, to be sure): it only has to ensure 
that the high and low surrogates stay together.
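
To illustrate the last point, here is a minimal JavaScript sketch of the surrogate check. The helper names `splitsSurrogatePair` and `snapToCodePointBoundary` are hypothetical (not from any specification), and offsets are assumed to be UTF-16 code unit indexes into a JavaScript string. Snapping backward rather than forward is an arbitrary choice for the example.

```javascript
// Hypothetical helper: true if `offset` (in UTF-16 code units) falls
// between the high and low halves of a surrogate pair in `text`.
function splitsSurrogatePair(text, offset) {
  if (offset <= 0 || offset >= text.length) return false;
  const prev = text.charCodeAt(offset - 1);
  const next = text.charCodeAt(offset);
  const prevIsHigh = prev >= 0xd800 && prev <= 0xdbff;
  const nextIsLow = next >= 0xdc00 && next <= 0xdfff;
  return prevIsHigh && nextIsLow;
}

// Hypothetical helper: moves the offset back one code unit if it would
// otherwise split a surrogate pair; leaves it unchanged if not.
function snapToCodePointBoundary(text, offset) {
  return splitsSurrogatePair(text, offset) ? offset - 1 : offset;
}

// 'a', U+1F600 (two UTF-16 code units), 'b'
const sample = "a\u{1F600}b";
console.log(sample.length);                      // 4
console.log(snapToCodePointBoundary(sample, 2)); // 1
console.log(snapToCodePointBoundary(sample, 3)); // 3
```

Note that this only protects code point boundaries. It does nothing for grapheme clusters built from combining marks, which is the separate concern raised above.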

On the flip side, a number of other specifications *do* specify things 
in terms of UTF-16 code units, and UTF-16 is JavaScript's native 
internal encoding. It may be that the additional implementation 
complexity of counting code points turns out not to be worth it. If 
you do go with code units, be sure that it is clear that this does 
not extend to code units in the various legacy (non-Unicode) character 
encodings that are still sometimes used for storing resources used on 
the Web.
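
As a rough sketch of what the trade-off looks like in JavaScript, the following counts one string three ways and converts a code point offset into a UTF-16 code unit offset. The function names `countUnits` and `codePointToCodeUnitOffset` are hypothetical, and the sketch assumes an environment where `TextEncoder` is available as a global (modern browsers and Node.js).

```javascript
// Counts the same string in UTF-16 code units, code points, and UTF-8 bytes.
function countUnits(text) {
  return {
    utf16CodeUnits: text.length,
    codePoints: Array.from(text).length, // string iteration is by code point
    utf8Bytes: new TextEncoder().encode(text).length,
  };
}

// Hypothetical helper: converts an offset counted in code points into an
// offset counted in UTF-16 code units. Offsets past the end of the string
// are clamped to text.length.
function codePointToCodeUnitOffset(text, codePointOffset) {
  let units = 0;
  let seen = 0;
  for (const ch of text) {
    if (seen === codePointOffset) break;
    units += ch.length; // 1 for BMP characters, 2 for supplementary ones
    seen += 1;
  }
  return units;
}

// U+00E9, U+1F600, 'x'
const mixed = "\u00E9\u{1F600}x";
console.log(countUnits(mixed));
// { utf16CodeUnits: 4, codePoints: 3, utf8Bytes: 7 }
console.log(codePointToCodeUnitOffset(mixed, 2)); // 3
```

The conversion is a linear scan, which is the kind of overhead that counting in code points adds for an implementation whose strings are indexed by UTF-16 code units.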

-- 
GitHub Notification of comment by aphillips
Please view or discuss this issue at 
https://github.com/w3c/web-annotation/issues/206#issuecomment-220107371
 using your GitHub account

Received on Wednesday, 18 May 2016 17:55:42 UTC