- From: aphillips via GitHub <sysbot+gh@w3.org>
- Date: Wed, 18 May 2016 17:55:40 +0000
- To: public-annotation@w3.org
(chair hat off) Working in UTF-16 code units has certain advantages, particularly for JavaScript programmers. Some downsides of defining things in UTF-16 code units should be kept in mind:

- The actual wire format for content is usually UTF-8, and implementations in some programming languages use UTF-8 rather than UTF-16 internally. Designing offsets around a specific encoding scheme creates counting artifacts when working in the other encoding.

- Files often contain escapes, such as NCRs, that add to the code unit count in the file differently from the code point count. Escape expansion must be taken into account when specifying offsets.

- Splitting a multi-code unit sequence in the middle (in UTF-8 or UTF-16) produces U+FFFDs in the output and is experienced as a bug. With the rapid and wide adoption of emoji, supplementary characters (surrogate pairs in UTF-16) can no longer be considered a rare oddity or quirk.

- The points about grapheme boundary selection are, as @duerst suggests, not about the low-level definition of the annotation format. However, there should be recommended language related to text boundary processing so that implementations consider the needs of customers whose languages use combining marks or other complex combining sequences that are possible in Unicode.

- Past shortfalls in JS are slowly being fixed. The "problems" that JavaScript experienced mostly had to do with how regex interacted with text. For code point boundary detection, it is relatively simple (though still a couple of lines of code, sketched below) to ensure that a high/low surrogate pair sticks together.

On the flip side, a number of other specifications *do* specify things in terms of UTF-16, and UTF-16 is JavaScript's native internal encoding. It may be that the additional implementation complexity of counting code points turns out not to be worth the overhead.

If you do go with code units, be sure to make clear that this does not extend to code units in the various legacy (non-Unicode) character encodings that are still sometimes used for storing resources on the Web.
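For concreteness, here is a rough JavaScript sketch of the counting differences and the surrogate check mentioned above. It is illustrative only; the sample string and the `safeOffset` helper are my own placeholders, not anything proposed for the annotation model:

```js
// Rough sketch only -- illustrative, not a proposed API.
const text = "a\u00E9\u{1F600}"; // 'a', 'é', and an emoji (U+1F600)

// The same text gives three different "lengths" depending on what you count:
const codeUnits  = text.length;                            // 4 UTF-16 code units
const codePoints = [...text].length;                       // 3 code points
const utf8Bytes  = new TextEncoder().encode(text).length;  // 7 UTF-8 bytes

// An NCR in source markup ("&#x1F600;", 10 characters as typed) expands to the
// same emoji -- 1 code point, 2 UTF-16 code units, 4 UTF-8 bytes -- so offsets
// differ depending on whether they are taken before or after escape expansion.

// Keeping a surrogate pair together when cutting at a UTF-16 code unit offset:
// if the offset falls between a high and a low surrogate, back up by one so we
// don't produce a lone surrogate (which renders as U+FFFD downstream).
function safeOffset(str, offset) {
  if (offset > 0 && offset < str.length) {
    const before = str.charCodeAt(offset - 1);
    const at     = str.charCodeAt(offset);
    const isHigh = before >= 0xD800 && before <= 0xDBFF;
    const isLow  = at     >= 0xDC00 && at     <= 0xDFFF;
    if (isHigh && isLow) return offset - 1;
  }
  return offset;
}

safeOffset(text, 3); // 2 -- offset 3 would split the emoji's surrogate pair
```

Something like that check is what every consumer would need at every code unit offset if the spec counts code units.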
--
GitHub Notification of comment by aphillips
Please view or discuss this issue at https://github.com/w3c/web-annotation/issues/206#issuecomment-220107371 using your GitHub account

Received on Wednesday, 18 May 2016 17:55:42 UTC