[web-annotation] TextPositionSelector, thoughts about Unicode code *point* vs. UTF16 code *unit* from Daniel Weck via GitHub on 2016-08-22 (public-annotation@w3.org from August 2016)

From: Daniel Weck via GitHub <sysbot+gh@w3.org>
Date: Mon, 22 Aug 2016 13:24:39 +0000
To: public-annotation@w3.org
Message-ID: <issues.opened-172451442-1471872277-sysbot+gh@w3.org>

danielweck has just created a new issue for 
https://github.com/w3c/web-annotation:

== TextPositionSelector, thoughts about Unicode code *point* vs. UTF16
 code *unit* ==
Hello all,
CC @iherman @azaroth42

In the EPUB3 CFI (Canonical Fragment Identifier) specification, which 
may be used in "Open Annotation in EPUB" 
( http://www.idpf.org/epub/oa/ ), character-level offsets are defined 
as UTF-16 code *units*, not Unicode code *points*.

Current implementations of CFI (parsing / processing libraries, and 
text highlighting / rendering tools) that are written in JavaScript 
benefit from the fact that the DOM Range API and the ECMAScript string 
API also count in code *units*, so CFI offsets can be passed to them 
directly, with no handling / translation of Unicode surrogate pairs. 
See my comment here: 
https://github.com/IDPF/epub-revision/issues/555#issuecomment-144962949
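
To illustrate what "direct code unit support" means in the ECMAScript 
string API, here is a small sketch. The sample string and variable 
names are only illustrative assumptions, not taken from any CFI 
library.

```typescript
// "a", then U+1F600 written as its UTF-16 surrogate pair (D83D DE00), then "b".
const sample: string = "a\uD83D\uDE00b";

// String length counts UTF-16 code units: 1 + 2 + 1.
const codeUnitLength: number = sample.length;

// slice() offsets are code-unit offsets; [1, 3) covers the whole surrogate pair.
const emoji: string = sample.slice(1, 3);

// Offset 2 points into the middle of the pair and yields the low surrogate.
const unitAtOffset2: number = sample.charCodeAt(2);

console.log(codeUnitLength, emoji.length, unitAtOffset2.toString(16));
```

An offset that is already expressed in code units can therefore be 
used as-is with these calls, whereas a code *point* offset cannot.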

So, although this design approach seems to work pretty well in EPUB3 / 
XHTML5, I wonder whether it is also relevant in the broader Open 
Web Platform context. For example, would a JavaScript implementation 
of TextPositionSelector need to translate back and forth between 
Unicode code *points* and UTF-16 code *units* in order for the data to 
flow between the serialization format and the consuming web APIs?
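
As a rough sketch of what such a translation could look like, 
assuming (hypothetically) that the serialized offsets were code 
*points*: the function names below are made up for illustration, and a 
lone surrogate is counted as one code point, as ECMAScript string 
iteration does.

```typescript
function isHighSurrogate(unit: number): boolean {
  return unit >= 0xD800 && unit <= 0xDBFF;
}

function isLowSurrogate(unit: number): boolean {
  return unit >= 0xDC00 && unit <= 0xDFFF;
}

// True when a well-formed surrogate pair starts at the given code-unit index.
function isPairAt(text: string, unitIndex: number): boolean {
  return (
    unitIndex + 1 < text.length &&
    isHighSurrogate(text.charCodeAt(unitIndex)) &&
    isLowSurrogate(text.charCodeAt(unitIndex + 1))
  );
}

// Serialized code-point offset -> code-unit offset usable with string / DOM APIs.
// Offsets past the end of the text are clamped to text.length.
function codePointToCodeUnitOffset(text: string, codePointOffset: number): number {
  let unitIndex = 0;
  let pointsSeen = 0;
  while (pointsSeen < codePointOffset && unitIndex < text.length) {
    unitIndex += isPairAt(text, unitIndex) ? 2 : 1;
    pointsSeen += 1;
  }
  return unitIndex;
}

// Code-unit offset from string / DOM APIs -> code-point offset for serialization.
// Counts the code points that begin before the given code-unit offset.
function codeUnitToCodePointOffset(text: string, codeUnitOffset: number): number {
  const limit = Math.min(codeUnitOffset, text.length);
  let unitIndex = 0;
  let pointsSeen = 0;
  while (unitIndex < limit) {
    unitIndex += isPairAt(text, unitIndex) ? 2 : 1;
    pointsSeen += 1;
  }
  return pointsSeen;
}

// Usage: "a", U+1F600 as a surrogate pair, "b".
const text: string = "a\uD83D\uDE00b";
console.log(codePointToCodeUnitOffset(text, 2)); // offset of "b" in code units
console.log(codeUnitToCodePointOffset(text, 3)); // offset of "b" in code points
```

Both directions need a linear scan of the text before the offset, 
which is the extra work a code-unit-based definition avoids.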

Any other thoughts?

PS: I am also cross-posting this here: 
https://github.com/IDPF/epub-revision/issues/555#issuecomment-241407747

Please view or discuss this issue at 
https://github.com/w3c/web-annotation/issues/350 using your GitHub 
account

Received on Monday, 22 August 2016 13:24:45 UTC