Re: [web-annotation] TextPositionSelector, thoughts about Unicode code *point* vs. UTF16 code *unit* from Daniel Weck via GitHub on 2016-08-23 (public-annotation@w3.org from August 2016)

From: Daniel Weck via GitHub <sysbot+gh@w3.org>
Date: Tue, 23 Aug 2016 10:33:34 +0000
To: public-annotation@w3.org
Message-ID: <issue_comment.created-241691761-1471948412-sysbot+gh@w3.org>

Thanks @tkanai. I agree that higher-level processing on a Unicode 
'code point' basis has its benefits (notably, text selections / 
character ranges are functionally closer to how human-readable 
languages / scripts are structured), but I was wondering about 
implementation feasibility and costs (in particular: performance).

The use of UTF16 'code units' in EPUB3 CFI is consistent with the 
overall "low level" design (e.g. canonical syntax for XML element path 
based on numbered node references). So yes, CFI character ranges are 
totally unaware of Unicode "subtleties" such as grapheme clusters and 
surrogate pairs, which means that a CFI-authoring user interface must 
capture and constrain/adjust text selections in such a way that they 
make logical sense from the user's perspective (whilst the underlying 
CFI processor itself does not need to be "Unicode aware" to that 
degree). Web browsers implement high-level text selection pretty well 
already, so the responsibility of a typical CFI processing library 
basically boils down to handling the low-level UTF-16 code unit 
(UCS-2-style) offsets from DOM Ranges or the JavaScript string API (no 
need for sophisticated Punycode-like Unicode utilities).
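
To make the cost question concrete, here is a minimal sketch of what a 
code-point-based selector would require on top of the offsets that DOM 
Ranges and JavaScript strings already provide. The function names are 
hypothetical (they come from neither the CFI nor the Web Annotation 
specifications), and the sketch assumes plain UTF-16 strings with no 
grapheme cluster handling. Each conversion is a linear scan from the 
start of the text, which is where the performance concern arises.

```typescript
// Hypothetical helpers: not part of any specification or library.

function isHighSurrogate(unit: number): boolean {
  return unit >= 0xd800 && unit <= 0xdbff;
}

function isLowSurrogate(unit: number): boolean {
  return unit >= 0xdc00 && unit <= 0xdfff;
}

// Number of UTF-16 code units (1 or 2) used by the code point at index i.
// A lone (unpaired) surrogate is treated as a single code point.
function codePointWidthAt(text: string, i: number): number {
  const isPair =
    i + 1 < text.length &&
    isHighSurrogate(text.charCodeAt(i)) &&
    isLowSurrogate(text.charCodeAt(i + 1));
  return isPair ? 2 : 1;
}

// UTF-16 code unit offset (DOM Range / string index) -> code point offset.
// An offset that falls inside a surrogate pair is rounded up past the pair.
function codeUnitToCodePointOffset(text: string, unitOffset: number): number {
  let points = 0;
  let i = 0;
  while (i < unitOffset && i < text.length) {
    i += codePointWidthAt(text, i);
    points++;
  }
  return points;
}

// Code point offset -> UTF-16 code unit offset (usable with DOM Ranges).
function codePointToCodeUnitOffset(text: string, pointOffset: number): number {
  let i = 0;
  let points = 0;
  while (points < pointOffset && i < text.length) {
    i += codePointWidthAt(text, i);
    points++;
  }
  return i;
}

// "a", U+1F600 (one code point, two code units), "b"
const text = "a\uD83D\uDE00b";
console.log(text.length);                        // 4 code units
console.log(codeUnitToCodePointOffset(text, 3)); // 2: "b" is the third code point
console.log(codePointToCodeUnitOffset(text, 2)); // 3: "b" starts at code unit 3
```

A code-unit-based model like CFI skips these scans entirely, at the 
price of allowing offsets that split a surrogate pair.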

So, I am by no means claiming that the CFI model is applicable to, or 
superior to, TextPositionSelector; I am just wondering about the pros 
and cons :)

-- 
GitHub Notification of comment by danielweck
Please view or discuss this issue at 
https://github.com/w3c/web-annotation/issues/350#issuecomment-241691761
using your GitHub account

Received on Tuesday, 23 August 2016 10:33:42 UTC