Re: [web-annotation] (model) vague definition of charactor position for text position selector from r12a via GitHub on 2016-05-05 (public-annotation@w3.org from May 2016)

From: r12a via GitHub <sysbot+gh@w3.org>
Date: Thu, 05 May 2016 13:00:04 +0000
To: public-annotation@w3.org
Message-ID: <issue_comment.created-217145856-1462453203-sysbot+gh@w3.org>

I expect that the i18n WG will discuss this and provide a more formal 
answer. In the meantime, maybe this can help: 
https://www.w3.org/International/techniques/developing-specs#char_string
and
https://www.w3.org/International/techniques/developing-specs#char_indexing
(follow the 'more' links there for rationales and further explanation,
where needed).

The above links make the fundamental point that text pointers should 
use character boundaries, not bytes.  Having said that, because of 
backwards compatibility requirements, Unicode often allows two 
canonically equivalent representations of the same text, such as 
U+00E1 LATIN SMALL LETTER A WITH ACUTE vs. U+0061 LATIN SMALL LETTER A 
followed by U+0301 COMBINING ACUTE ACCENT.  So there are cases where, 
if you are matching text containing á, you'd want to normalise both 
the text and the search string to the same form (usually a precomposed 
form) to make the match work.
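
To make the matching case concrete, here is a minimal Python sketch 
using the standard library's `unicodedata.normalize`. The helper name 
`contains_normalized` is made up for this example, and NFC is assumed 
as the precomposed form; it is an illustration, not a proposal for 
what the spec should require.

```python
import unicodedata

precomposed = "\u00e1"   # U+00E1 LATIN SMALL LETTER A WITH ACUTE
decomposed = "a\u0301"   # U+0061 followed by U+0301 COMBINING ACUTE ACCENT

def contains_normalized(text: str, needle: str, form: str = "NFC") -> bool:
    """Hypothetical helper: substring match after normalising both sides."""
    return unicodedata.normalize(form, needle) in unicodedata.normalize(form, text)

# The two forms render the same but differ as code point sequences
# and as UTF-8 byte sequences.
print(precomposed == decomposed)                      # False
print(len(precomposed), len(decomposed))              # 1 2
print(len(precomposed.encode("utf-8")),
      len(decomposed.encode("utf-8")))                # 2 3

# A naive match fails; a match on normalised strings succeeds.
print(("caf" + precomposed) in ("caf" + decomposed))                   # False
print(contains_normalized("caf" + decomposed, "caf" + precomposed))   # True
```

Note that the code point and byte counts differ between the two forms, 
so any offset computed on one form does not carry over to the other.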

If you are simply pointing to a position in the text, however, I'm not 
sure that you need to normalise.  On the other hand, you may want to 
take into account the fact that U+0061 LATIN SMALL LETTER A followed 
by U+0301 COMBINING ACUTE ACCENT is not something that ought to be 
split by a selection. For this, you need to consider the text as a 
series of grapheme clusters.
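
As a rough sketch of what that could mean for a position, the Python 
below keeps a base character together with the combining marks that 
follow it, and moves an offset that lands inside such a sequence back 
to its start. This is only an approximation: it relies on 
`unicodedata.combining` (non-zero canonical combining class) and does 
not implement the full grapheme cluster rules of UAX #29, so it misses 
cases such as Hangul jamo sequences and emoji joined with ZWJ. The 
function names are hypothetical.

```python
import unicodedata

def approx_clusters(text: str) -> list:
    """Group each base character with the combining marks that follow it.

    Approximation only: not the full UAX #29 grapheme cluster algorithm.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch      # attach the mark to the previous cluster
        else:
            clusters.append(ch)     # start a new cluster
    return clusters

def snap_to_boundary(text: str, offset: int) -> int:
    """Move a code point offset back so it does not split base + marks."""
    while 0 < offset < len(text) and unicodedata.combining(text[offset]):
        offset -= 1
    return offset

text = "ma\u0301s"                   # m, a, COMBINING ACUTE ACCENT, s
print(len(text))                     # 4 code points
print(len(approx_clusters(text)))    # 3 approximate clusters
print(snap_to_boundary(text, 2))     # 1: offset 2 would split a + U+0301
print(snap_to_boundary(text, 3))     # 3: already on a boundary
```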

-- 
GitHub Notification of comment by r12a
Please view or discuss this issue at 
https://github.com/w3c/web-annotation/issues/206#issuecomment-217145856
using your GitHub account

Received on Thursday, 5 May 2016 13:00:12 UTC