- From: r12a via GitHub <sysbot+gh@w3.org>
- Date: Thu, 05 May 2016 13:00:04 +0000
- To: public-annotation@w3.org
I expect that the i18n WG will discuss this and provide a more formal answer. In the meantime, maybe this can help:

https://www.w3.org/International/techniques/developing-specs#char_string
https://www.w3.org/International/techniques/developing-specs#char_indexing

(Follow the 'more' links, where needed, for rationales and explanations.)

The above links make the fundamental point that text pointers should count characters, not bytes.

Having said that, because of backwards compatibility requirements, Unicode often allows two canonically equivalent representations of the same text, such as U+00E1 LATIN SMALL LETTER A WITH ACUTE vs. U+0061 LATIN SMALL LETTER A followed by U+0301 COMBINING ACUTE ACCENT. So if you are matching text containing á, there are cases where you'd want to normalise both sides to the same form (usually the precomposed one) to make the match work.

If you are simply pointing to a position in the text, however, I'm not sure that you need to normalise. On the other hand, you may want to take into account the fact that U+0061 LATIN SMALL LETTER A followed by U+0301 COMBINING ACUTE ACCENT is not something that ought to be split by a selection. For this, you need to consider the text as a series of grapheme clusters.

-- 
GitHub Notification of comment by r12a
Please view or discuss this issue at https://github.com/w3c/web-annotation/issues/206#issuecomment-217145856 using your GitHub account
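
As a rough illustration of the points above about normalisation and grapheme clusters, here is a minimal Python sketch. The helper names nfc_equal and safe_offsets are made up for this example, and the boundary check is only an approximation based on combining classes, not the full Unicode grapheme cluster segmentation rules.

```python
import unicodedata
from typing import List

precomposed = "\u00e1"   # U+00E1 LATIN SMALL LETTER A WITH ACUTE
decomposed = "a\u0301"   # U+0061 followed by U+0301 COMBINING ACUTE ACCENT


def nfc_equal(a: str, b: str) -> bool:
    """Compare two strings after normalising both to NFC (precomposed form)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


def safe_offsets(text: str) -> List[int]:
    """Return code point offsets that do not fall directly before a
    combining mark.

    Assumption: this is a simplification. Real grapheme cluster boundaries
    cover more cases (for example Hangul jamo and emoji sequences).
    """
    return [
        i
        for i in range(len(text) + 1)
        if i == len(text) or unicodedata.combining(text[i]) == 0
    ]


# The two forms differ as raw strings but match once normalised.
print(precomposed == decomposed)            # False
print(nfc_equal(precomposed, decomposed))   # True

# Counting code points and counting UTF-8 bytes give different offsets.
print(len(decomposed))                      # 2
print(len(decomposed.encode("utf-8")))      # 3

# Offset 1 would split "a" from its accent, so it is not offered.
print(safe_offsets("a\u0301b"))             # [0, 2, 3]
```

A selector that only accepts offsets from something like safe_offsets would avoid splitting a base letter from its accent, while matching on quoted text would go through something like nfc_equal.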
Received on Thursday, 5 May 2016 13:00:12 UTC