- From: r12a via GitHub <sysbot+gh@w3.org>
- Date: Tue, 29 Aug 2023 16:18:46 +0000
- To: public-i18n-archive@w3.org
> Generally speaking, most text navigation and editing follows the user-perceived character boundaries. For most implementations this corresponds to Unicode's definition of "default extended grapheme cluster" boundaries [[UAX29](https://unicode.org/reports/tr29/)]. The main exception to this is backspacing, which usually follows Unicode code point boundaries in the underlying encoded text (although there are exceptions to this). For the simplest scripts and languages, these often amount to the same thing.

This and other parts of the document strike me as over-simplified and in places incorrect, but there are terminological problems (which we are already familiar with) that cloud the issue.

My experience in working with these things has led me to view the world in terms of code points, which are grouped into grapheme clusters, which are in turn grouped into orthographic syllables. (I'm in the process of writing that up more clearly, elsewhere...)

I'm inclined to agree with Norbert that this idea of user-perceived character boundaries is too vague and not clearly substantiated enough to be used as the name of a unit of segmentation. Rather, it's merely a way of helping people imagine why code point units are not sufficient _in some cases_. The distinction between grapheme clusters and orthographic syllables is not informed by its use, but is crucial to the information provided by this article.

My experience has shown that browsers use these 3 different units for text operations such as cursor movement and deletion. Which unit is used varies by language, sometimes inconsistently within a single language, and also from browser to browser.

I've been investigating this and writing up results for the various browsers in my orthography notes, under the section "Graphemes". It may be worth going to https://r12a.github.io/scripts/switch.html and selecting the 'graphemes' segment id, then cycling through the orthographies using the control "Select an orthography". You should especially look for the subheading "Browser behaviour", where it exists, to find the results per browser. (I was wondering whether it would be useful to list behaviour against orthography in a table of some sort – not necessarily in this article, but somewhere.)

That said, it's not clear to me what your source of authoritative information is about how cursoring and deletion should work. I don't think the UAX makes clear how things should work; rather, it is left up to the application to decide the exact mechanism (?). Or do you mean to describe what browsers currently do? I think it would be good to make that much clearer.

I also think that the article should make it much clearer (actually, I think it's hardly mentioned at all other than for one Thai example) that very different segmentation rules may apply for other operations on the text, such as line breaking, justification, text spacing, and the like – and that this is not an issue, but is useful.

The exceptions section alludes to the importance of orthographic syllables, but this isn't really an exception, even in terms of current browser support. Again it varies by browser and by orthography, but it's something that needs to be mentioned either together with, or given equal importance to, the section entitled "Combining characters".

--
GitHub Notification of comment by r12a

Please view or discuss this issue at https://github.com/w3c/i18n-drafts/pull/520#issuecomment-1697758649 using your GitHub account

--
Sent via github-notify-ml as configured in https://github.com/w3c/github-notify-ml-config
Received on Tuesday, 29 August 2023 16:18:49 UTC