Re: [svgwg] Character counting in text 'x', 'y', 'dx', 'dy', and 'rotate' attributes. from r12a via GitHub on 2018-10-02 (public-svg-issues@w3.org from October 2018)

From: r12a via GitHub <sysbot+gh@w3.org>
Date: Tue, 02 Oct 2018 13:10:03 +0000
To: public-svg-issues@w3.org
Message-ID: <issue_comment.created-426267210-1538485796-sysbot+gh@w3.org>

Sorry to be late to the party. (Btw, if you add an i18n-tracking label to an issue, it should pop up in our WG daily notifications, so others may have seen it while i was travelling. That may help next time.)

This is an interesting discussion. I don't think i have a clear answer for you, but i may be able to help a little. You may find it useful to refer to some new material, recently added to one of our articles, that describes code points vs grapheme clusters vs typographic character units; however, i think you probably understand most of that stuff already. Note in particular that i believe you have correctly identified that the CSS typographic character unit is very contextually dependent. Here are a few other thoughts from me, off the top of my head...

First, i think it was always a BAD MISTAKE to ever define strings in terms of UTF-16 code units. In order to apply offsets per Amelia's test you'd have to be aware of which characters were supplementary chars and which weren't in order to create a step of characters. The same applies for counting things, since you never want to separate the two UTF-16 code units that make up a single character.
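To make the surrogate-pair point concrete, here is a minimal Python sketch. The sample string and the helper names `utf16_code_units` and `is_surrogate` are made up for illustration; they don't come from the SVG spec or from Amelia's test. It shows that one supplementary character occupies two UTF-16 code units, so an index counted in code units can land between the two halves.

```python
from typing import List


def utf16_code_units(text: str) -> List[int]:
    """Return the UTF-16 code units that encode `text`."""
    data = text.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little") for i in range(0, len(data), 2)]


def is_surrogate(unit: int) -> bool:
    """True if a UTF-16 code unit is one half of a surrogate pair."""
    return 0xD800 <= unit <= 0xDFFF


# Hypothetical sample: 'a', U+10400 (a supplementary character), 'b'.
sample = "a\U00010400b"

units = utf16_code_units(sample)
print(len(sample))                 # 3 code points (Python indexes str by code point)
print(len(units))                  # 4 UTF-16 code units
print([hex(u) for u in units])     # ['0x61', '0xd801', '0xdc00', '0x62']
print([is_surrogate(u) for u in units])
```

An offset list indexed by code unit would need two entries for the middle character, and applying a different offset to each half makes no sense.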

If you go with grapheme clusters, users may still get some odd effects unexpectedly. Take the following example in Bangla: kshī (ক্ষি) is made up of _two_ grapheme clusters. If you were creating Amelia's stepped character display, you'd end up with

![screen shot 2018-10-02 at 13 44 53](https://user-images.githubusercontent.com/4839211/46349217-5f627380-c649-11e8-9c36-18beee84f0c7.png)

rather than all grouped together like

![screen shot 2018-10-02 at 13 43 15](https://user-images.githubusercontent.com/4839211/46349222-65585480-c649-11e8-877e-35a9d15eb29c.png)

The reason this isn't taken care of by Unicode grapheme cluster rules is that it's tricky. What constitutes a user-perceived character in this case depends on which script is being used, and to an extent on what the font does too, since it's only a single user-perceived character if the sequence forms a conjunct (i.e. the glyphs are combined into a unit).

Apart from that, I'd certainly like to be able to highlight code points sometimes rather than grapheme clusters - eg. when colouring diacritics or other combining characters in educational material, or even sometimes when explaining grapheme clusters to be able to colour each component part differently!

Of course, one encounters similar problems with code points. The stepped character display would look even worse if it showed up as

![screen shot 2018-10-02 at 13 51 41](https://user-images.githubusercontent.com/4839211/46349582-56be6d00-c64a-11e8-8559-3441cebe917c.png)

On the other hand, if you wanted to explain to someone what characters make up that conjunct (perhaps with horizontal movement rather than vertical) this could be quite useful.
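For that kind of explanation, the component characters can be listed with Python's standard `unicodedata` module. This is only a sketch: it assumes the Bangla example above is the four-code-point sequence U+0995 U+09CD U+09B7 U+09BF, as the text appears in this message, and `describe` is a hypothetical helper name.

```python
import unicodedata

# Assumed code point sequence for the Bangla example in this message.
conjunct = "\u0995\u09CD\u09B7\u09BF"


def describe(text: str) -> list:
    """Pair each code point (as U+XXXX) with its Unicode character name."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>")) for ch in text]


for code_point, name in describe(conjunct):
    print(code_point, name)
```

Each printed row corresponds to one step in the per-code-point display shown in the screenshot above.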

It seems to me that perhaps a stepped character display like Amelia's test would probably always need to be hand crafted, so that the right things stick together(?)

However, counting characters is perhaps something else. As i said before, i wouldn't want to use UTF-16 code units for counting, any more than i'd use bytes. I also think that grapheme cluster counts don't give enough precision for some use cases, and it's possible that the rules for what constitutes a grapheme cluster may be extended in the future. I think that code points are probably the best way to go.
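As a rough illustration of how the candidate counting units diverge, the Python sketch below counts the same text in UTF-8 bytes, UTF-16 code units and code points. The `counts` helper and the second sample (U+1F600) are assumptions added for the example. Grapheme clusters are left out on purpose, since Python's standard library has no segmenter for them and the result would depend on which Unicode version a third-party library implements.

```python
def counts(text: str) -> dict:
    """Count `text` in three different units."""
    return {
        "utf8_bytes": len(text.encode("utf-8")),
        "utf16_code_units": len(text.encode("utf-16-le")) // 2,
        "code_points": len(text),  # Python str length is in code points
    }


# Assumed code point sequence for the Bangla example in this message.
conjunct = "\u0995\u09CD\u09B7\u09BF"
# Hypothetical second sample: a single supplementary-plane emoji.
emoji = "\U0001F600"

print(counts(conjunct))  # {'utf8_bytes': 12, 'utf16_code_units': 4, 'code_points': 4}
print(counts(emoji))     # {'utf8_bytes': 4, 'utf16_code_units': 2, 'code_points': 1}
```

Only the code point count is independent of the encoding form, and it doesn't change if segmentation rules are revised later.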

As far as emoji go, here we are entering a world where the question of what constitutes a unit becomes even further complicated. This is because an emoji picture can be made up of many component parts. Perhaps a useful example can be found in the slides i just put together for Paris Web – see the juggling girl and family emojis at https://www.w3.org/International/talks/1810-paris/index.html#truncation.

![screen shot 2018-10-02 at 14 08 39](https://user-images.githubusercontent.com/4839211/46350435-b0c03200-c64c-11e8-8496-200136e45f1f.png)
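A small Python sketch of why such emoji are awkward to count or truncate. It assumes a commonly used family sequence (man, woman, girl, boy joined by U+200D ZERO WIDTH JOINER), which may not be the exact sequence used in the slides; `emoji_components` and `truncate_code_points` are hypothetical helpers.

```python
ZWJ = "\u200D"  # ZERO WIDTH JOINER

# Assumed example: man + ZWJ + woman + ZWJ + girl + ZWJ + boy.
family = ZWJ.join(["\U0001F468", "\U0001F469", "\U0001F467", "\U0001F466"])


def emoji_components(text: str) -> list:
    """Split a ZWJ sequence into the emoji it is built from."""
    return [part for part in text.split(ZWJ) if part]


def truncate_code_points(text: str, limit: int) -> str:
    """Naive truncation after `limit` code points."""
    return text[:limit]


print(len(family))                           # 7 code points
print(len(family.encode("utf-16-le")) // 2)  # 11 UTF-16 code units
print(len(emoji_components(family)))         # 4 component emoji

cut = truncate_code_points(family, 2)
print(cut.endswith(ZWJ))                     # True: the cut leaves a dangling joiner
```

One picture on screen is seven code points here, and cutting at an arbitrary code point boundary changes what is displayed rather than just shortening it.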

I don't know how helpful all that is, but hopefully a little.

--
GitHub Notification of comment by r12a
Please view or discuss this issue at https://github.com/w3c/svgwg/issues/537#issuecomment-426267210 using your GitHub account

Received on Tuesday, 2 October 2018 13:10:10 UTC