Re: [css-text] I18N-ISSUE-313: Definition of grapheme clusters from fantasai on 2014-06-25 (www-international@w3.org from April to June 2014)

From: fantasai <fantasai.lists@inkedblade.net>
Date: Wed, 25 Jun 2014 08:11:03 -0700
To: Richard Ishida <ishida@w3.org>, www-international@w3.org
CC: "CSS WWW Style (www-style@w3.org)" <www-style@w3.org>
Message-ID: <53AAE687.1010403@inkedblade.net>

On 05/22/2014 11:09 AM, Richard Ishida wrote:
>
> One is that, as I mentioned already, it is not correct to say 'the "user-perceived character", also known as the grapheme
> cluster.'  The equivalent term for a user-perceived character is 'grapheme'.  The 'grapheme cluster' is a unit derived from
> rules in Unicode to yield an *approximation* to a user-perceived character.  Not all user-perceived characters are grapheme
> clusters.

I'm fine with removing that phrase if it's problematic.
Is it problematic in UAX29 also? (Does it need a bug filed there?)

> Another is a worry whether we can really effectively split
> the world into semantically-perceived and visually-perceived
> characters - especially given the 'etc' that appears in the
> definition where we list appropriate operations for each.
> For example, are we sure that first-letter operations require
> semantically- rather than visually-perceived characters in all
> cases?  Where does cursor movement fit here? etc.

I think I have to conclude that no, we can't.

> What about Arabic justification which may involve increasing
> word-internal 'gaps' that occur due to one glyph not joining
> with the following glyph. These are relevant units for
> justification of Arabic text, but they aren't user-perceived
> characters.

Is that really a relevant concept? Increasing word-internal
'gaps' is a horrible way to justify Arabic text, look:
   http://dev.w3.org/csswg/css-text/arabic-stretch-unjoined
It results in uneven typographic color and obscures word
boundaries. It might exist, but I've never seen it...

> And what about the case where Indic script text units vary
> according to the font in use.  As I understand it, a text
> unit for wrapping or stretching in Devanagari can encompass
> a CvCVD (consonant, virama, consonant, vowel sign, diacritic)
> only if the font has glyphs to show this is a single visual
> unit (eg. ligatures, half-forms, special glyphs) and hides
> the virama. If the font is changed, such that the virama
> becomes visible, we are now dealing with two text units.
> This font-specific behaviour for the same sequence of code
> points is a contextual difference that, I think, cuts across
> both the semantic- and visual- categories currently defined.

Okay.
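
To make the Devanagari case concrete, here is a minimal Python sketch.
It assumes a deliberately simplified segmentation rule (a base code
point plus any combining marks that follow it), which is only a rough
stand-in for grapheme clusters and is not the UAX29 algorithm. The
names approx_clusters, sample and units are hypothetical, chosen for
illustration only.

```python
import unicodedata


def approx_clusters(text):
    """Group each base code point with the combining marks that follow it.

    Assumption: this is a simplified approximation for illustration,
    not a full implementation of UAX #29 segmentation.
    """
    clusters = []
    for ch in text:
        # General categories Mn, Mc and Me are the Unicode mark categories;
        # the Devanagari virama and vowel signs fall under them.
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


# A CvCVD sequence: KA, VIRAMA, SSA, VOWEL SIGN I, CANDRABINDU.
sample = "\u0915\u094D\u0937\u093F\u0901"
units = approx_clusters(sample)

print(len(sample), "code points")
print(len(units), "units under the simplified rule")
```

Under this rule the virama closes the first unit, so the sequence is
split at the code point level regardless of whether the font renders
it as a single conjunct. The font-dependent behaviour described above
cannot be derived from the code points alone.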

> I think that actually all we may be trying to say is that
> the atomic unit of text for a particular operation may not
> be the same as for another, but that we start from a base
> of grapheme clusters and require the application to take
> into account variances and extensions of that as needed.
> What if we simply talk in terms of vague 'typographic units',
> or 'text units', or some such, but describe up front how
> these can be different sequences of code points depending
> on the operation to be performed (ie. not try to define
> just two specific scenarios)?

Overall, I agree with the concept, but I want to make sure that
the spec is somehow understandable to people who are not either
   a) members of the i18nWG or a similar community, or
   b) text layout implementation experts

(If Lea Verou cannot make sense of the CSS Text spec well enough
to use it as a reference for the properties it defines, then I
consider the spec to be a failure.)

I've reworked the Terminology section following your suggestions;
I will work on the rest of the spec tomorrow and hopefully have it
all make sense soon. ^_^

~fantasai

Received on Wednesday, 25 June 2014 15:11:41 UTC