Re: [css-text] I18N-ISSUE-308: Definition of 'grapheme cluster'

From: fantasai <fantasai.lists@inkedblade.net>
Date: Sat, 10 May 2014 11:19:45 -0700
Message-ID: <536E6DC1.5010508@inkedblade.net>
To: "Phillips, Addison" <addison@lab126.com>, Koji Ishii <kojiishi@gluesoft.co.jp>
CC: "CSS WWW Style (www-style@w3.org)" <www-style@w3.org>, www International <www-international@w3.org>

On 04/20/2014 02:41 PM, Phillips, Addison wrote:
>> Referring to UAX#29 here is a good idea, but could you confirm your intention
>> of the suggested change?
>
> The concern here was that the statement as written is exceedingly vague.
> There are many "typographic traditions" as there are many languages and
> scripts. Some guidance on what to do seemed warranted.

I think because we don't have any more specific guidance, the statement
must remain exceedingly vague.

>> * “further tailor” to “extend grapheme cluster boundaries” looks like you’re
>> suggesting to prohibit shrinking grapheme cluster boundaries, but I suppose it’s
>> not your intention, is it? Isn’t “tailor” more appropriate word to use here, in
>> terms of giving more flexibilities to implementers, and it’s the word widely
>> used in UAX#29?
>
> In the main, we do mean "extend", since that's what usually needs to happen.
> I can't, off hand, think of a case where the cluster is reduced in size,
> but that doesn't mean there isn't one. Tailor, as a result, is probably
> the better word choice.

Thai, for the purposes of letter-spacing (though not for line-breaking),
does things other than extending the definitions in UAX29, as illustrated
in the example.

>> * Is your intention of adding “as identified by the content’s language” to
>> prohibit tailoring unless content language is specified? My thought was that it’s
>> better not to have such restrictions from I18N perspective. Do I misunderstand
>> your suggestion?
>
> Different languages and cultures have different "typographic traditions".
> So there needed to be some kind of indication in the text about what a
> "typographic tradition" is and how to apply it.

typographic - of or pertaining to typography (the art, craft, or process
   of composing type and printing from it)

tradition - a customary or characteristic method or manner

I think that covers what a "typographic tradition" is reasonably well.
Since we do not, in general, cite an English dictionary for terms used
in their ordinary dictionary sense, I don't think further clarification
is necessary.

As for how to apply it, do you mean in a technical sense? For example,
a UA could use a font feature to do letter-spacing. Or it could slice
and dice the glyphs as needed. Or something else. But how it does it
is out-of-scope for CSS to define at the moment (because it depends
on the architecture of the typographic system underlying the layout
engine) so we can't be adding prose to pin that down.

> Since these traditions are linked to different languages or cultures
> (and are neither wholly generalized nor can they be inferred solely
> from the script/codepoints in the text), the user-agent needs to
> infer them from available data in the page/text, probably from
> language tags (if any exist). In the absence of language tags, it is
> still possible to apply language-specific tailoring (by guessing the
> language or assuming some default).

Sure. Some things can be handled at the codepoint level and some
things need language information to choose among various behaviors.
The text is not contradicting this.
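
To make the codepoint-level vs. language-level split concrete, here is a
rough Python sketch, not anything the spec mandates. Everything named in it
is hypothetical: base_clusters() is only a crude stand-in for UAX29 (a base
character plus following combining marks, nothing more), and the "sk" rule
treating "ch" as one unit is just an illustration of a language-keyed
tailoring. With no language tag, or an unknown one, it falls back to the
untailored behavior.

```python
import unicodedata
from typing import Callable, Dict, List, Optional

Tailoring = Callable[[List[str]], List[str]]


def base_clusters(text: str) -> List[str]:
    """Crude stand-in for UAX29: attach combining marks to the previous base.

    This is NOT the full UAX29 rule set (no Hangul, emoji, or CRLF handling).
    """
    clusters: List[str] = []
    for ch in text:
        if clusters and unicodedata.category(ch) in ("Mn", "Mc", "Me"):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


def merge_digraph(digraph: str) -> Tailoring:
    """Build a hypothetical tailoring that joins a two-letter digraph."""
    def tailor(clusters: List[str]) -> List[str]:
        out: List[str] = []
        for cluster in clusters:
            if out and (out[-1] + cluster).lower() == digraph:
                out[-1] += cluster
            else:
                out.append(cluster)
        return out
    return tailor


# Illustrative table only; a real UA would draw on actual typographic data.
TAILORINGS: Dict[str, Tailoring] = {"sk": merge_digraph("ch")}


def segment(text: str, lang: Optional[str] = None) -> List[str]:
    """Apply the base segmentation, then any tailoring for the language."""
    clusters = base_clusters(text)
    primary = (lang or "").split("-")[0].lower()  # "sk-SK" -> "sk"
    tailor = TAILORINGS.get(primary)
    return tailor(clusters) if tailor else clusters
```

So segment("chata", "sk") keeps "ch" together, while segment("chata") with
no language information returns the five untailored units.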

> The goal was not to prohibit or restrict grapheme boundary tailoring,
> but to provide some way for implementations to connect code to content.
> Otherwise I read this sentence as saying, basically, "The UA can split
> the text wherever it feels it is convenient to do so and no guarantee
> of interoperability of selection is provided."

The UA can split the text however it makes sense to do so according to
a typographic tradition (that is pre-existing; it can't make one up,
else it wouldn't be a tradition). No, indeed, no guarantee of interop
is provided, if there is more than one traditional behavior for a set
of codepoints. However, the UA can't do whatever it wants. It has to
conform to either UAX29 or some pre-existing typographic tradition.

> In the ideal case, UAX#29 would supply a complete description of
> grapheme boundary selection, including tailorings (perhaps via CLDR)
> and we would just point there. In the absence of that, it makes sense
> to me to try to enforce a certain level of interoperability, while
> permitting the development of better text segmentation, particularly
> in some of the Indic scripts that are known to have unaddressed
> corner cases.

Right, and since we don't have exact answers, the spec says, in effect,
"here's a basic algorithm in UAX29; please tailor it as appropriate to
the best of your knowledge."

And that's really the best we can do. Unless you have a reference
providing more specific information that we can point to.

~fantasai
Received on Saturday, 10 May 2014 18:20:13 UTC