Re: [css3-text] line break opportunities are based on *syllable* boundaries? from Andrew Cunningham on 2011-01-29 (www-international@w3.org from January to March 2011)

From: Andrew Cunningham <andrewc@vicnet.net.au>
Date: Sat, 29 Jan 2011 18:28:04 +1100
To: "Ambrose LI" <ambrose.li@gmail.com>
Cc: "Andrew Cunningham" <andrewc@vicnet.net.au>, "Koji Ishii" <kojiishi@gluesoft.co.jp>, "Phillips, Addison" <addison@lab126.com>, "Kang-Hao Lu" <kennyluck@w3.org>, "WWW Style" <www-style@w3.org>, "WWW International" <www-international@w3.org>
Message-ID: <17073f74e73b299690f2b598d634d465.squirrel@mail.vicnet.net.au>
For me, character/grapheme cluster boundary breaking is a last-resort
fallback option.

The best option is language-specific word boundary identification; a less
preferred option is language-specific syllable boundary identification;
the last resort is character/grapheme cluster boundary identification.
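
To make that ordering concrete, here is a minimal sketch in Python. It is
only an illustration under stated assumptions: the per-language segmenter
tables and every function name are hypothetical, and the grapheme cluster
step is a crude approximation (combining marks attached to the preceding
base character), not the full rules of Unicode Standard Annex #29.

```python
import unicodedata
from typing import Callable, Dict, List

# Hypothetical type: a segmenter takes text and returns its units in order.
Segmenter = Callable[[str], List[str]]


def approx_grapheme_clusters(text: str) -> List[str]:
    """Crude stand-in for UAX #29: glue combining marks onto the base."""
    clusters: List[str] = []
    for ch in text:
        # Mn, Mc and Me are the Unicode general categories for marks.
        if clusters and unicodedata.category(ch) in ("Mn", "Mc", "Me"):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


def segment_for_line_breaking(
    text: str,
    lang: str,
    word_segmenters: Dict[str, Segmenter],
    syllable_segmenters: Dict[str, Segmenter],
) -> List[str]:
    """Prefer words, then syllables, then grapheme clusters."""
    for table in (word_segmenters, syllable_segmenters):
        segmenter = table.get(lang)
        if segmenter is not None:
            return segmenter(text)
    # Last resort: no language-specific knowledge is available.
    return approx_grapheme_clusters(text)


# Katakana SO, prolonged sound mark, SU: no segmenter is registered for
# "ja" here, so the call falls through to the grapheme cluster step.
units = segment_for_line_breaking("\u30bd\u30fc\u30b9", "ja", {}, {})
```

With empty tables the call above falls straight through to the last
resort, which is exactly the case I would like implementations to avoid
whenever language-specific data is available.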

On Sat, January 29, 2011 16:49, Ambrose LI wrote:
> But isn't Koji's example showing exactly that you *don't* really want
> to arbitrarily break between syllable boundaries? His example has
> three syllables according to Japanese rules. Even by English rules
> it's two syllables, not one.
>
> 2011/1/28 Andrew Cunningham <andrewc@vicnet.net.au>:
>> Syllables and grapheme clusters are quite distinct and separate concepts.
>>
>> I'd argue that you do want syllable boundaries, rather than grapheme
>> cluster boundaries. But syllable boundaries are per language constructs,
>> based on the phonological and orthographic properties of that language.
>>
>> In Unicode terms, grapheme clusters have a more generic definition.
>>
>> But I doubt that grapheme clusters will give you what you want.
>>
>> On Fri, January 28, 2011 17:32, Koji Ishii wrote:
>>> I'm changing back to the original subject as you seem to be talking
>>> about
>>> the original topic, not the definition of "word".
>>>
>>> What I needed here is an appropriate terminology that represents single
>>> character within this context:
>>>
>>>> In several other writing systems, (including Chinese, Japanese, Yi,
>>>> and sometimes also Korean) a line break opportunities are based on
>>>> *syllable* boundaries, not words.
>>>
>>> I want "ソース" to count as three units, so from what you said, it sounds
>>> like "grapheme cluster" is the right choice of words to use here.
>>>
>>> I agree with you that the definition of "word" is different from
>>> grapheme cluster, and I guess answering that question is even more
>>> difficult.
>>>
>>>
>>> Regards,
>>> Koji
>>>
>>> -----Original Message-----
>>> From: Phillips, Addison [mailto:addison@lab126.com]
>>> Sent: Friday, January 28, 2011 2:22 PM
>>> To: Kang-Hao (Kenny) Lu; Koji Ishii
>>> Cc: WWW Style; WWW International
>>> Subject: RE: What's the definition of a word? (was: [css3-text] line
>>> break
>>> opportunities are based on *syllable* boundaries?)
>>>
>>> The term "grapheme cluster" would be wrong for this context. A grapheme
>>> cluster is a sequence of logical characters that form a single visual
>>> unit
>>> of text (what is sometimes perceived as a "character" or "glyph"). This
>>> term is used for cases such as an Indic syllable followed by a
>>> combining
>>> vowel--in which a base character is combined with additional characters
>>> to
>>> form a single glyph on screen, rather than cases in which separate
>>> visual/logical units form a single "word" or "sound". It also applies
>>> to
>>> cases such as a base letter followed by a combining accent.
>>>
>>> To help illustrate this, notice that the word "the" is not a grapheme
>>> cluster, although it is a single syllable. Notice too that "ソース"
>>> consists of *three* graphemes (grapheme clusters), but only two
>>> syllables.
>>>
>>> The relationship of Han ideographs to both "words" and "syllables" is
>>> complex and depends both on the language (it is different for Japanese,
>>> for example) and on context. It is sometimes true that "ideograph ==
>>> syllable" and sometimes also true that "ideograph == word".
>>>
>>> In any case, the concept of "grapheme cluster" should most definitely not
>>> be considered synonymous with either "word" or "syllable". It is a
>>> distinct unit and may not be *either* in a given context. My understanding
>>> was that languages written using Han ideographs could be broken anywhere
>>> except for certain prescriptive cases (which differ by language). While
>>> this might map to some other concept such as syllables, wouldn't it be
>>> better to refer specifically to language-specific rules?
>>>
>>> Unicode Standard Annex #14 [1] provides a useful description of
>>> line-breaking properties that may be helpful here.
>>>
>>> Regards,
>>>
>>> Addison
>>>
>>> [1] http://www.unicode.org/reports/tr14/
>>>
>>> Addison Phillips
>>> Globalization Architect (Lab126)
>>> Chair (W3C I18N, IETF IRI WGs)
>>>
>>> Internationalization is not a feature.
>>> It is an architecture.
>>>
>>>> -----Original Message-----
>>>> From: www-international-request@w3.org [mailto:www-international-
>>>> request@w3.org] On Behalf Of Kang-Hao (Kenny) Lu
>>>> Sent: Thursday, January 27, 2011 8:43 PM
>>>> To: Koji Ishii
>>>> Cc: WWW Style; WWW International
>>>> Subject: What's the definition of a word? (was: [css3-text] line break
>>>> opportunities are based on *syllable* boundaries?)
>>>>
>>>> > In Chinese, Yi, and Hangul, a character represents a syllable as
>>>> far as I understand, but in Japanese, Kanji characters could have more
>>>> than one syllable, and also there are cases where multiple characters
>>>> represent single syllable (like Kana + prolonged sound mark).
>>>> >
>>>> > Although this part is not normative, it looks like we should
>>>> replace "syllable" with "grapheme cluster".
>>>> >
>>>> > Please let me know if this change would be incorrect for any
>>>> writing system listed here other than Japanese.
>>>>
>>>> The situation is similar for Chinese as far as I can tell.
>>>>
>>>> Speaking about this, this is editorial but the last time I read the
>>>> spec, I got a little bit perplexed about the definition of "word".
>>>> Is
>>>> there a plan to briefly mention what a "word" is in the introduction
>>>> section? Or perhaps there should be a glossary that puts "word" and
>>>> "grapheme cluster" together? I doubt that there would be a consistent
>>>> and precise definition throughout the spec but a brief and non-
>>>> normative introduction seems helpful.
>>>>
>>>>
>>>> Cheers,
>>>> Kenny
>>>
>>>
>>
>>
>> --
>> Andrew Cunningham
>> Research and Development Coordinator
>> Vicnet
>> State Library of Victoria
>> Australia
>>
>> andrewc@vicnet.net.au
>>
>>
>>
>
>
>
> --
> cheers,
> -ambrose
>
> www.xanga.com/little_potato | twitter.com/little_potato
>


-- 
Andrew Cunningham
Research and Development Coordinator
Vicnet
State Library of Victoria
Australia

andrewc@vicnet.net.au
Received on Saturday, 29 January 2011 07:30:58 UTC