Re: [css3-text] line break opportunities are based on *syllable* boundaries? from Ambrose LI on 2011-01-29 (www-style@w3.org from January 2011)

From: Ambrose LI <ambrose.li@gmail.com>
Date: Sat, 29 Jan 2011 00:49:38 -0500
To: Andrew Cunningham <andrewc@vicnet.net.au>
Cc: Koji Ishii <kojiishi@gluesoft.co.jp>, "Phillips, Addison" <addison@lab126.com>, Kang-Hao Lu <kennyluck@w3.org>, WWW Style <www-style@w3.org>, WWW International <www-international@w3.org>
Message-ID: <AANLkTi=EKca9ywAN-ZXh8px+CCh3kVfNf-0bfY+B-5r=@mail.gmail.com>
But isn't Koji's example showing exactly that you *don't* really want
to arbitrarily break at syllable boundaries? His example has
three syllables according to Japanese rules. Even by English rules
it's two syllables, not one.
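
To make the counts concrete, here is a minimal Python sketch. It assumes
Koji's example is the katakana word ソース, and it uses a crude stand-in
for grapheme cluster counting (skip code points with a nonzero canonical
combining class) rather than the full segmentation rules of Unicode
Standard Annex #29. The function name is made up for illustration.

```python
import unicodedata


def approx_grapheme_clusters(text: str) -> int:
    """Rough grapheme cluster count (hypothetical helper).

    Counts code points whose canonical combining class is 0, so a base
    letter plus combining accents counts once. This is an approximation,
    not the UAX #29 algorithm: it ignores Hangul jamo sequences, emoji
    sequences, spacing marks, and other cases.
    """
    return sum(1 for ch in text if unicodedata.combining(ch) == 0)


# Assumed example word: KATAKANA SO, PROLONGED SOUND MARK, KATAKANA SU
word = "\u30bd\u30fc\u30b9"

print(len(word))                       # 3 code points
print(approx_grapheme_clusters(word))  # 3 clusters under this approximation

# A base letter followed by a combining accent: two code points, one cluster.
accented = "e\u0301"
print(len(accented))                       # 2
print(approx_grapheme_clusters(accented))  # 1
```

Neither number is a syllable count. The three units line up with the
three-syllable reading by Japanese rules only because every kana in this
word, including the prolonged sound mark, happens to be its own code point.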

2011/1/28 Andrew Cunningham <andrewc@vicnet.net.au>:
> syllable and grapheme clusters are quite distinct and separate concepts.
>
> I'd argue that you do want syllable boundaries, rather than grapheme
> cluster boundaries. But syllable boundaries are per language constructs,
> based on the phonological and orthographic properties of that language.
>
> While in Unicode terms, grapheme clusters have a more generic definition.
>
> But I doubt that grapheme clusters will give you what you want.
>
> On Fri, January 28, 2011 17:32, Koji Ishii wrote:
>> I'm changing back to the original subject as you seem to be talking about
>> the original topic, not the definition of "word".
>>
>> What I needed here is an appropriate terminology that represents single
>> character within this context:
>>
>>> In several other writing systems, (including Chinese, Japanese, Yi,
>>> and sometimes also Korean) a line break opportunities are based on
>>> *syllable* boundaries, not words.
>>
>> I want "ソース" to consist of three, so from what you said, it sounds
>> like "grapheme cluster" is the right choice of words to use here.
>>
>> I agree with you that the definition of "word" is different from grapheme
>> cluster, and I guess answering that question is even more difficult.
>>
>>
>> Regards,
>> Koji
>>
>> -----Original Message-----
>> From: Phillips, Addison [mailto:addison@lab126.com]
>> Sent: Friday, January 28, 2011 2:22 PM
>> To: Kang-Hao (Kenny) Lu; Koji Ishii
>> Cc: WWW Style; WWW International
>> Subject: RE: What's the definition of a word? (was: [css3-text] line break
>> opportunities are based on *syllable* boundaries?)
>>
>> The term "grapheme cluster" would be wrong for this context. A grapheme
>> cluster is a sequence of logical characters that form a single visual unit
>> of text (what is sometimes perceived as a "character" or "glyph"). This
>> term is used for cases such as an Indic syllable followed by a combining
>> vowel--in which a base character is combined with additional characters to
>> form a single glyph on screen, rather than cases in which separate
>> visual/logical units form a single "word" or "sound". It also applies to
>> cases such as a base letter followed by a combining accent.
>>
>> To help illustrate this, notice that the word "the" is not a grapheme
>> cluster, although it is a single syllable. Notice too that "ソース"
>> consists of *three* graphemes (grapheme clusters), but only two
>> syllables.
>>
>> The relationship of Han ideographs to both "words" and "syllables" is
>> complex and depends both on the language (it is different for Japanese,
>> for example) and on context. It is sometimes true that "ideograph ==
>> syllable" and sometimes also true that "ideograph == word".
>>
>> In any case, the concept of "grapheme cluster" should most definitely not
>> be considered synonymous with either "word" or "syllable". It is a
>> distinct unit and may not be *either* in a given context. My understanding
>> was that languages written using Han ideographs could be broken anywhere
>> except for certain prescriptive cases (which differ by language). While
>> this might map to some other concept such as syllables, wouldn't it be
>> better to refer specifically to language specific rules?
>>
>> Unicode Standard Annex #14 [1] provides a useful description of
>> line-breaking properties that may be helpful here.
>>
>> Regards,
>>
>> Addison
>>
>> [1] http://www.unicode.org/reports/tr14/
>>
>> Addison Phillips
>> Globalization Architect (Lab126)
>> Chair (W3C I18N, IETF IRI WGs)
>>
>> Internationalization is not a feature.
>> It is an architecture.
>>
>>> -----Original Message-----
>>> From: www-international-request@w3.org
>>> [mailto:www-international-request@w3.org] On Behalf Of Kang-Hao (Kenny) Lu
>>> Sent: Thursday, January 27, 2011 8:43 PM
>>> To: Koji Ishii
>>> Cc: WWW Style; WWW International
>>> Subject: What's the definition of a word? (was: [css3-text] line break
>>> opportunities are based on *syllable* boundaries?)
>>>
>>> > In Chinese, Yi, and Hangul, a character represents a syllable as
>>> > far as I understand, but in Japanese, Kanji characters could have more
>>> > than one syllable, and also there are cases where multiple characters
>>> > represent single syllable (like Kana + prolonged sound mark).
>>> >
>>> > Although this part is not normative, it looks like we should
>>> > replace "syllable" with "grapheme cluster".
>>> >
>>> > Please let me know if this change can be incorrect to any other
>>> > writing systems listed here than Japanese.
>>>
>>> The situation is similar for Chinese as far as I can tell.
>>>
>>> Speaking about this, this is editorial but the last time I read the
>>> spec, I got a little bit perplexed about the definition of "word". Is
>>> there a plan to briefly mention what a "word" is in the introduction
>>> section? Or perhaps there should be a glossary that puts "word" and
>>> "grapheme cluster" together? I doubt that there would be a consistent
>>> and precise definition throughout the spec but a brief and
>>> non-normative introduction seems helpful.
>>>
>>>
>>> Cheers,
>>> Kenny
>>
>>
>
>
> --
> Andrew Cunningham
> Research and Development Coordinator
> Vicnet
> State Library of Victoria
> Australia
>
> andrewc@vicnet.net.au
>
>
>



-- 
cheers,
-ambrose

www.xanga.com/little_potato | twitter.com/little_potato
Received on Saturday, 29 January 2011 05:50:13 UTC