- From: Andrew Cunningham <andrewc@vicnet.net.au>
- Date: Sat, 29 Jan 2011 18:28:04 +1100
- To: "Ambrose LI" <ambrose.li@gmail.com>
- Cc: "Andrew Cunningham" <andrewc@vicnet.net.au>, "Koji Ishii" <kojiishi@gluesoft.co.jp>, "Phillips, Addison" <addison@lab126.com>, "Kang-Hao Lu" <kennyluck@w3.org>, "WWW Style" <www-style@w3.org>, "WWW International" <www-international@w3.org>
For me, character/grapheme cluster boundary breaking is a last-resort fallback option. The best option is language-specific word boundary identification, a less preferred option is language-specific syllable boundary identification, and the last resort is character/grapheme cluster boundary identification.

On Sat, January 29, 2011 16:49, Ambrose LI wrote:
> But isn't Koji's example showing exactly that you *don't* really want
> to arbitrarily break between syllable boundaries? His example has
> three syllables according to Japanese rules. Even by English rules
> it's two syllables, not one.
>
> 2011/1/28 Andrew Cunningham <andrewc@vicnet.net.au>:
>> Syllables and grapheme clusters are quite distinct and separate concepts.
>>
>> I'd argue that you do want syllable boundaries, rather than grapheme
>> cluster boundaries. But syllable boundaries are per-language constructs,
>> based on the phonological and orthographic properties of that language.
>>
>> In Unicode terms, by contrast, grapheme clusters have a more generic
>> definition.
>>
>> But I doubt that grapheme clusters will give you what you want.
>>
>> On Fri, January 28, 2011 17:32, Koji Ishii wrote:
>>> I'm changing back to the original subject as you seem to be talking
>>> about the original topic, not the definition of "word".
>>>
>>> What I needed here is an appropriate terminology that represents single
>>> character within this context:
>>>
>>>> In several other writing systems, (including Chinese, Japanese, Yi,
>>>> and sometimes also Korean) a line break opportunities are based on
>>>> *syllable* boundaries, not words.
>>>
>>> I want "ソース" consists of three, so from what you said, it sounds
>>> like "grapheme cluster" is the right choice of words to use here.
>>>
>>> I agree with you that the definition of "word" is different from
>>> grapheme cluster, and I guess answering to that question is even more
>>> difficult.
>>>
>>>
>>> Regards,
>>> Koji
>>>
>>> -----Original Message-----
>>> From: Phillips, Addison [mailto:addison@lab126.com]
>>> Sent: Friday, January 28, 2011 2:22 PM
>>> To: Kang-Hao (Kenny) Lu; Koji Ishii
>>> Cc: WWW Style; WWW International
>>> Subject: RE: What's the definition of a word? (was: [css3-text] line
>>> break opportunities are based on *syllable* boundaries?)
>>>
>>> The term "grapheme cluster" would be wrong for this context. A grapheme
>>> cluster is a sequence of logical characters that form a single visual
>>> unit of text (what is sometimes perceived as a "character" or "glyph").
>>> This term is used for cases such as an Indic syllable followed by a
>>> combining vowel--in which a base character is combined with additional
>>> characters to form a single glyph on screen, rather than cases in which
>>> separate visual/logical units form a single "word" or "sound". It also
>>> applies to cases such as a base letter followed by a combining accent.
>>>
>>> To help illustrate this, notice that the word "the" is not a grapheme
>>> cluster, although it is a single syllable. Notice too that "ソース"
>>> consists of *three* graphemes (grapheme clusters), but only two
>>> syllables.
>>>
>>> The relationship of Han ideographs to both "words" and "syllables" is
>>> complex and depends both on the language (it is different for Japanese,
>>> for example) and on context. It is sometimes true that "ideograph ==
>>> syllable" and sometimes also true that "ideograph == word".
>>>
>>> In any case, the concept of "grapheme cluster" should most definitely
>>> not be considered to be synonymous with either "word" or "syllable". It
>>> is a distinct unit and may not be *either* in a given context. My
>>> understanding was that languages written using Han ideographs could be
>>> broken anywhere except for certain prescriptive cases (which differ by
>>> language). While this might map to some other concept such as
>>> syllables, wouldn't it be better to refer specifically to
>>> language-specific rules?
>>>
>>> Unicode Standard Annex #14 [1] provides a useful description of
>>> line-breaking properties that may be helpful here.
>>>
>>> Regards,
>>>
>>> Addison
>>>
>>> [1] http://www.unicode.org/reports/tr14/
>>>
>>> Addison Phillips
>>> Globalization Architect (Lab126)
>>> Chair (W3C I18N, IETF IRI WGs)
>>>
>>> Internationalization is not a feature.
>>> It is an architecture.
>>>
>>>> -----Original Message-----
>>>> From: www-international-request@w3.org
>>>> [mailto:www-international-request@w3.org] On Behalf Of Kang-Hao (Kenny) Lu
>>>> Sent: Thursday, January 27, 2011 8:43 PM
>>>> To: Koji Ishii
>>>> Cc: WWW Style; WWW International
>>>> Subject: What's the definition of a word? (was: [css3-text] line break
>>>> opportunities are based on *syllable* boundaries?)
>>>>
>>>> > In Chinese, Yi, and Hangul, a character represents a syllable as
>>>> > far as I understand, but in Japanese, Kanji characters could have
>>>> > more than one syllable, and also there are cases where multiple
>>>> > characters represent single syllable (like Kana + prolonged sound
>>>> > mark).
>>>> >
>>>> > Although this part is not normative, it looks like we should
>>>> > replace "syllable" with "grapheme cluster".
>>>> >
>>>> > Please let me know if this change can be incorrect to any other
>>>> > writing systems listed here than Japanese.
>>>>
>>>> The situation is similar for Chinese as far as I can tell.
>>>>
>>>> Speaking about this, this is editorial but the last time I read the
>>>> spec, I got a little bit perplexed about the definition of "word". Is
>>>> there a plan to briefly mention what a "word" is in the introduction
>>>> section? Or perhaps there should be a glossary that puts "word" and
>>>> "grapheme cluster" together? I doubt that there would be a consistent
>>>> and precise definition throughout the spec but a brief and
>>>> non-normative introduction seems helpful.
>>>>
>>>>
>>>> Cheers,
>>>> Kenny
>>>
>>>
>>
>>
>> --
>> Andrew Cunningham
>> Research and Development Coordinator
>> Vicnet
>> State Library of Victoria
>> Australia
>>
>> andrewc@vicnet.net.au
>>
>>
>>
>
>
>
> --
> cheers,
> -ambrose
>
> www.xanga.com/little_potato | twitter.com/little_potato
>

--
Andrew Cunningham
Research and Development Coordinator
Vicnet
State Library of Victoria
Australia

andrewc@vicnet.net.au
Received on Saturday, 29 January 2011 07:28:42 UTC