Re: [css3-text] line break opportunities are based on *syllable* boundaries? from Ambrose LI on 2011-01-29 (www-international@w3.org from January to March 2011)

From: Ambrose LI <ambrose.li@gmail.com>
Date: Sat, 29 Jan 2011 13:48:29 -0500
To: CE Whitehead <cewcathar@hotmail.com>
Cc: kojiishi@gluesoft.co.jp, addison@lab126.com, kennyluck@w3.org, www-style@w3.org, www-international@w3.org
Message-ID: <AANLkTikhHHJzfnb_LkAnWCO9Wh0XPzoFRoTMqMoBjpUi@mail.gmail.com>
Hi,

It is true that we don’t use zero-width spaces at all (most probably
for historical reasons, but probably also because they can’t be
typed); but I doubt that even “lexical resources” will work,
especially in the case of Chinese. (It possibly *might* work in
Japanese, with the help of a good parser; but still I’d say it will
never work in all cases.)
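
For illustration, here is a rough Python sketch of what an explicit
delimiter would buy us: if U+200B were actually present in the text,
finding the break points would need no dictionary at all. The function
name break_opportunities and the sample string are made up for this
example, and the zero-width space in the Thai sample ("ภาษา" + "ไทย")
was inserted by hand, which is exactly what nobody does in practice.

```python
ZWSP = "\u200b"  # ZERO WIDTH SPACE (U+200B)


def break_opportunities(text):
    """Return break positions, counted in visible (non-ZWSP) code points.

    Hypothetical helper: a position n means "a line may break after the
    n-th visible code point". Only explicit ZWSP delimiters are honoured;
    no lexical resource is consulted.
    """
    positions = []
    visible = 0
    for ch in text:
        if ch == ZWSP:
            positions.append(visible)
        else:
            visible += 1
    return positions


# "ภาษา" (language) + ZWSP + "ไทย" (Thai): one explicit break point.
sample = "ภาษา" + ZWSP + "ไทย"
print(break_opportunities(sample))  # [4]
```

Without the delimiter the function returns an empty list, which is the
situation real-world text is in.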

IMHO, the only thing CSS can realistically do is to provide a way for
authors to specify “this group of characters is a word / personal
name; don’t break it!” As far as I can tell, there’s no such mechanism
in place.
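
To make the idea concrete, here is a minimal Python sketch of how such
author markup could be consumed. Everything in it is an assumption made
for illustration: the function name allowed_breaks, the representation
of "don't break" groups as (start, end) index pairs, and the
simplification that every other character boundary is a legal break
(real rules, such as the prohibitions around punctuation, are ignored).
The personal name in the sample sentence is made up.

```python
def allowed_breaks(text, protected):
    """Return positions i where a line may break before text[i].

    `protected` is a list of (start, end) index pairs, end exclusive,
    marking groups of characters the author wants kept together.
    Breaking at the edges of a group is fine; breaking inside is not.
    """
    blocked = set()
    for start, end in protected:
        blocked.update(range(start + 1, end))
    return [i for i in range(1, len(text)) if i not in blocked]


sentence = "作者是王小明先生"
name_span = (3, 6)  # "王小明", a made-up personal name
print(allowed_breaks(sentence, [name_span]))  # [1, 2, 3, 6, 7]
```

The point is only that the information has to come from the author; no
parser is involved anywhere in the sketch.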

(And you are right that even in English you can’t really hyphenate
between any two syllables; from what I’ve read neither can you do so
in French.)

2011/1/29 CE Whitehead <cewcathar@hotmail.com>:
> Hi, regarding the text:
>
> "For most scripts, in the absence of hyphenation a line break occurs only at
> word boundaries. Many writing systems use spaces or punctuation to
> explicitly separate words, and line break opportunities can be identified by
> these characters. Scripts such as Thai, Lao, and Khmer, however, do not use
> spaces or punctuation to separate words. Although the zero width space
> (U+200B) can be used as an explicit word delimiter in these scripts, this
> practice is not common. As a result, a lexical resource is needed to
> correctly identify break points in such texts. "
>
> I believe it is correct to say "word" here (not "syllable"), but don't know
> what to do about languages that do not use word delimiters, and can provide
> no references for Korean, Japanese, or Chinese (though yes a lexical
> resource seems best).
>
> You might qualify your text by saying "in most languages a break occurs at
> word boundaries."
>
> Perhaps this is off-topic -- but if resources for Arabic are of use to you
> -- I do believe that the Arabic language does not normally permit word
> hyphenation -- that is you have to break between words normally.
> (Correct me if I am wrong; also there may be some instances where you can
> hyphenate in Arabic script;
> http://omega.enstb.org/yannis/pdf/marrakech.pdf who says Arabic script
> "namely Uighur" permits hyphenation but not the Arabic language;
> I also googled and got
> http://www.tug.org/TUGboat/Articles/tb27-2/tb87benatia.pdf which may be a
> suitable reference for Arabic; this text does show instances where words are
> hyphenated -- the rest of the hyphenated word may be placed in the margin;
> since I do some strange things in English handwriting this looks o.k. to me
> but I am not an expert on Arabic hyphenation; I have not seen hyphenation in
> Arabic elsewhere except in this reference.)
>
> (What I know about English which is not in question:  syllables can be
> sometimes separated with hyphens in English; however not all syllables
> can be separated with hyphens [traditionally before computers typesetters
> were not supposed to break between -ing or -ed to my knowledge; the
> newspapers violated this rule however but they broke words at any character
> and also misspelled in those days]).
>
> One other note; the text that follows the quoted text has a typo:
>
> "In several other writing systems, (including Chinese, Japanese, Yi, and
> sometimes also Korean) a line break opportunities are based on syllable
> boundaries, not words"
>
> =>
> "a line break opportunity" is based  {COMMENT: optionally, "a line break
> opportunity occurs at a syllable boundary rather than at a word boundary." }
>
> or
>
> =>
>
> "line break opportunities are based"
>
>
> { "based on" is o.k. here; I just am more used to "occur;" "line breaks are
> based on" conveys the sense that you are using these boundaries to make a
> rule; "line breaks occur at" conveys the sense that this is the rule; so
> perhaps you want to keep with "based on." }
>
>
> Best,
>
> --C. E. Whitehead
> cewcathar@hotmail.com
>
>
>> From: kojiishi@gluesoft.co.jp
>> To: addison@lab126.com; kennyluck@w3.org
>> CC: www-style@w3.org; www-international@w3.org
>> Date: Fri, 28 Jan 2011 01:55:07 -0500
>> Subject: RE: [css3-text] line break opportunities are based on *syllable*
>> boundaries?
>>
>> I was also thinking of replacing "not words" with something else given
>> Kenny's feedback, but as I re-read the 5. Line Breaking and Word Boundaries
>> section[1] from the first paragraph, it looks like the "word" is pretty well
>> defined within this context.
>>
>> > For most scripts, in the absence of hyphenation
>> > a line break occurs only at word boundaries.
>> > Many writing systems use spaces or punctuation
>> > to explicitly separate words, and line break
>> > opportunities can be identified by these characters.
>> > Scripts such as Thai, Lao, and Khmer, however, do
>> > not use spaces or punctuation to separate words.
>> > Although the zero width space (U+200B) can be used
>> > as an explicit word delimiter in these scripts, this
>> > practice is not common. As a result, a lexical resource
>> > is needed to correctly identify break points in such texts.
>> >
>> > In several other writing systems, (including Chinese,
>> > Japanese, Yi, and sometimes also Korean) a line break
>> > opportunities are based on syllable boundaries, not words.
>>
>> So just changing "syllable" to "grapheme cluster" looks good enough to me.
>>
>> What Kenny is asking is probably a generic definition of the "word"
>> regardless of the context. I haven't come up with a good idea; I'd be happy
>> to hear of any good way to do so. But at least for this portion of the text,
>> I think the "word" is well defined.
>>
>> [1] http://dev.w3.org/csswg/css3-text/#line-breaking
>>
>>
>> Regards,
>> Koji
>>
>> -----Original Message-----
>> From: www-international-request@w3.org
>> [mailto:www-international-request@w3.org] On Behalf Of Koji Ishii
>> Sent: Friday, January 28, 2011 3:33 PM
>> To: Phillips, Addison; Kang-Hao (Kenny) Lu
>> Cc: WWW Style; WWW International
>> Subject: RE: [css3-text] line break opportunities are based on *syllable*
>> boundaries?
>>
>> I'm changing back to the original subject as you seem to be talking about
>> the original topic, not the definition of "word".
>>
>> What I needed here is an appropriate term that represents a single
>> character within this context:
>>
>> > In several other writing systems, (including Chinese, Japanese, Yi,
>> > and sometimes also Korean) a line break opportunities are based on
>> > *syllable* boundaries, not words.
>>
>> I want "ソース" to count as three units, so from what you said, it sounds like
>> "grapheme cluster" is the right choice of words to use here.
>>
>> I agree with you that the definition of "word" is different from grapheme
>> cluster, and I guess answering that question is even more difficult.
>>
>>
>> Regards,
>> Koji
>>
>> -----Original Message-----
>> From: Phillips, Addison [mailto:addison@lab126.com]
>> Sent: Friday, January 28, 2011 2:22 PM
>> To: Kang-Hao (Kenny) Lu; Koji Ishii
>> Cc: WWW Style; WWW International
>> Subject: RE: What's the definition of a word? (was: [css3-text] line break
>> opportunities are based on *syllable* boundaries?)
>>
>> The term "grapheme cluster" would be wrong for this context. A grapheme
>> cluster is a sequence of logical characters that form a single visual unit
>> of text (what is sometimes perceived as a "character" or "glyph"). This term
>> is used for cases such as an Indic syllable followed by a combining
>> vowel--in which a base character is combined with additional characters to
>> form a single glyph on screen, rather than cases in which separate
>> visual/logical units form a single "word" or "sound". It also applies to
>> cases such as a base letter followed by a combining accent.
>>
>> To help illustrate this, notice that the word "the" is not a grapheme
>> cluster, although it is a single syllable. Notice too that "ソース" consists of
>> *three* graphemes (grapheme clusters), but only two syllables.
>>
>> The relationship of Han ideographs to both "words" and "syllables" is
>> complex and depends both on the language (it is different for Japanese, for
>> example) and on context. It is sometimes true that "ideograph == syllable"
>> and sometimes also true that "ideograph == word".
>>
>> In any case, the concept of "grapheme cluster" should most definitely not
>> be considered to be synonymous with either "word" or "syllable". It is a
>> distinct unit and may not be *either* in a given context. My understanding was
>> that languages written using Han ideographs could be broken anywhere except
>> for certain prescriptive cases (which differ by language). While this might
>> map to some other concept such as syllables, wouldn't it be better to refer
>> specifically to language specific rules?
>>
>> Unicode Standard Annex #14 [1] provides a useful description of
>> line-breaking properties that may be helpful here.
>>
>> Regards,
>>
>> Addison
>>
>> [1] http://www.unicode.org/reports/tr14/
>>
>> Addison Phillips
>> Globalization Architect (Lab126)
>> Chair (W3C I18N, IETF IRI WGs)
>>
>> Internationalization is not a feature.
>> It is an architecture.
>>
>> > -----Original Message-----
>> > From: www-international-request@w3.org
>> > [mailto:www-international-request@w3.org] On Behalf Of Kang-Hao (Kenny) Lu
>> > Sent: Thursday, January 27, 2011 8:43 PM
>> > To: Koji Ishii
>> > Cc: WWW Style; WWW International
>> > Subject: What's the definition of a word? (was: [css3-text] line break
>> > opportunities are based on *syllable* boundaries?)
>> >
>> > > In Chinese, Yi, and Hangul, a character represents a syllable as
>> > > far as I understand, but in Japanese, Kanji characters could have more
>> > > than one syllable, and also there are cases where multiple characters
>> > > represent a single syllable (like Kana + prolonged sound mark).
>> > >
>> > > Although this part is not normative, it looks like we should
>> > > replace "syllable" with "grapheme cluster".
>> > >
>> > > Please let me know if this change would be incorrect for any
>> > > writing system listed here other than Japanese.
>> >
>> > The situation is similar for Chinese as far as I can tell.
>> >
>> > Speaking about this, this is editorial but the last time I read the
>> > spec, I got a little bit perplexed about the definition of "word". Is
>> > there a plan to briefly mention what a "word" is in the introduction
>> > section? Or perhaps there should be a glossary that puts "word" and
>> > "grapheme cluster" together? I doubt that there would be a consistent
>> > and precise definition throughout the spec but a brief and non-
>> > normative introduction seems helpful.
>> >
>> >
>> > Cheers,
>> > Kenny
>>
>



-- 
cheers,
-ambrose
Received on Saturday, 29 January 2011 18:50:06 UTC