RE: [css3-text] line break opportunities are based on *syllable* boundaries? from CE Whitehead on 2011-01-29 (www-international@w3.org from January to March 2011)

From: CE Whitehead <cewcathar@hotmail.com>
Date: Sat, 29 Jan 2011 13:20:55 -0500
To: <kojiishi@gluesoft.co.jp>, <addison@lab126.com>, <kennyluck@w3.org>
CC: <www-style@w3.org>, <www-international@w3.org>
Message-ID: <SNT142-w12BBBA71C0E4E3506AF46AB3E00@phx.gbl>
Hi, regarding the text:
 
"For most scripts, in the absence of hyphenation a line break occurs only at word boundaries. Many writing systems use spaces or punctuation to explicitly separate words, and line break opportunities can be identified by these characters. Scripts such as Thai, Lao, and Khmer, however, do not use spaces or punctuation to separate words. Although the zero width space (U+200B) can be used as an explicit word delimiter in these scripts, this practice is not common. As a result, a lexical resource is needed to correctly identify break points in such texts. "

I believe it is correct to say "word" here (not "syllable"), but I don't know what to do about languages that do not use word delimiters, and I can provide no references for Korean, Japanese, or Chinese (though yes, a lexical resource seems best).
 
You might qualify your text by saying "in most languages a break occurs at word boundaries."    
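
As an illustration only (not taken from the spec), here is a minimal Python sketch of the delimiter-based case the quoted paragraph describes. It assumes that the only break opportunities are the positions right after a space or a zero width space (U+200B); the function name break_opportunities is hypothetical, and real line breaking follows much fuller rules plus a lexical resource for scripts such as Thai.

```python
ZWSP = "\u200b"  # ZERO WIDTH SPACE, usable as an explicit word delimiter


def break_opportunities(text: str) -> list[int]:
    """Return the indexes at which a line could be broken.

    Simplifying assumption: a break is allowed only right after a
    space or a zero width space. No dictionary lookup is attempted.
    """
    return [i + 1 for i, ch in enumerate(text) if ch in (" ", ZWSP)]


print(break_opportunities("line break"))         # space-delimited text
print(break_opportunities("ภาษาไทย"))            # Thai with no delimiters
print(break_opportunities("ภาษา\u200bไทย"))      # Thai with an explicit ZWSP
```

The undelimited Thai string yields no break opportunities at all under this assumption, which is the gap a lexical resource is meant to fill.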
 
Perhaps this is off-topic, but in case resources for Arabic are of use to you: I believe that the Arabic language does not normally permit word hyphenation; that is, you normally have to break between words.
(Correct me if I am wrong; there may also be some instances where you can hyphenate in Arabic script.
See http://omega.enstb.org/yannis/pdf/marrakech.pdf which says that Arabic script ("namely Uighur") permits hyphenation but that the Arabic language does not.
I also googled and found http://www.tug.org/TUGboat/Articles/tb27-2/tb87benatia.pdf which may be a suitable reference for Arabic. This text does show instances where words are hyphenated; the rest of the hyphenated word may be placed in the margin. Since I do some strange things in English handwriting this looks o.k. to me, but I am not an expert on Arabic hyphenation, and I have not seen hyphenation in Arabic anywhere except in this reference.)
 
(What I know about English, which is not in question: syllables can sometimes be separated with hyphens in English; however, not all syllables can be separated this way [traditionally, before computers, typesetters were not supposed to break at -ing or -ed, to my knowledge; the newspapers violated this rule, however, but they broke words at any character and also misspelled in those days]).
 
One other note; the text that follows the quoted text has a typo:
 
"In several other writing systems, (including Chinese, Japanese, Yi, and sometimes also Korean) a line break opportunities are based on syllable boundaries, not words"
 
=>
"a line break opportunity" is based  {COMMENT: optionally, "a line break opportunity occurs at a syllable boundary rather than at a word boundary." }
 
or 
 
=>
 
"line break opportunities are based"
 
 
{ "based on" is o.k. here; I am just more used to "occur." "Line breaks are based on" conveys the sense that you are using these boundaries to make a rule; "line breaks occur at" conveys the sense that this is the rule; so perhaps you want to stay with "based on." }
 
 
Best,
 
--C. E. Whitehead
cewcathar@hotmail.com 
 


> From: kojiishi@gluesoft.co.jp
> To: addison@lab126.com; kennyluck@w3.org
> CC: www-style@w3.org; www-international@w3.org
> Date: Fri, 28 Jan 2011 01:55:07 -0500
> Subject: RE: [css3-text] line break opportunities are based on *syllable* boundaries?
> 
> I was also thinking to replace "not words" with something else given Kenny's feedback, but as I re-read the 5. Line Breaking and Word Boundaries section[1] from the first paragraph, it looks like the "word" is pretty well defined within this context.
> 
> > For most scripts, in the absence of hyphenation
> > a line break occurs only at word boundaries.
> > Many writing systems use spaces or punctuation
> > to explicitly separate words, and line break
> > opportunities can be identified by these characters.
> > Scripts such as Thai, Lao, and Khmer, however, do
> > not use spaces or punctuation to separate words.
> > Although the zero width space (U+200B) can be used
> > as an explicit word delimiter in these scripts, this
> > practice is not common. As a result, a lexical resource
> > is needed to correctly identify break points in such texts. 
> >
> > In several other writing systems, (including Chinese,
> > Japanese, Yi, and sometimes also Korean) a line break
> > opportunities are based on syllable boundaries, not words.
> 
> So just changing "syllable" to "grapheme cluster" looks good enough to me.
> 
> What Kenny is asking for is probably a generic definition of "word" regardless of context. I haven't come up with a good idea, and I'd be happy to hear of any good way to do so. But at least for this portion of the text, I think "word" is well defined.
> 
> [1] http://dev.w3.org/csswg/css3-text/#line-breaking
> 
> 
> Regards,
> Koji
> 
> -----Original Message-----
> From: www-international-request@w3.org [mailto:www-international-request@w3.org] On Behalf Of Koji Ishii
> Sent: Friday, January 28, 2011 3:33 PM
> To: Phillips, Addison; Kang-Hao (Kenny) Lu
> Cc: WWW Style; WWW International
> Subject: RE: [css3-text] line break opportunities are based on *syllable* boundaries?
> 
> I'm changing back to the original subject as you seem to be talking about the original topic, not the definition of "word".
> 
> What I needed here is an appropriate term for a single character within this context:
> 
> > In several other writing systems, (including Chinese, Japanese, Yi, 
> > and sometimes also Korean) a line break opportunities are based on
> > *syllable* boundaries, not words.
> 
> I want "ソース" to count as three, so from what you said, it sounds like "grapheme cluster" is the right choice of words to use here.
> 
> I agree with you that the definition of "word" is different from that of a grapheme cluster, and I guess answering that question is even more difficult.
> 
> 
> Regards,
> Koji
> 
> -----Original Message-----
> From: Phillips, Addison [mailto:addison@lab126.com]
> Sent: Friday, January 28, 2011 2:22 PM
> To: Kang-Hao (Kenny) Lu; Koji Ishii
> Cc: WWW Style; WWW International
> Subject: RE: What's the definition of a word? (was: [css3-text] line break opportunities are based on *syllable* boundaries?)
> 
> The term "grapheme cluster" would be wrong for this context. A grapheme cluster is a sequence of logical characters that form a single visual unit of text (what is sometimes perceived as a "character" or "glyph"). This term is used for cases such as an Indic syllable followed by a combining vowel--in which a base character is combined with additional characters to form a single glyph on screen, rather than cases in which separate visual/logical units form a single "word" or "sound". It also applies to cases such as a base letter followed by a combining accent.
> 
> To help illustrate this, notice that the word "the" is not a grapheme cluster, although it is a single syllable. Notice too that "ソース" consists of *three* graphemes (grapheme clusters), but only two syllables.
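
Addison's point just above can be checked with a rough Python sketch. It is an approximation only: it assumes a cluster is a base character plus any following characters whose Unicode general category is a mark (Mn, Mc, Me), which is far simpler than the full grapheme cluster rules in Unicode Standard Annex #29. The function name simple_clusters is hypothetical.

```python
import unicodedata


def simple_clusters(text: str) -> list[str]:
    """Group each base character with the mark characters that follow it.

    Approximation only: a character joins the previous cluster when its
    general category starts with "M" (Mn, Mc, Me). The full rules
    (Hangul jamo, emoji sequences, and so on) are not handled.
    """
    clusters: list[str] = []
    for ch in text:
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch   # attach the mark to its base character
        else:
            clusters.append(ch)  # start a new cluster
    return clusters


print(len(simple_clusters("ソース")))        # three clusters, though two syllables
print(len(simple_clusters("the")))           # three clusters, though one syllable
print(len(simple_clusters("e\u0301")))       # base letter + combining accent: one
print(len(simple_clusters("\u0915\u093f")))  # Devanagari KA + vowel sign I: one
```

Under this approximation the prolonged sound mark (U+30FC) is a letter, not a mark, so it stands as its own cluster; that is why the cluster count and the syllable count for "ソース" differ.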
> 
> The relationship of Han ideographs to both "words" and "syllables" is complex and depends both on the language (it is different for Japanese, for example) and on context. It is sometimes true that "ideograph == syllable" and sometimes also true that "ideograph == word".
> 
> In any case, the concept of "grapheme cluster" should most definitely not be considered to be synonymous with either "word" or "syllable". It is a distinct unit and may not be *either* in a given context. My understanding was that languages written using Han ideographs could be broken anywhere except for certain prescriptive cases (which differ by language). While this might map to some other concept such as syllables, wouldn't it be better to refer specifically to language-specific rules?
> 
> Unicode Standard Annex #14 [1] provides a useful description of line-breaking properties that may be helpful here.
> 
> Regards,
> 
> Addison
> 
> [1] http://www.unicode.org/reports/tr14/
> 
> Addison Phillips
> Globalization Architect (Lab126)
> Chair (W3C I18N, IETF IRI WGs)
> 
> Internationalization is not a feature.
> It is an architecture.
> 
> > -----Original Message-----
> > From: www-international-request@w3.org [mailto:www-international-request@w3.org] On Behalf Of Kang-Hao (Kenny) Lu
> > Sent: Thursday, January 27, 2011 8:43 PM
> > To: Koji Ishii
> > Cc: WWW Style; WWW International
> > Subject: What's the definition of a word? (was: [css3-text] line break 
> > opportunities are based on *syllable* boundaries?)
> > 
> > > In Chinese, Yi, and Hangul, a character represents a syllable as
> > > far as I understand, but in Japanese, Kanji characters could have more
> > > than one syllable, and also there are cases where multiple characters
> > > represent a single syllable (like Kana + prolonged sound mark).
> > >
> > > Although this part is not normative, it looks like we should
> > > replace "syllable" with "grapheme cluster".
> > >
> > > Please let me know if this change would be incorrect for any
> > > writing system listed here other than Japanese.
> > 
> > The situation is similar for Chinese as far as I can tell.
> > 
> > Speaking about this, this is editorial but the last time I read the
> > spec, I got a little bit perplexed about the definition of "word". Is
> > there a plan to briefly mention what a "word" is in the introduction
> > section? Or perhaps there should be a glossary that puts "word" and
> > "grapheme cluster" together? I doubt that there would be a consistent
> > and precise definition throughout the spec but a brief and
> > non-normative introduction seems helpful.
> > 
> > 
> > Cheers,
> > Kenny
>
Received on Saturday, 29 January 2011 18:21:29 UTC