- From: Ambrose LI <ambrose.li@gmail.com>
- Date: Sat, 29 Jan 2011 13:48:29 -0500
- To: CE Whitehead <cewcathar@hotmail.com>
- Cc: kojiishi@gluesoft.co.jp, addison@lab126.com, kennyluck@w3.org, www-style@w3.org, www-international@w3.org
Hi,

It is true that we don’t use zero-width spaces, at all (most probably because of historical reasons, but also, probably, because they can’t be typed); but I doubt that even “lexical resources” will work, especially in the case of Chinese. (It possibly *might* work in Japanese, with the help of a good parser; but still I’d say it will never work in all cases.)

IMHO, the only thing CSS can realistically do is to provide a way for authors to specify “this group of characters is a word / personal name; don’t break it!” As far as I can tell, there’s no such mechanism in place.

(And you are right that even in English you can’t really hyphenate between any two syllables; from what I’ve read neither can you do so in French.)

2011/1/29 CE Whitehead <cewcathar@hotmail.com>:
> Hi, regarding the text:
>
> "For most scripts, in the absence of hyphenation a line break occurs only at
> word boundaries. Many writing systems use spaces or punctuation to
> explicitly separate words, and line break opportunities can be identified by
> these characters. Scripts such as Thai, Lao, and Khmer, however, do not use
> spaces or punctuation to separate words. Although the zero width space
> (U+200B) can be used as an explicit word delimiter in these scripts, this
> practice is not common. As a result, a lexical resource is needed to
> correctly identify break points in such texts. "
>
> I believe it is correct to say "word" here (not "syllable"), but don't know
> what to do about languages that do not use word delimiters, and can provide
> no references for Korean, Japanese, or Chinese (though yes a lexical
> resource seems best).
>
> You might qualify your text by saying "in most languages a break occurs at
> word boundaries."
>
> Perhaps this is off-topic -- but if resources for Arabic are of use to you
> -- I do believe that the Arabic language does not normally permit word
> hyphenation -- that is you have to break between words normally.
> (Correct me if I am wrong; also there may be some instances where you can
> hyphenate in Arabic script;
> http://omega.enstb.org/yannis/pdf/marrakech.pdf who says Arabic script
> "namely Uighur" permits hyphenation but not the Arabic language;
> I also googled and got
> http://www.tug.org/TUGboat/Articles/tb27-2/tb87benatia.pdf which may be a
> suitable reference for Arabic; this text does show instances where words are
> hyphenated -- the rest of the hyphenated word may be placed in the margin;
> since I do some strange things in English handwriting this looks o.k. to me
> but I am not an expert on Arabic hyphenation; I have not seen hyphenation in
> Arabic elsewhere except in this reference.)
>
> (What I know about English which is not in question: syllables can be
> sometimes separated with hyphens in English; however not all syllables
> can be separated with hyphens [traditionally before computers typesetters
> were not supposed to break between -ing or -ed to my knowledge; the
> newspapers violated this rule however but they broke words at any character
> and also misspelled in those days]).
>
> One other note; the text that follows the quoted text has a typo:
>
> "In several other writing systems, (including Chinese, Japanese, Yi, and
> sometimes also Korean) a line break opportunities are based on syllable
> boundaries, not words"
>
> =>
> "a line break opportunity" is based {COMMENT: optionally, "a line break
> opportunity occurs at a syllable boundary rather than at a word boundary." }
>
> or
>
> =>
>
> "line break opportunities are based"
>
>
> { "based on" is o.k. here; I just am more used to "occur;" "line breaks are
> based on" conveys the sense that you are using these boundaries to make a
> rule; "line breaks occur at" conveys the sense that this is the rule; so
> perhaps you want to keep with "based on." }
>
>
> Best,
>
> --C. E. Whitehead
> cewcathar@hotmail.com
>
>
>> From: kojiishi@gluesoft.co.jp
>> To: addison@lab126.com; kennyluck@w3.org
>> CC: www-style@w3.org; www-international@w3.org
>> Date: Fri, 28 Jan 2011 01:55:07 -0500
>> Subject: RE: [css3-text] line break opportunities are based on *syllable*
>> boundaries?
>>
>> I was also thinking to replace "not words" with something else given
>> Kenny's feedback, but as I re-read the 5. Line Breaking and Word Boundaries
>> section[1] from the first paragraph, it looks like the "word" is pretty well
>> defined within this context.
>>
>> > For most scripts, in the absence of hyphenation
>> > a line break occurs only at word boundaries.
>> > Many writing systems use spaces or punctuation
>> > to explicitly separate words, and line break
>> > opportunities can be identified by these characters.
>> > Scripts such as Thai, Lao, and Khmer, however, do
>> > not use spaces or punctuation to separate words.
>> > Although the zero width space (U+200B) can be used
>> > as an explicit word delimiter in these scripts, this
>> > practice is not common. As a result, a lexical resource
>> > is needed to correctly identify break points in such texts.
>> >
>> > In several other writing systems, (including Chinese,
>> > Japanese, Yi, and sometimes also Korean) a line break
>> > opportunities are based on syllable boundaries, not words.
>>
>> So just changing "syllable" to "grapheme cluster" looks good enough to me.
>>
>> What Kenny is asking is probably a generic definition of the "word"
>> regardless of the context. I haven't come up with a good idea, I'd be happy
>> to hear if any good way to do so. But at least for this portion of the text,
>> I think the "word" is well defined.
>>
>> [1] http://dev.w3.org/csswg/css3-text/#line-breaking
>>
>>
>> Regards,
>> Koji
>>
>> -----Original Message-----
>> From: www-international-request@w3.org
>> [mailto:www-international-request@w3.org] On Behalf Of Koji Ishii
>> Sent: Friday, January 28, 2011 3:33 PM
>> To: Phillips, Addison; Kang-Hao (Kenny) Lu
>> Cc: WWW Style; WWW International
>> Subject: RE: [css3-text] line break opportunities are based on *syllable*
>> boundaries?
>>
>> I'm changing back to the original subject as you seem to be talking about
>> the original topic, not the definition of "word".
>>
>> What I needed here is an appropriate terminology that represents single
>> character within this context:
>>
>> > In several other writing systems, (including Chinese, Japanese, Yi,
>> > and sometimes also Korean) a line break opportunities are based on
>> > *syllable* boundaries, not words.
>>
>> I want "ソース" to consist of three, so from what you said, it sounds like
>> "grapheme cluster" is the right choice of words to use here.
>>
>> I agree with you that the definition of "word" is different from grapheme
>> cluster, and I guess answering that question is even more difficult.
>>
>>
>> Regards,
>> Koji
>>
>> -----Original Message-----
>> From: Phillips, Addison [mailto:addison@lab126.com]
>> Sent: Friday, January 28, 2011 2:22 PM
>> To: Kang-Hao (Kenny) Lu; Koji Ishii
>> Cc: WWW Style; WWW International
>> Subject: RE: What's the definition of a word? (was: [css3-text] line break
>> opportunities are based on *syllable* boundaries?)
>>
>> The term "grapheme cluster" would be wrong for this context. A grapheme
>> cluster is a sequence of logical characters that form a single visual unit
>> of text (what is sometimes perceived as a "character" or "glyph"). This term
>> is used for cases such as an Indic syllable followed by a combining
>> vowel--in which a base character is combined with additional characters to
>> form a single glyph on screen, rather than cases in which separate
>> visual/logical units form a single "word" or "sound". It also applies to
>> cases such as a base letter followed by a combining accent.
>>
>> To help illustrate this, notice that the word "the" is not a grapheme
>> cluster, although it is a single syllable. Notice too that "ソース" consists of
>> *three* graphemes (grapheme clusters), but only two syllables.
>>
>> The relationship of Han ideographs to both "words" and "syllables" is
>> complex and depends both on the language (it is different for Japanese, for
>> example) and on context. It is sometimes true that "ideograph == syllable"
>> and sometimes also true that "ideograph == word".
>>
>> In any case, the concept of "grapheme cluster" should most definitely not
>> be considered to be synonymous with either "word" or "syllable". It is a
>> distinct unit and may not be *either* in a given context. My understanding was
>> that languages written using Han ideographs could be broken anywhere except
>> for certain prescriptive cases (which differ by language). While this might
>> map to some other concept such as syllables, wouldn't it be better to refer
>> specifically to language specific rules?
>>
>> Unicode Standard Annex #14 [1] provides a useful description of
>> line-breaking properties that may be helpful here.
>>
>> Regards,
>>
>> Addison
>>
>> [1] http://www.unicode.org/reports/tr14/
>>
>> Addison Phillips
>> Globalization Architect (Lab126)
>> Chair (W3C I18N, IETF IRI WGs)
>>
>> Internationalization is not a feature.
>> It is an architecture.
>>
>> > -----Original Message-----
>> > From: www-international-request@w3.org
>> > [mailto:www-international-request@w3.org] On Behalf Of Kang-Hao (Kenny) Lu
>> > Sent: Thursday, January 27, 2011 8:43 PM
>> > To: Koji Ishii
>> > Cc: WWW Style; WWW International
>> > Subject: What's the definition of a word? (was: [css3-text] line break
>> > opportunities are based on *syllable* boundaries?)
>> >
>> > > In Chinese, Yi, and Hangul, a character represents a syllable as
>> > > far as I understand, but in Japanese, Kanji characters could have more
>> > > than one syllable, and also there are cases where multiple characters
>> > > represent single syllable (like Kana + prolonged sound mark).
>> > >
>> > > Although this part is not normative, it looks like we should
>> > > replace "syllable" with "grapheme cluster".
>> > >
>> > > Please let me know if this change can be incorrect to any other
>> > > writing systems listed here than Japanese.
>> >
>> > The situation is similar for Chinese as far as I can tell.
>> >
>> > Speaking about this, this is editorial but the last time I read the
>> > spec, I got a little bit perplexed about the definition of "word".
>> > Is there a plan to briefly mention what a "word" is in the introduction
>> > section? Or perhaps there should be a glossary that puts "word" and
>> > "grapheme cluster" together? I doubt that there would be a consistent
>> > and precise definition throughout the spec but a brief and
>> > non-normative introduction seems helpful.
>> >
>> >
>> > Cheers,
>> > Kenny
>> >

--
cheers,
-ambrose
Received on Saturday, 29 January 2011 18:50:06 UTC