- From: Richard Ishida <ishida@w3.org>
- Date: Mon, 27 Jul 2015 08:29:53 +0100
- To: "HU, Chunming" <hucm@w3.org>, "'Phillips, Addison'" <addison@lab126.com>, 'Xiaoqian Wu' <xiaoqian@w3.org>, public-clreq-admin@w3.org
FWIW, the issue of determining word boundaries in Chinese came up while we were working on SSML. In places where the boundaries were ambiguous you could use the w element (also known as the token element) to show where the boundaries should be, so that the pronunciation was correct, e.g., see http://www.w3.org/TR/speech-synthesis11/#g3182

<!-- The Nanjing Changjiang River Bridge -->
<token>南京市</token><token>长江大桥</token>
<!-- The mayor of Nanjing city, Jiang Daqiao -->
南京市长<w>江大桥</w>
<!-- Shanghai is a metropolis -->
上海是个<w>大都会</w>
<!-- Most Shanghainese will say something like that -->
上海人<w>大都</w>会那么说

In HTML5 one could presumably achieve the same thing using the (empty) <wbr/> element.

ri

On 25/07/2015 13:05, HU, Chunming wrote:
> Exactly, Addison.
>
> For some applications (such as search keyword understanding,
> information retrieval, and knowledge graphs), better segmentation
> techniques are required (dictionaries, NLP, statistical algorithms,
> learning algorithms…).
>
> But, from the layout point of view, we should not care about the
> segmentation issue; we could make the line break between any two
> Chinese characters.
>
> *From:* Phillips, Addison [mailto:addison@lab126.com]
> *Sent:* Saturday, July 25, 2015 12:50 AM
> *To:* HU, Chunming; 'Xiaoqian Wu'; public-clreq-admin@w3.org
> *Subject:* RE: Putting Word-breaking in CLReq?
>
> As far as I know, there are no character-level mechanisms for finding
> actual word boundaries in Chinese. Since Chinese text layout (e.g. line
> breaking) does not depend on word boundaries, this isn’t an impediment
> to rendering the text. In addition, most text selection in Chinese is
> done a character at a time, making word selection depend on the action
> of the user.
>
> That said, there are applications such as e-books, in which accurate
> word selection is important for other purposes (dictionary lookup comes
> to mind). Normally, word segmentation features in Chinese applications
> depend on NLP (natural language processing) libraries or on statistical
> methods. In addition, because algorithms are often wrong (or cannot
> accurately identify an extended term such as the example given below of
> “武汉市长江大桥”), the selection boundaries are usually made editable by
> the user.
>
> Addison
>
> *From:* HU, Chunming [mailto:hucm@w3.org]
> *Sent:* Friday, July 24, 2015 5:34 AM
> *To:* 'Xiaoqian Wu'; public-clreq-admin@w3.org
> *Subject:* RE: Putting Word-breaking in CLReq?
>
> "For example, the “的” in “……的……” must not appear at the start of
> the next line."
>
> ? really?
>
> A large number of publications can be found that do not follow this
> requirement.
>
> *From:* public-clreq-admin-request+bounce-hucm=w3.org@listhub.w3.org
> *On Behalf Of* Xiaoqian Wu
> *Sent:* Friday, July 24, 2015 6:02 PM
> *To:* public-clreq-admin@w3.org
> *Subject:* Putting Word-breaking in CLReq?
>
> In case I forget about this in the next meeting, here’s a request about
> word-breaking and the relevant discussion. Word breaking is important
> for the Selection and Editing APIs. Shall we provide some brief answers
> to this topic in the CLReq?
>
> Q: Does anyone know of character-level mechanisms used to advise
> algorithms of the word boundaries (or lack of boundaries) in Chinese text?
>
> https://lists.w3.org/Archives/Public/public-html-ig-zh/2015Jul/0004.html
>
> From: Li Songfeng
>
> For word breaking in Chinese body text, apart from the rules that
> punctuation must not appear at the start of a line, that a single
> character must not form a line (one character cannot occupy a whole
> line), and widow/orphan control (in paginated text, the first line of a
> paragraph falling at the bottom of a page, or the last line at the top
> of a page), I cannot think of any other rules. Mixing Chinese with
> Western text and numerals is more complicated. If a Chinese heading is
> too long and has to wrap, word formation does become an issue; for
> example, the “的” in “……的……” must not appear at the start of the
> next line.
>
> From: Zhang Kun
>
> This probably has little to do with typesetting; it is a problem
> specific to Chinese input methods. Western text separates words with
> spaces, whereas Chinese does not, and this can cause problems. For
> example, 武汉市长江大桥 can be segmented in two ways: 武汉市-长江大桥
> and 武汉市长-江大桥. For the question this person is asking, there is
> no particularly good mechanism with current technology.
>
> --
>
> xiaoqian
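
To make the ambiguity in the 武汉市长江大桥 example concrete, here is a minimal sketch of dictionary-based segmentation that lists every way a string can be split into known words. It is only an illustration, not what any real NLP library does: the function name `segmentations` and the four-entry `LEXICON` are hypothetical and chosen just to reproduce the two readings mentioned in the thread.

```python
def segmentations(text, lexicon):
    """Return every way to split `text` into words found in `lexicon`."""
    if not text:
        # One way to segment the empty string: no words at all.
        return [[]]
    results = []
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in lexicon:
            # Keep this word and segment the remainder recursively.
            for rest in segmentations(text[end:], lexicon):
                results.append([head] + rest)
    return results


# Hypothetical toy lexicon; a real segmenter would use a large dictionary
# plus statistics or context to rank the candidate readings.
LEXICON = {"武汉市", "长江大桥", "武汉市长", "江大桥"}

if __name__ == "__main__":
    for words in segmentations("武汉市长江大桥", LEXICON):
        print("-".join(words))
    # prints:
    # 武汉市-长江大桥
    # 武汉市长-江大桥
```

Both readings are valid against the dictionary, so nothing at the character level can choose between them. That is presumably why explicit markup (as in SSML) or user-editable selection boundaries end up being needed.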
Received on Monday, 27 July 2015 07:30:14 UTC