Re: Putting Word-breaking in CLReq?

fwiw, the issue of determining word boundaries in Chinese came up while 
we were working on SSML. Where the boundaries were ambiguous, you could 
use the w element (also known as the token element) to show where the 
boundaries should be, so that the pronunciation was correct. E.g. see 
http://www.w3.org/TR/speech-synthesis11/#g3182

   <!-- The Nanjing Changjiang River Bridge -->
   <token>南京市</token><token>长江大桥</token>
   <!-- The mayor of Nanjing city, Jiang Daqiao -->
   南京市长<w>江大桥</w>
   <!-- Shanghai is a metropolis -->
   上海是个<w>大都会</w>
   <!-- Most Shanghainese will say something like that -->
   上海人<w>大都</w>会那么说
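
To make the ambiguity concrete, here is a minimal Python sketch (not 
anything taken from the SSML spec). It enumerates every way to split a 
string into entries of a lexicon, then wraps one chosen reading in token 
elements. The names segmentations and to_ssml_tokens and the four-entry 
lexicon are invented for illustration; a real segmenter would use a 
large dictionary plus statistical or NLP methods.

```python
from xml.sax.saxutils import escape


def segmentations(text, lexicon):
    """Return every way to split `text` into words found in `lexicon`."""
    if not text:
        return [[]]  # one way to segment the empty string: no words
    results = []
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in lexicon:
            for rest in segmentations(text[end:], lexicon):
                results.append([head] + rest)
    return results


def to_ssml_tokens(words):
    """Wrap each word of one chosen segmentation in an SSML token element."""
    return "".join(f"<token>{escape(word)}</token>" for word in words)


# Toy lexicon, assumed for this example only.
LEXICON = {"南京市", "南京市长", "长江大桥", "江大桥"}

readings = segmentations("南京市长江大桥", LEXICON)
for words in readings:
    print(" | ".join(words))

# Pick the "bridge" reading explicitly and mark it up.
print(to_ssml_tokens(readings[0]))
```

With this toy lexicon the string has exactly two readings (city + bridge, 
mayor + personal name), which is the situation where explicit markup is 
needed to choose one.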

In HTML5 one could presumably achieve the same thing using the (empty) 
<wbr/> element.
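
A rough sketch of what that might look like, assuming the word 
boundaries are already known (here taken by hand from the last SSML 
example above). The helper name mark_boundaries is hypothetical. Note 
that HTML defines wbr as a line-break opportunity, so treating it as a 
word-boundary hint is an assumption about the consuming application, 
not something HTML itself promises.

```python
from html import escape


def mark_boundaries(words):
    """Join pre-segmented words with <wbr>, escaping any markup characters."""
    return "<wbr>".join(escape(word) for word in words)


# Boundaries from the SSML example: 上海人 / 大都 / 会那么说
print(mark_boundaries(["上海人", "大都", "会那么说"]))
```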

ri




On 25/07/2015 13:05, HU, Chunming wrote:
> Exactly, Addison.
>
> For some applications (such as search keyword understanding,
> information retrieval & knowledge graphs), better segmentation techniques
> are required (dictionaries, NLP, statistical algorithms, learning algorithms…).
>
> But from the layout point of view we need not care about the
> segmentation issue: we could make a line break between any two Chinese
> characters.
>
> *From:*Phillips, Addison [mailto:addison@lab126.com]
> *Sent:* Saturday, July 25, 2015 12:50 AM
> *To:* HU, Chunming; 'Xiaoqian Wu'; public-clreq-admin@w3.org
> *Subject:* RE: Putting Word-breaking in CLReq?
>
> As far as I know, there are no character-level mechanisms for finding
> actual word boundaries in Chinese. Since Chinese text layout (e.g. line
> breaking) does not depend on word boundaries, this isn’t an impediment
> to rendering the text. In addition, most text selection in Chinese is
> done one character at a time, so word selection depends on the action
> of the user.
>
> That said, there are applications such as e-books, in which accurate
> word selection is important for other purposes (dictionary lookup comes
> to mind). Normally, word segmentation features in Chinese applications
> depend on NLP (natural language processing) libraries or on statistical
> methods. In addition, because algorithms are often wrong (or cannot
> accurately identify an extended term such as the example given below of
> “武汉市长江大桥”), the selection boundaries are usually made editable by
> the user.
>
> Addison
>
> *From:*HU, Chunming [mailto:hucm@w3.org]
> *Sent:* Friday, July 24, 2015 5:34 AM
> *To:* 'Xiaoqian Wu'; public-clreq-admin@w3.org
> *Subject:* RE: Putting Word-breaking in CLReq?
>
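> For example, the “的” in “……的……” must not appear at the start of the
> next line.
>
> ? really?
>
> One can find plenty of publications that do not follow this requirement.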
>
> *From:*public-clreq-admin-request+bounce-hucm=w3.org@listhub.w3.org
> [mailto:public-clreq-admin-request+bounce-hucm=w3.org@listhub.w3.org]
> *On Behalf Of *Xiaoqian Wu
> *Sent:* Friday, July 24, 2015 6:02 PM
> *To:* public-clreq-admin@w3.org
> *Subject:* Putting Word-breaking in CLReq?
>
> In case I forget about this in the next meeting, here’s a request about
> word-breaking and the relevant discussion. Word breaking is important
> for the Selection and Editing APIs. Shall we provide some brief answers
> to this topic in the CLReq?
>
> Q: Does anyone know of character-level mechanisms used to advise
> algorithms of the word boundaries (or lack of boundaries) in Chinese text?
>
> https://lists.w3.org/Archives/Public/public-html-ig-zh/2015Jul/0004.html
>
> From: Li Songfeng
>
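> For breaking in Chinese body text, apart from punctuation not being
> allowed at the start of a line, a single character not forming a line
> (one character must not occupy a line by itself), and widow/orphan
> control (when paginating, the first line of a paragraph falling at the
> bottom of a page, or its last line at the top of a page), I cannot
> think of any other rules. Mixing Chinese with Western text and numerals
> is more complicated. If a Chinese heading is too long and has to wrap,
> word formation does become an issue; for example, the “的” in “……的……”
> must not appear at the start of the next line.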
>
> From: Zhang Kun
>
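> This probably has little to do with typesetting; it is a problem
> specific to Chinese input methods. Western text separates words with
> spaces, but Chinese does not, and this can cause problems. For example,
> 武汉市长江大桥 can be segmented in two ways: 武汉市-长江大桥 (Wuhan
> City, Yangtze River Bridge) or 武汉市长-江大桥 (the mayor of Wuhan,
> Jiang Daqiao). With current technology there is no particularly good
> mechanism for what this person is asking about.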
>
> --
>
> xiaoqian
>

Received on Monday, 27 July 2015 07:30:14 UTC