Re: Putting Word-breaking in CLReq? from Richard Ishida on 2015-07-27 (public-clreq-admin@w3.org from July to September 2015)

From: Richard Ishida <ishida@w3.org>
Date: Mon, 27 Jul 2015 08:29:53 +0100
To: "HU, Chunming" <hucm@w3.org>, "'Phillips, Addison'" <addison@lab126.com>, 'Xiaoqian Wu' <xiaoqian@w3.org>, public-clreq-admin@w3.org
Message-ID: <55B5DDF1.40602@w3.org>
fwiw, the issue of determining word boundaries in Chinese came up while 
we were working on SSML. In places where the boundaries were ambiguous 
you could use the w element (also known as token element), to show where 
the boundaries should be so that the pronunciation was correct, e.g. see 
http://www.w3.org/TR/speech-synthesis11/#g3182

   <!-- The Nanjing Changjiang River Bridge -->
   <token>南京市</token><token>长江大桥</token>
   <!-- The mayor of Nanjing city, Jiang Daqiao -->
   南京市长<w>江大桥</w>
   <!-- Shanghai is a metropolis -->
   上海是个<w>大都会</w>
   <!-- Most Shanghainese will say something like that -->
   上海人<w>大都</w>会那么说
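
A minimal Python sketch of why a string like the first one needs explicit
markup. It assumes a toy lexicon invented for illustration (not taken from
any real dictionary), and the names `segmentations` and `to_ssml_tokens`
are hypothetical. With that lexicon the string splits into words in two
ways, and whichever split is intended can then be wrapped in token elements.

```python
from xml.sax.saxutils import escape

# Toy lexicon, assumed for illustration only.
LEXICON = {"南京市", "长江大桥", "南京市长", "江大桥"}


def segmentations(text, lexicon):
    """Return every way to split text into words that all occur in lexicon."""
    if not text:
        return [[]]
    results = []
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in lexicon:
            # Keep this prefix as a word and segment the remainder recursively.
            for rest in segmentations(text[end:], lexicon):
                results.append([head] + rest)
    return results


def to_ssml_tokens(words):
    """Wrap each word in an SSML 1.1 token element."""
    return "".join(f"<token>{escape(w)}</token>" for w in words)


candidates = segmentations("南京市长江大桥", LEXICON)
for words in candidates:
    print(to_ssml_tokens(words))
```

Both candidates are valid according to the lexicon, so a program cannot pick
one without more context; the markup records the author's choice.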

In HTML5 one could presumably achieve the same thing using the (empty) 
<wbr/> element.
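
As a rough sketch of that idea, assuming the text has already been split
into words by some other means: the hypothetical helper `join_with_wbr`
below joins the words with an empty wbr element. Note that HTML defines wbr
as a line break opportunity, so whether a consumer would treat it as a word
boundary is an assumption.

```python
from html import escape


def join_with_wbr(words):
    """Join pre-segmented words, placing an empty <wbr> element between them."""
    return "<wbr>".join(escape(w) for w in words)


marked_up = join_with_wbr(["上海人", "大都", "会那么说"])
print(marked_up)
```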

ri




On 25/07/2015 13:05, HU, Chunming wrote:
> Exactly, Addison.
>
> For some applications (such as search key word understanding,
> information retrieval & knowledge graph), better segmentation techniques
> are required (dictionary, NLP, statistical algorithms, learning algorithms…).
>
> But from the layout point of view we do not need to care about the
> segmentation issue, since a line break can be made between any two
> Chinese characters.
>
> *From:*Phillips, Addison [mailto:addison@lab126.com]
> *Sent:* Saturday, July 25, 2015 12:50 AM
> *To:* HU, Chunming; 'Xiaoqian Wu'; public-clreq-admin@w3.org
> *Subject:* RE: Putting Word-breaking in CLReq?
>
> As far as I know, there are no character-level mechanisms for finding
> actual word boundaries in Chinese. Since Chinese text layout (e.g. line
> breaking) does not depend on word boundaries, this isn’t an impediment
> to rendering the text. In addition, most text selection in Chinese is
> done a character at a time, making word selection depend on the action
> of the user.
>
> That said, there are applications such as e-books, in which accurate
> word selection is important for other purposes (dictionary lookup comes
> to mind). Normally word segmentation features in Chinese applications
> depend on NLP (natural language processing) libraries or on statistical
> methods. In addition, because algorithms are often wrong (or cannot
> accurately identify an extended term such as the example given below of
> “武汉市长江大桥”), the selection boundaries are usually made editable by
> the user.
>
> Addison
>
> *From:*HU, Chunming [mailto:hucm@w3.org]
> *Sent:* Friday, July 24, 2015 5:34 AM
> *To:* 'Xiaoqian Wu'; public-clreq-admin@w3.org
> *Subject:* RE: Putting Word-breaking in CLReq?
>
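> For example, the “的” in “……的……” must not appear at the start of the
> next line.
>
> ? Really?
>
> One can find a large number of publications that do not follow this
> requirement.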
>
> *From:*public-clreq-admin-request+bounce-hucm=w3.org@listhub.w3.org
> *On Behalf Of *Xiaoqian Wu
> *Sent:* Friday, July 24, 2015 6:02 PM
> *To:* public-clreq-admin@w3.org
> *Subject:* Putting Word-breaking in CLReq?
>
> In case I forget about this in the next meeting, here’s a request about
> word-breaking and the relevant discussion. Word breaking is important
> for the Selection and Editing APIs. Shall we provide some brief answers
> to this topic in the CLReq?
>
> Q: Does anyone know of character level mechanisms used to advise
> algorithms of the word boundaries (or lack of boundaries) in Chinese text?
>
> https://lists.w3.org/Archives/Public/public-html-ig-zh/2015Jul/0004.html
>
> From: Li Songfeng
>
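> For word breaking in Chinese body text, apart from the rules that
> punctuation must not appear at the start of a line, that a single
> character must not form a line (one character cannot occupy a line by
> itself), and widow/orphan control (when paginating, the first line of a
> paragraph falling at the bottom of a page, or the last line falling at
> the top of a page), I cannot think of any other rules. Mixing Chinese
> with Western text and numerals is more complicated. If a Chinese heading
> is too long and has to wrap, word formation really is an issue; for
> example, the “的” in “……的……” must not appear at the start of the next
> line.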
>
> From: Zhang Kun
>
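> This probably has little to do with typesetting; it is a problem
> specific to Chinese input methods. Western text separates words with
> spaces, but Chinese does not, and that can cause problems. For example,
> 武汉市长江大桥 can be segmented in two ways: 武汉市-长江大桥 (Wuhan
> City, Yangtze River Bridge) or 武汉市长-江大桥 (Mayor of Wuhan, Jiang
> Daqiao). With current technology there is no particularly good mechanism
> for what this person is asking about.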
>
> --
>
> xiaoqian
>
Received on Monday, 27 July 2015 07:30:14 UTC