RE: Putting Word-breaking in CLReq? from HU, Chunming on 2015-07-25 (public-clreq-admin@w3.org from July to September 2015)

From: HU, Chunming <hucm@w3.org>
Date: Sat, 25 Jul 2015 20:05:38 +0800
To: "'Phillips, Addison'" <addison@lab126.com>, "'Xiaoqian Wu'" <xiaoqian@w3.org>, <public-clreq-admin@w3.org>
Cc: "'HU, Chunming'" <hucm@w3.org>
Message-ID: <016601d0c6d2$3ba70de0$b2f529a0$@w3.org>

Exactly, Addison.

 

For some applications (such as search keyword understanding, information retrieval, and knowledge graphs), better segmentation techniques are required (dictionaries, NLP, statistical algorithms, learning algorithms…).

 

But from the layout point of view, we need not care about the segmentation issue: a line break can be made between any two Chinese characters.
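
A minimal sketch of that point, assuming a fixed line width counted in characters and text that contains only Chinese characters (no punctuation, Latin text, or digits). The function name `wrap_cjk` is made up for illustration; it is not taken from CLReq or any library.

```python
from typing import List


def wrap_cjk(text: str, width: int) -> List[str]:
    """Split text into lines of at most `width` characters.

    Word boundaries are ignored on purpose: every position between two
    characters is treated as a legal break opportunity.
    """
    if width < 1:
        raise ValueError("width must be at least 1")
    return [text[i:i + width] for i in range(0, len(text), width)]


# The bridge example from this thread, wrapped at three characters per line.
for line in wrap_cjk("武汉市长江大桥", 3):
    print(line)
```

The break after 市 falls in the middle of one possible word and at the edge of another, and the layout is acceptable either way.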

 

From: Phillips, Addison [mailto:addison@lab126.com] 
Sent: Saturday, July 25, 2015 12:50 AM
To: HU, Chunming; 'Xiaoqian Wu'; public-clreq-admin@w3.org
Subject: RE: Putting Word-breaking in CLReq?

 

As far as I know, there are no character-level mechanisms for finding actual word boundaries in Chinese. Since Chinese text layout (e.g. line breaking) does not depend on word boundaries, this isn’t an impediment to rendering the text. In addition, most text selection in Chinese is done one character at a time, so word selection depends on the action of the user.

 

That said, there are applications such as e-books, in which accurate word selection is important for other purposes (dictionary lookup comes to mind). Normally, word segmentation features in Chinese applications depend on NLP (natural language processing) libraries or on statistical methods. In addition, because the algorithms are often wrong (or cannot accurately identify an extended term such as the example given below of “武汉市长江大桥”), the selection boundaries are usually made editable by the user.

 

Addison

 

From: HU, Chunming [mailto:hucm@w3.org] 
Sent: Friday, July 24, 2015 5:34 AM
To: 'Xiaoqian Wu'; public-clreq-admin@w3.org
Subject: RE: Putting Word-breaking in CLReq?

 

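"For example, the “的” in “……的……” must not appear at the start of the next line."

Really?

One can find a large number of publications that do not follow this requirement.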

 

From: public-clreq-admin-request+bounce-hucm=w3.org@listhub.w3.org [mailto:public-clreq-admin-request+bounce-hucm=w3.org@listhub.w3.org] On Behalf Of Xiaoqian Wu
Sent: Friday, July 24, 2015 6:02 PM
To: public-clreq-admin@w3.org
Subject: Putting Word-breaking in CLReq?

 

In case I forget about this in the next meeting, here’s a request about word-breaking and the relevant discussion. Word breaking is important for the Selection and Editing APIs. Shall we provide some brief answers to this topic in the CLReq?

 

Q: Does anyone know of character-level mechanisms used to advise algorithms of the word boundaries (or lack of boundaries) in Chinese text?

https://lists.w3.org/Archives/Public/public-html-ig-zh/2015Jul/0004.html


 

From: Li Songfeng

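For word breaking in Chinese body text, apart from the rules that punctuation must not appear at the start of a line, that a single character must not form a line by itself, and widow/orphan control (in paginated text, the first line of a paragraph falling at the bottom of a page, or its last line falling at the top of a page), I cannot think of any other rules. Mixed Chinese, Western text, and numerals are more complicated. If a Chinese heading is too long and has to wrap, there is indeed a word-formation issue; for example, the “的” in “……的……” must not appear at the start of the next line.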
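
A rough sketch of only the first rule mentioned here (punctuation must not start a line), assuming a fixed width in characters. The function name and the `NO_LINE_START` set are illustrative and incomplete, not an authoritative list; the single-character-line and widow/orphan rules are not handled.

```python
from typing import List

# Illustrative, incomplete set of punctuation that should not start a line.
NO_LINE_START = set("，。、；：？！）》」』”’")


def wrap_no_leading_punct(text: str, width: int) -> List[str]:
    """Wrap at `width` characters, but never break right before punctuation.

    When the next line would start with a forbidden mark, the current line
    is ended earlier so that the preceding character moves down with it.
    """
    if width < 2:
        raise ValueError("width must be at least 2")
    lines = []
    i, n = 0, len(text)
    while i < n:
        end = min(i + width, n)
        # Back up while the next line would begin with forbidden punctuation.
        # A line always keeps at least one character, so a long run of
        # punctuation can still end up at a line start (not handled here).
        while end < n and text[end] in NO_LINE_START and end - i > 1:
            end -= 1
        lines.append(text[i:end])
        i = end
    return lines


for line in wrap_no_leading_punct("今天天气好，我们去公园。", 5):
    print(line)
```

Real layout engines have other options as well, such as letting the punctuation hang past the line end or compressing spacing; this sketch shows only the simplest adjustment.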

 

From: Zhang Kun

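This should have little to do with typesetting; it is a problem specific to Chinese input methods. Western text is segmented into words by spaces, whereas Chinese has none, and that can cause problems. For example, 武汉市长江大桥 can be segmented in two ways: 武汉市-长江大桥 (Wuhan City / Yangtze River Bridge) or 武汉市长-江大桥 (Mayor of Wuhan / Jiang Daqiao). With current technology there is no particularly good mechanism for what this person is asking about.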
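
The ambiguity can be reproduced with a toy dictionary-based segmenter. This is only a sketch: the four-entry `LEXICON` is a hypothetical word list chosen to expose the two readings, and the function `segmentations` is illustrative, not any real library's API.

```python
from typing import List, Set

# Hypothetical toy lexicon containing just the words of both readings.
LEXICON = {"武汉市", "长江大桥", "武汉市长", "江大桥"}


def segmentations(text: str, lexicon: Set[str]) -> List[List[str]]:
    """Return every way to split `text` into words found in `lexicon`."""
    if not text:
        return [[]]  # exactly one way to segment the empty string
    results = []
    for end in range(1, len(text) + 1):
        word = text[:end]
        if word in lexicon:
            # Combine this word with every segmentation of the remainder.
            for rest in segmentations(text[end:], lexicon):
                results.append([word] + rest)
    return results


for words in segmentations("武汉市长江大桥", LEXICON):
    print("-".join(words))
```

Both splits are valid according to the dictionary, so choosing between them takes context, statistics, or a correction by the user.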

 

--

xiaoqian

Received on Saturday, 25 July 2015 12:06:05 UTC