
RE: Putting Word-breaking in CLReq?

From: HU, Chunming <hucm@w3.org>
Date: Sat, 25 Jul 2015 20:05:38 +0800
To: "'Phillips, Addison'" <addison@lab126.com>, "'Xiaoqian Wu'" <xiaoqian@w3.org>, <public-clreq-admin@w3.org>
Cc: "'HU, Chunming'" <hucm@w3.org>
Message-ID: <016601d0c6d2$3ba70de0$b2f529a0$@w3.org>
Exactly, Addison.


For some applications (such as search keyword understanding, information retrieval, and knowledge graphs), better segmentation techniques are required (dictionaries, NLP, statistical algorithms, learning algorithms…).


But from the layout point of view, we need not care about the segmentation issue: a line break could be made between any two Chinese characters.


From: Phillips, Addison [mailto:addison@lab126.com] 
Sent: Saturday, July 25, 2015 12:50 AM
To: HU, Chunming; 'Xiaoqian Wu'; public-clreq-admin@w3.org
Subject: RE: Putting Word-breaking in CLReq?


As far as I know, there are no character-level mechanisms for finding actual word boundaries in Chinese. Since Chinese text layout (e.g. line breaking) does not depend on word boundaries, this isn’t an impediment to rendering the text. In addition, most text selection in Chinese is done a character at a time, so word selection depends on the action of the user.


That said, there are applications, such as e-books, in which accurate word selection is important for other purposes (dictionary lookup comes to mind). Normally, word segmentation features in Chinese applications depend on NLP (natural language processing) libraries or on statistical methods. In addition, because algorithms are often wrong (or cannot accurately identify an extended term such as the example given below of “武汉市长江大桥”), the selection boundaries are usually made editable by the user.
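To illustrate why a purely dictionary-driven segmenter can go wrong on “武汉市长江大桥”, here is a minimal Python sketch that enumerates every split into dictionary words. The dictionary `TOY_DICT` and the function `all_segmentations` are hypothetical names invented for this illustration; they are not taken from any NLP library, and real segmenters combine much larger lexicons with statistical or learned models.

```python
# Toy dictionary, assumed purely for illustration; real segmenters use
# large lexicons plus statistical or learned models.
TOY_DICT = {"武汉", "武汉市", "市长", "长江", "大桥", "长江大桥", "江大桥"}


def all_segmentations(text, words=TOY_DICT, start=0):
    """Yield every way to split text[start:] into dictionary words."""
    if start == len(text):
        # Reached the end of the text: one valid (empty) continuation.
        yield []
        return
    for end in range(start + 1, len(text) + 1):
        piece = text[start:end]
        if piece in words:
            # Keep this word and segment the remainder recursively.
            for rest in all_segmentations(text, words, end):
                yield [piece] + rest


for seg in all_segmentations("武汉市长江大桥"):
    print(" / ".join(seg))
```

With this toy dictionary the enumeration includes both 武汉市 / 长江大桥 and 武汉 / 市长 / 江大桥. Every piece is a dictionary entry in each case, so the dictionary alone cannot say which reading is intended, which is one reason selection boundaries are left editable.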




From: HU, Chunming [mailto:hucm@w3.org] 
Sent: Friday, July 24, 2015 5:34 AM
To: 'Xiaoqian Wu'; public-clreq-admin@w3.org
Subject: RE: Putting Word-breaking in CLReq?



? really?



From: public-clreq-admin-request+bounce-hucm=w3.org@listhub.w3.org [mailto:public-clreq-admin-request+bounce-hucm=w3.org@listhub.w3.org] On Behalf Of Xiaoqian Wu
Sent: Friday, July 24, 2015 6:02 PM
To: public-clreq-admin@w3.org
Subject: Putting Word-breaking in CLReq?


In case I forget about this in the next meeting, here’s a request about word-breaking and the relevant discussion. Word breaking is important for the Selection and Editing APIs. Shall we provide some brief answers to this topic in the CLReq?


Q: Does anyone know of character-level mechanisms used to advise algorithms of the word boundaries (or lack of boundaries) in Chinese text?



From: Li Songfeng

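For breaking Chinese body text, apart from the rules that punctuation must not appear at the start of a line, that a single character must not form a line by itself (one character cannot occupy a whole line), and widow/orphan control (in paginated layout, the first line of a paragraph falling at the bottom of a page, or the last line falling at the top of a page), I cannot think of any other rules. Mixed Chinese, Western text, and numbers would be more complicated. If a Chinese heading is too long and has to wrap, word formation does become an issue; for example, the “的” in “……的……” must not appear at the start of the next line.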
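The two line-level rules mentioned above (no punctuation at the start of a line, no single character alone on a line) can be sketched in Python as follows. This is a rough illustration only: `NO_LINE_START`, `is_bad_break`, and `break_lines` are hypothetical names, the punctuation set is an assumed and incomplete sample, line width is counted in characters rather than measured glyphs, and pagination (widow/orphan control) is not handled.

```python
# Assumed, incomplete sample of punctuation that must not start a line.
NO_LINE_START = set(",。、;:?!)》」』”’")


def is_bad_break(text, end):
    """Return True if starting a new line at index `end` breaks a rule."""
    if end >= len(text):
        return False  # nothing follows, so there is no next line to check
    if text[end] in NO_LINE_START:
        return True   # punctuation would begin the next line
    if len(text) - end == 1:
        return True   # a single character would be left alone on the last line
    return False


def break_lines(text, width):
    """Break text into lines of at most `width` characters.

    A break is allowed between any two characters; when a break would
    violate a rule, it is moved earlier so that the preceding character
    is carried down to the next line.
    """
    if width < 2:
        raise ValueError("width must be at least 2")
    lines = []
    start = 0
    while start < len(text):
        end = min(start + width, len(text))
        # Back up while the break is bad, but keep at least one character.
        while end - start > 1 and is_bad_break(text, end):
            end -= 1
        lines.append(text[start:end])
        start = end
    return lines


print(break_lines("今天天气很好,我们去公园。", 6))
print(break_lines("武汉市长江大桥", 3))
```

In the first call the break that would fall before “,” is moved one character earlier, so “好” is carried to the second line; in the second call the middle line is shortened so that “桥” does not stand alone. If no acceptable earlier break exists within a line, the sketch simply falls back to a one-character line rather than searching further.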


From: Zhang Kun






Received on Saturday, 25 July 2015 12:06:05 UTC
