- From: Kang-Hao (Kenny) Lu(吕康豪) <lvkanghao@genomics.cn>
- Date: Tue, 21 Jul 2015 15:26:45 +0800
- To: W3C HTML5 中文興趣小組 <public-html-ig-zh@w3.org>
-------- Original Message --------
Subject: Chinese Word Breaking
Date: Tue, 21 Jul 2015 07:56:33 +0100
From: Richard Wordingham <richard.wordingham@ntlworld.com>
To: unicode@unicode.org

I'm puzzled by a statement in UAX #29 Unicode Text Segmentation:

"In particular, the characters with the Line_Break property values of Contingent_Break (CB), Complex_Context (SA/Southeast Asian), and Unknown (XX) are assigned word boundary property values based on criteria outside of the scope of this annex. That means that satisfactory treatment of languages like Chinese or Thai requires special handling."

Is 'Contingent_Break (CB)' an error for 'Ideographic (ID)'? That would make sense for Chinese, for some applications need to group ideographs into words.

While I am on the topic, does anyone know of character-level mechanisms used to advise algorithms of the word boundaries (or lack of boundaries) in Chinese text?

Richard.
Received on Tuesday, 21 July 2015 07:27:32 UTC