- From: Robert R. Chilton <acip@well.com>
- Date: Mon, 26 Feb 2001 10:23:04 -0500
- To: www-i18n-comments@w3.org
This is a response to the circulation of the W3C Working Draft (26 January 2001) of "Character Model for the World Wide Web 1.0". It focuses on string identity matching and string indexing for characters of the Tibetan Block (U+0F00-U+0FCF).

Every character in the Tibetan Block that has a canonical decomposition is also listed in the Composition Exclusion Table, so Unicode Normalization Form C is equivalent to Unicode Normalization Form D for any string consisting only of Tibetan Block characters. String identity matching and string indexing should therefore be relatively simple for characters in this block. Unfortunately, two characters in the Tibetan Block could pose problems.

1. U+0F7E poses serious problems in string identity matching.

U+0F7E RJES SU NGA RO is erroneously assigned a canonical combining class of zero, whereas it should be assigned the same combining class (cc = 230) as its related forms at U+0F82 and U+0F83. A situation could easily arise in which two strings that are identical in appearance will not match, even after normalization. As an example, here are two different ways that processes might encode the frequently occurring syllable HUUm:

    0F67 0F7E 0F71 0F74    [cc: 0 0 129 132]
  compared to
    0F67 0F71 0F74 0F7E    [cc: 0 129 132 0]

These two strings have identical appearance and meaning and should, after normalization, be an identity match. But because U+0F7E has a canonical combining class of 0, it acts as a starter and is never reordered relative to the marks around it, so the strings will not match even after normalization. This serious problem (of non-matching) can be avoided if U+0F7E is assigned a correct canonical combining class of 230.

2. U+0F84 poses possible problems in string indexing.

U+0F84 HALANTA is erroneously assigned a canonical combining class of nine, putting it in the class of Indic viramas. In other Indic scripts, these "vowel-killers" have a specific control behavior which is not applicable to the Tibetan Block -- where a different encoding model with a set of explicitly combining consonants [U+0F90 to U+0FBC] was adopted. The Tibetan mark halanta/virama (U+0F84) is simply a weak diacritical mark similar to U+0F39 or U+0F82, and it has no control function like that of U+094D. If a process interprets U+0F84 as a class 9 character, the process might assume that U+0F84 is a non-printing character and therefore would not count it as a character during certain types of string indexing.

3. U+0F84 poses possible problems in text selection/cursor positioning.

Similarly, if a process interprets U+0F84 as a class 9 character, the process might assume that U+0F84 is acting (in the manner of an Indic virama) as a joiner, and it might wrongly assume that the glyph for the character that precedes U+0F84 is conjoined into a single ligature with the glyph for the character that follows U+0F84. Due to these erroneous assumptions, the process might expect (e.g., when determining cursor movement/placement and text selection) a display width that does not correspond to the actual display width of the characters in question.

Robert Chilton
Technical Director, Asian Classics Input Project (USA)
UCA & ISO-14651 Specialist, DDC Dzongkha Computing Project (Bhutan)
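
The non-matching case in item 1 and the class 9 assignment in item 2 can be checked with a short script. What follows is only a minimal sketch, assuming Python 3 and the Unicode data bundled with its standard unicodedata module; the names A, B, combining_classes and canonically_equal are hypothetical helpers introduced here for illustration and do not come from the original message.

```python
import unicodedata

# The two encodings of the syllable HUUm from item 1.
A = "\u0f67\u0f7e\u0f71\u0f74"
B = "\u0f67\u0f71\u0f74\u0f7e"


def combining_classes(s):
    """Return the canonical combining class of each code point in s."""
    return [unicodedata.combining(ch) for ch in s]


def canonically_equal(s, t):
    """Compare two strings after normalizing both to NFD."""
    return unicodedata.normalize("NFD", s) == unicodedata.normalize("NFD", t)


if __name__ == "__main__":
    print(combining_classes(A))              # [0, 0, 129, 132]
    print(combining_classes(B))              # [0, 129, 132, 0]
    # U+0F7E has class 0, so normalization never moves it past
    # U+0F71 or U+0F74, and the two strings stay different.
    print(canonically_equal(A, B))           # False
    # U+0F84 carries the class used for Indic viramas.
    print(unicodedata.combining("\u0f84"))   # 9
```

Because neither string contains a character with a canonical decomposition, normalizing to NFC instead of NFD gives the same result for this pair.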
Received on Monday, 26 February 2001 10:25:53 UTC