U+0Fnn (Tibetan Block) characters from Robert R. Chilton on 2001-02-26 (www-i18n-comments@w3.org from February 2001)

From: Robert R. Chilton <acip@well.com>
Date: Mon, 26 Feb 2001 10:23:04 -0500
To: www-i18n-comments@w3.org
Message-ID: <3A9A74D8.FA2@well.com>

This is a response to the circulation of the W3C Working Draft (26
January 2001) of "Character Model for the World Wide Web 1.0" and
focuses particularly on considerations of string identity matching and
string indexing as regards characters of the Tibetan Block
(U+0F00-U+0FCF).

Since each and every character in the Tibetan Block that has a canonical
decomposition is also listed in the Composition Exclusion Table, Unicode
Normalization Form C is equivalent to Unicode Normalization Form D for
any string consisting of only Tibetan Block characters.  String identity
matching and string indexing should therefore be relatively simple for
characters in this block.
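
The NFC/NFD equivalence claimed above can be spot-checked with a short
script. This is only a sketch: it assumes Python's standard unicodedata
module, whose bundled Unicode data is newer than the version under
discussion here, and it tests each code point on its own rather than
every possible string. The names TIBETAN_BLOCK, has_canonical_decomposition
and nfc_differs_from_nfd are illustrative.

```python
import unicodedata

# Block range as given in this message (U+0F00-U+0FCF).
TIBETAN_BLOCK = range(0x0F00, 0x0FD0)

def has_canonical_decomposition(cp):
    """True if the code point has a canonical (untagged) decomposition."""
    decomp = unicodedata.decomposition(chr(cp))
    # Compatibility decompositions start with a tag such as "<compat>".
    return bool(decomp) and not decomp.startswith("<")

def nfc_differs_from_nfd(code_points):
    """Return the code points whose NFC and NFD forms are not identical."""
    return [
        cp for cp in code_points
        if unicodedata.normalize("NFC", chr(cp))
        != unicodedata.normalize("NFD", chr(cp))
    ]

decomposable = [cp for cp in TIBETAN_BLOCK if has_canonical_decomposition(cp)]
mismatches = nfc_differs_from_nfd(TIBETAN_BLOCK)

print("canonically decomposable:", [f"U+{cp:04X}" for cp in decomposable])
print("NFC != NFD for:", [f"U+{cp:04X}" for cp in mismatches])
```

An empty second list is consistent with the claim: a character that has
a canonical decomposition is never recomposed by NFC when it is excluded
from composition.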

Unfortunately, two characters in the Tibetan Block, U+0F7E and U+0F84,
could pose problems, as described in the three points below.

1.  U+0F7E poses serious problems in string identity matching.

U+0F7E RJES SU NGA RO is erroneously assigned a canonical combining
class of zero whereas it should be assigned the same combining class
(cc = 230) as its related forms at U+0F82 and U+0F83.  A situation
could easily arise wherein two strings which are identical in appearance
will not match, even after normalization.  As an example, here are two
different ways that processes might encode the frequently occurring
syllable HUUm:

     0F67 0F7E 0F71 0F74   compared to   0F67 0F71 0F74 0F7E
[cc:    0    0  129  132           cc:      0  129  132    0 ]

These two strings are identical in appearance and meaning and should,
after normalization, be an identity match.  But because U+0F7E has a
canonical combining class of 0, canonical ordering never moves it
relative to the neighboring marks, so the two sequences remain different
and will not match even after normalization.  This serious non-matching
problem can be avoided if U+0F7E is assigned the correct canonical
combining class of 230.
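
The HUUm comparison can be reproduced as a rough sketch, again assuming
Python's standard unicodedata module (and whatever Unicode data version
it bundles). The helper names combining_classes and
equal_after_normalization are illustrative only.

```python
import unicodedata

# The two encodings of HUUm given above.
seq_a = "\u0F67\u0F7E\u0F71\u0F74"
seq_b = "\u0F67\u0F71\u0F74\u0F7E"

def combining_classes(text):
    """Canonical combining class of each character, in string order."""
    return [unicodedata.combining(ch) for ch in text]

def equal_after_normalization(a, b, form):
    """Compare two strings after applying the same normalization form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)

print("cc of first sequence: ", combining_classes(seq_a))
print("cc of second sequence:", combining_classes(seq_b))
for form in ("NFC", "NFD"):
    print(form, "match:", equal_after_normalization(seq_a, seq_b, form))
```

With U+0F7E in class 0, neither form reorders it, so both comparisons
report a mismatch.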

2.  U+0F84 poses possible problems in string indexing.

U+0F84 HALANTA is erroneously assigned a canonical combining class of
nine, putting it in the class of Indic viramas.  In other Indic scripts,
these "vowel-killers" have a specific control behavior which is not
applicable to the Tibetan Block -- where a different encoding model with
a set of explicitly combining consonants [U+0F90 to U+0FBC] was adopted.

The Tibetan mark halanta/virama (U+0F84) is simply a weak diacritical
mark similar to U+0F39 or U+0F82 and it has no control function like
U+094D.  If a process interprets the U+0F84 as a class 9 character,
the process might assume that U+0F84 is a non-printing character and
therefore would not count it as a character during certain types of
string indexing.

3.  U+0F84 poses possible problems in text selection/cursor positioning.

Similarly, if a process interprets U+0F84 as a class 9 character, it
might assume that U+0F84 is acting as a joiner (in the manner of an
Indic virama) and wrongly conclude that the glyph for the character
preceding U+0F84 is conjoined into a single ligature with the glyph for
the character following it.  Based on these erroneous assumptions, the
process might expect (e.g., when determining cursor movement/placement
and text selection) a display width that does not correspond to the
actual display width of the characters in question.
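
The class-9 assignment behind points 2 and 3 can be inspected with the
sketch below, assuming Python's standard unicodedata module. The function
naive_index_length is hypothetical: it stands in for a process that
treats every class-9 character as non-printing, and is not taken from
any real implementation.

```python
import unicodedata

HALANTA = "\u0F84"  # Tibetan mark halanta

def naive_index_length(text):
    """Hypothetical indexer that skips class-9 characters as non-printing."""
    return sum(1 for ch in text if unicodedata.combining(ch) != 9)

# Combining classes of the marks mentioned in this message.
for cp in (0x0F84, 0x094D, 0x0F39, 0x0F82):
    print(f"U+{cp:04X} cc={unicodedata.combining(chr(cp))}")

# TIBETAN LETTER KA followed by the halanta: two characters in the string.
sample = "\u0F40" + HALANTA
print("characters in string:", len(sample))
print("counted by naive indexer:", naive_index_length(sample))
```

The naive indexer counts one character fewer than the string contains,
which is the kind of indexing discrepancy described in point 2.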


Robert Chilton
Technical Director, Asian Classics Input Project (USA)
UCA & ISO-14651 Specialist, DDC Dzongkha Computing Project (Bhutan)
Received on Monday, 26 February 2001 10:25:53 UTC