Early uniform normalization: experience from Thailand from James Clark on 2003-04-24 (www-i18n-comments@w3.org from April 2003)

From: James Clark <jjc@jclark.com>
Date: Thu, 24 Apr 2003 18:07:49 +0700
To: www-i18n-comments@w3.org
Message-ID: <3EA7C585.90600@jclark.com>

Thailand has been using a policy of what amounts to early uniform 
normalization for some years.  See

   http://linux.thai.net/thep/th-xim/

for some background.

I think there are some important aspects of how this has been 
implemented that are inconsistent with the "Responsibility for 
Normalization" section of the Character Model WD, and some important 
aspects that are not mentioned.

One important aspect is the handling of suspect text.  Let's consider a 
concrete example:

(1) ko kai + sara u + mai ek
(2) ko kai + mai ek + sara u

(1) is a single grapheme cluster that will be displayed in a single 
cell. (1) is the NFC of (2).  Thai-aware editing software would never 
create (2).  The interesting point is that there is a WTT rule about 
how (2) should be displayed: it says the sara u should be displayed in a 
separate cell (e.g. below a dotted circle), i.e. (2) should not be 
displayed the same as (1).  This is different from the normal Unicode 
behaviour as I understand it: Unicode says (1) and (2) are equivalent 
and should display the same.  In the context of a policy of early uniform 
normalization the Thai policy seems to me to make a lot of sense: the 
general principle is that non-normalized text should not be displayed 
like its normalized equivalent.  This means that if early uniform 
normalization has gone wrong and unnormalized text has crept in, the 
user can easily see it (without having to resort to od!).
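
To make the example concrete, here is a minimal Python sketch using the 
standard unicodedata module.  The code points are the Unicode ones for 
ko kai, sara u and mai ek; the helper names (to_nfc, is_nfc) are mine, 
purely for illustration.

```python
import unicodedata

KO_KAI = "\u0e01"  # THAI CHARACTER KO KAI (base consonant)
SARA_U = "\u0e38"  # THAI CHARACTER SARA U (below-base vowel)
MAI_EK = "\u0e48"  # THAI CHARACTER MAI EK (tone mark)

seq1 = KO_KAI + SARA_U + MAI_EK  # (1) ko kai + sara u + mai ek
seq2 = KO_KAI + MAI_EK + SARA_U  # (2) ko kai + mai ek + sara u


def to_nfc(text: str) -> str:
    """Return the NFC form of text."""
    return unicodedata.normalize("NFC", text)


def is_nfc(text: str) -> bool:
    """True if text is already in NFC."""
    return to_nfc(text) == text


# Canonical ordering sorts combining marks by combining class, and
# sara u has a lower class than mai ek, so (2) is reordered into (1).
print(unicodedata.combining(SARA_U) < unicodedata.combining(MAI_EK))
print(to_nfc(seq2) == seq1)
print(is_nfc(seq1), is_nfc(seq2))
```

A check like is_nfc is what a display layer could use to decide whether 
to apply the WTT-style "show it differently" treatment.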

This seems not quite the same as the recommendation of the character 
model draft: "A text-processing component that receives suspect text 
MUST NOT perform any normalization-sensitive operations unless it has 
first confirmed through inspection that the text is in normalized form".  
Suppose a text editor loaded a file containing (2) and displayed it 
per the WTT rules, and suppose the user positioned the caret after the 
sara u and hit backspace.  I think the right thing for the text editor 
to do is to remove the sara u.  But this seems to be in conflict with 
the rule in the WD since deleting the previous character is a 
normalization-sensitive operation.

I think the second part of the rule is right: don't normalize the text 
(unless the user explicitly asks for it).  I'm not exactly sure how to 
rephrase the first part.  How about something like this?

- If normalization is a prerequisite for an operation, then don't 
perform the operation unless it ... is in normalization form

- If the operation is meaningful on both the normalized and 
unnormalized form then perform it on the text as it is: don't 
automatically normalize it first.

- Display the unnormalized form so that it is distinguishable from the 
normalized form.
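
A rough sketch of how these three rules might look in code, assuming 
NFC as the normalization form.  All the names here (require_nfc, 
identifiers_match, backspace, display_style) are hypothetical and only 
meant to illustrate the split between the three cases; they are not 
taken from the WD or from any existing editor.

```python
import unicodedata


class NotNormalizedError(ValueError):
    """Raised when an operation that needs NFC input gets something else."""


def is_nfc(text: str) -> bool:
    return unicodedata.normalize("NFC", text) == text


def require_nfc(text: str) -> str:
    # Rule 1: inspect, and refuse rather than silently normalize.
    if not is_nfc(text):
        raise NotNormalizedError("text is not in NFC")
    return text


def identifiers_match(a: str, b: str) -> bool:
    # An operation for which normalization is a prerequisite.
    return require_nfc(a) == require_nfc(b)


def backspace(text: str, caret: int) -> tuple[str, int]:
    # Rule 2: meaningful on either form, so act on the text as it is.
    if caret <= 0:
        return text, 0
    return text[:caret - 1] + text[caret:], caret - 1


def display_style(text: str) -> str:
    # Rule 3: tell the renderer to make unnormalized text visibly different.
    return "normal" if is_nfc(text) else "flag-unnormalized"


seq2 = "\u0e01\u0e48\u0e38"          # (2) ko kai + mai ek + sara u
print(display_style(seq2))            # flagged, not shown like (1)
print(backspace(seq2, len(seq2)))     # removes the sara u, no normalization
```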

There are also some subtleties in the normal UI behaviour of Thai text 
editors that work to support early uniform normalization.  This is 
related to the c + z + cedilla example.  In the context of a user 
interface, I think this is going to cause some puzzlement: if a text 
editor contains this string and the caret is between the c and the z and 
the user hits delete, I think it would be very surprising to a user if 
the result was a single c-cedilla, because this would mean that the 
character preceding the caret had changed.  This puzzlement can be 
avoided by having the delete key delete the entire grapheme cluster 
(i.e., the z together with the cedilla).  The way Thai editors work is 
as follows:
follows:

- the caret can never be positioned in between a base and a composing 
character

- delete removes the entire immediately following grapheme cluster

- backspace removes the immediately preceding character; there's no need 
for it to delete the entire preceding grapheme cluster (indeed it's 
annoying to users if the entire grapheme cluster is deleted)

With these rules, the only editing operation that can require 
normalization is when the user types a composing character.  These rules 
make it easy and natural for a user to edit text while preserving 
normalization.  Windows XP (i.e. Uniscribe) appears to implement these 
rules (not just for Thai but for combining characters in general), but 
lots of other software doesn't get this right. I think some hints about 
this in the WD would be useful.
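
Here is a small Python sketch of the three editing rules, applied to 
the c + z + cedilla string.  It uses a deliberately simplified notion of 
grapheme cluster (a character plus any following characters whose 
general category is a mark); that is my assumption for illustration, 
not the full Unicode segmentation rules and not what any particular 
editor actually does.  The function names are hypothetical.

```python
import unicodedata


def is_mark(ch: str) -> bool:
    # General categories Mn, Mc, Me: treated here as "composing" characters.
    return unicodedata.category(ch).startswith("M")


def cluster_end(text: str, start: int) -> int:
    """Index just past the simplified cluster that begins at start."""
    end = start + 1
    while end < len(text) and is_mark(text[end]):
        end += 1
    return end


def valid_caret(text: str, pos: int) -> bool:
    # Rule 1: never between a base and a composing character.
    return pos == 0 or pos == len(text) or not is_mark(text[pos])


def delete_forward(text: str, caret: int) -> str:
    # Rule 2: delete removes the entire following cluster.
    if caret >= len(text):
        return text
    return text[:caret] + text[cluster_end(text, caret):]


def backspace(text: str, caret: int) -> tuple[str, int]:
    # Rule 3: backspace removes only the preceding character.
    if caret <= 0:
        return text, 0
    return text[:caret - 1] + text[caret:], caret - 1


text = "cz\u0327"  # c + z + COMBINING CEDILLA
naive = text[:1] + text[2:]  # deleting just the z leaves c + cedilla
print(unicodedata.normalize("NFC", naive) == "\u00e7")  # the surprising result
print(delete_forward(text, 1) == "c")                   # cluster delete avoids it
print(valid_caret(text, 2))                             # caret before the cedilla
```

With backspace limited to one character and the caret never inside a 
cluster, what follows the caret is always a whole cluster or nothing, 
so neither key can leave a stray composing character to attach to a 
different base.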

James
Received on Thursday, 24 April 2003 07:08:59 UTC