
Fwd: Early uniform normalization: experience from Thailand

From: Martin Duerst <duerst@w3.org>
Date: Thu, 24 Apr 2003 08:53:37 -0400
To: w3c-i18n-ig@w3.org
Cc: www-i18n-comments@w3.org, James Clark <jjc@jclark.com>

Forwarded from www-i18n-comments. I have kept James and www-i18n-comments
in the loop to facilitate followup discussion. I'll also put this mail
on the agenda of next week's teleconf.            Regards, Martin.

>Date: Thu, 24 Apr 2003 18:07:49 +0700
>From: James Clark <jjc@jclark.com>
>To: www-i18n-comments@w3.org
>Subject: Early uniform normalization: experience from Thailand
>X-Archived-At: http://www.w3.org/mid/3EA7C585.90600@jclark.com
>List-Id: <www-i18n-comments.w3.org>

>Thailand has been using a policy of what amounts to early uniform 
>normalization for some years.  See
>   http://linux.thai.net/thep/th-xim/
>for some background.
>I think there are some important aspects of how this has been implemented 
>that are inconsistent with the "Responsibility for Normalization" section 
>of the Character Model WD, and some important aspects that are not mentioned.
>One important aspect is the handling of suspect text.  Let's consider a 
>concrete example:
>(1) ko kai + sara u + mai ek
>(2) ko kai + mai ek + sara u
>(1) is a single grapheme cluster that will be displayed in a single cell. 
>(1) is the NFC of (2).   Thai aware editing software would never create 
>(2).  The interesting point is there is a WTT rule about how (2) should be 
>displayed: it says the sara u should be displayed in a separate cell (e.g. 
>below a dotted circle), i.e. (2) should not be displayed the same as 
>(1).  This is different from the normal Unicode behaviour as I understand it: 
>Unicode says (1) and (2) are equivalent and should display the same.  In 
>the context of a policy of early uniform normalization the Thai policy 
>seems to me to make a lot of sense: the general principle is that 
>non-normalized text should not be displayed like its normalized 
>equivalent.  This means that if early uniform normalization has gone wrong 
>and unnormalized text has crept in, the user can easily see it (without 
>having to resort to od!).
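
[Editorial illustration, not part of the original message.] The relationship between sequences (1) and (2) can be checked with a short sketch. It assumes Python's standard unicodedata module and the code points U+0E01 (ko kai), U+0E38 (sara u) and U+0E48 (mai ek); the names seq1, seq2 and is_nfc are made up for this example.

```python
import unicodedata

KO_KAI = "\u0e01"  # THAI CHARACTER KO KAI (base consonant)
SARA_U = "\u0e38"  # THAI CHARACTER SARA U (below vowel)
MAI_EK = "\u0e48"  # THAI CHARACTER MAI EK (tone mark)

seq1 = KO_KAI + SARA_U + MAI_EK  # sequence (1) from the message
seq2 = KO_KAI + MAI_EK + SARA_U  # sequence (2) from the message


def is_nfc(text: str) -> bool:
    """Return True if text is already in Normalization Form C."""
    return unicodedata.normalize("NFC", text) == text


# Canonical reordering sorts combining marks by combining class,
# so normalizing (2) is expected to yield (1).
print(unicodedata.combining(SARA_U), unicodedata.combining(MAI_EK))
print(unicodedata.normalize("NFC", seq2) == seq1)
print(is_nfc(seq1), is_nfc(seq2))
```

Because sara u has a lower canonical combining class than mai ek, normalization moves it in front of the tone mark, which is why (2) is the sequence a display engine following the WTT rule would flag.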
>This seems not quite the same as the recommendation of the character model 
>draft: "A text-processing component that receives suspect text MUST NOT 
>perform any normalization-sensitive operations unless it has first 
>confirmed through inspection that the text is in normalized 
>form".  Suppose a text editor loaded a file containing (2) and displayed 
>it per the WTT rules, and suppose the user positioned the caret after the 
>sara u and hit backspace.  I think the right thing for the text editor 
>to do is to remove the sara u.  But this seems to be in conflict with the 
>rule in the WD since deleting the previous character is a 
>normalization-sensitive operation.
>I think the second part of the rule is right: don't normalize the text 
>(unless the user explicitly asks for it).  I'm not exactly sure how to 
>rephrase the first part.  How about something like this?
>- If normalization is a prerequisite for an operation, then don't perform 
>the operation unless it ... is in normalization form
>- If the operation is meaningful on both the normalized and 
>unnormalized form then perform it on the text as it is: don't 
>automatically normalize it first.
>- Display the unnormalized form so that it is distinguishable from the 
>normalized form.
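
[Editorial illustration, not part of the original message.] The first two proposed rules might be sketched as follows. This is only one possible reading, assuming NFC as the normalization form; apply_operation, NotNormalizedError and the requires_normalized flag are hypothetical names, not taken from the Character Model WD.

```python
import unicodedata
from typing import Callable, TypeVar

T = TypeVar("T")


class NotNormalizedError(ValueError):
    """Raised when an operation needs normalized text but did not get it."""


def apply_operation(
    text: str,
    operation: Callable[[str], T],
    requires_normalized: bool,
) -> T:
    """Run operation on text without ever normalizing it implicitly."""
    if requires_normalized and unicodedata.normalize("NFC", text) != text:
        # Rule 1: normalization is a prerequisite, so refuse to proceed.
        raise NotNormalizedError("text is not in NFC")
    # Rule 2: the operation is meaningful either way, so use the text as it is.
    return operation(text)


suspect = "\u0e01\u0e48\u0e38"  # ko kai + mai ek + sara u, i.e. sequence (2)

# Counting code points is meaningful on unnormalized text.
print(apply_operation(suspect, len, requires_normalized=False))
```

The third rule (displaying unnormalized text distinguishably) is a rendering concern and is left out of this sketch.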
>There are also some subtleties in the normal UI behaviour of Thai text 
>editors that work to support early uniform normalization.  This is related 
>to the c + z + cedilla example.  In the context of a user interface, I 
>think this is going to cause some puzzlement: if a text editor contains 
>this string and the caret is between the c and the z and the user hits 
>delete, I think it would be very surprising to a user if the result was a 
>single c-cedilla, because this would mean that the character preceding the 
>caret had changed.  This puzzlement can be avoided by having the delete 
>key delete the entire grapheme cluster (i.e., the z together with the 
>cedilla).  The way Thai editors work is as follows:
>- the caret can never be positioned in between a base and a composing 
>character
>- delete removes the entire immediately following grapheme cluster
>- backspace removes the immediately preceding character; there's no need 
>for it to delete the entire preceding grapheme cluster (indeed it's 
>annoying to users if the entire grapheme cluster is deleted)
>With these rules, the only editing operation that can require 
>normalization is when the user types a composing character.  These rules 
>make it easy and natural for a user to edit text while preserving 
>normalization.  Windows XP (i.e. Uniscribe) appears to implement these 
>rules (not just for Thai but for combining characters in general), but 
>lots of other software doesn't get this right. I think some hints about 
>this in the WD would be useful.
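
[Editorial illustration, not part of the original message.] The delete and backspace rules described above can be sketched roughly as below. The sketch approximates a grapheme cluster as one code point plus any following code points whose general category starts with "M"; real editors would use full grapheme cluster segmentation. The function names are hypothetical, and the caret is assumed to sit on a cluster boundary already.

```python
import unicodedata


def is_mark(ch: str) -> bool:
    """Approximate 'composing character' as any Unicode mark (Mn, Mc, Me)."""
    return unicodedata.category(ch).startswith("M")


def cluster_end(text: str, pos: int) -> int:
    """Return the index just past the cluster that starts at pos."""
    end = pos + 1
    while end < len(text) and is_mark(text[end]):
        end += 1
    return end


def delete_forward(text: str, caret: int) -> tuple[str, int]:
    """Delete key: remove the whole cluster that follows the caret."""
    if caret >= len(text):
        return text, caret
    return text[:caret] + text[cluster_end(text, caret):], caret


def backspace(text: str, caret: int) -> tuple[str, int]:
    """Backspace key: remove only the single code point before the caret."""
    if caret == 0:
        return text, caret
    return text[: caret - 1] + text[caret:], caret - 1


# c + z + combining cedilla, caret between the c and the z:
# the z goes away together with its cedilla, so the c is left unchanged.
print(delete_forward("cz\u0327", 1))

# ko kai + sara u + mai ek, caret at the end: only the mai ek is removed.
print(backspace("\u0e01\u0e38\u0e48", 3))
```

With the caret restricted to cluster boundaries, neither operation can bring a composing character next to a different base, which matches the claim that only typing a composing character can require normalization.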
Received on Thursday, 24 April 2003 09:06:57 UTC
