- From: James Clark <jjc@jclark.com>
- Date: Thu, 24 Apr 2003 18:07:49 +0700
- To: www-i18n-comments@w3.org
Thailand has been using a policy of what amounts to early uniform normalization for some years. See http://linux.thai.net/thep/th-xim/ for some background. I think there are some important aspects of how this has been implemented that are inconsistent with the "Responsibility for Normalization" section of the Character Model WD, and some important aspects that are not mentioned.

One important aspect is the handling of suspect text. Let's consider a concrete example:

(1) ko kai + sara u + mai ek
(2) ko kai + mai ek + sara u

(1) is a single grapheme cluster that will be displayed in a single cell. (1) is the NFC of (2). Thai-aware editing software would never create (2). The interesting point is that there is a WTT rule about how (2) should be displayed: it says the sara u should be displayed in a separate cell (e.g. below a dotted circle), i.e. (2) should not be displayed the same as (1). This is different from the normal Unicode behaviour as I understand it: Unicode says (1) and (2) are equivalent and should display the same.

In the context of a policy of early uniform normalization the Thai policy seems to me to make a lot of sense: the general principle is that non-normalized text should not be displayed like its normalized equivalent. This means that if early uniform normalization has gone wrong and unnormalized text has crept in, the user can easily see it (without having to resort to od!).

This seems not quite the same as the recommendation of the character model draft:

"A text-processing component that receives suspect text MUST NOT perform any normalization-sensitive operations unless it has first confirmed through inspection that the text is in normalized form".

Suppose a text editor loaded a file containing (2) and displayed it per the WTT rules, and suppose the user positioned the caret after the sara u and hit backspace. I think the right thing for the text editor to do is to remove the sara u.
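The concrete example can be checked mechanically. The following is only a minimal sketch using Python's standard unicodedata module; the helper names is_nfc and backspace are made up for illustration and do not come from WTT or the WD. It confirms that (1) is the NFC of (2), and shows a backspace that removes just the sara u from (2) without normalizing anything else.

```python
import unicodedata

KO_KAI = "\u0e01"  # THAI CHARACTER KO KAI (base consonant)
SARA_U = "\u0e38"  # THAI CHARACTER SARA U (canonical combining class 103)
MAI_EK = "\u0e48"  # THAI CHARACTER MAI EK (canonical combining class 107)

seq1 = KO_KAI + SARA_U + MAI_EK  # sequence (1)
seq2 = KO_KAI + MAI_EK + SARA_U  # sequence (2)


def is_nfc(text):
    """Hypothetical helper: True if text is already in NFC."""
    return unicodedata.normalize("NFC", text) == text


def backspace(text):
    """Hypothetical helper: drop the last code point and leave the
    rest of the text exactly as it was (no normalization)."""
    return text[:-1]


print(is_nfc(seq1))                                 # (1) is normalized
print(is_nfc(seq2))                                 # (2) is not
print(unicodedata.normalize("NFC", seq2) == seq1)   # NFC of (2) is (1)
print(backspace(seq2) == KO_KAI + MAI_EK)           # only the sara u goes
```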
But this seems to be in conflict with the rule in the WD since deleting the previous character is a normalization-sensitive operation.

I think the second part of the rule is right: don't normalize the text (unless the user explicitly asks for it). I'm not exactly sure how to rephrase the first part. How about something like this?

- If normalization is a prerequisite for an operation, then don't perform the operation unless it ... is in normalization form
- If the operation is meaningful on both the normalized and unnormalized form then perform it on the text as it is: don't automatically normalize it first.
- Display the unnormalized form so that it is distinguishable from the normalized form.

There are also some subtleties in the normal UI behaviour of Thai text editors that work to support early uniform normalization. This is related to the c + z + cedilla example. In the context of a user interface, I think this is going to cause some puzzlement: if a text editor contains this string and the caret is between the c and the z and the user hits delete, I think it would be very surprising to a user if the result was a single c-cedilla, because this would mean that the character preceding the caret had changed. This puzzlement can be avoided by having the delete key delete the entire grapheme cluster (i.e., the z together with the cedilla).

The way Thai editors work is as follows:

- the caret can never be positioned in between a base and a composing character
- delete removes the entire immediately following grapheme cluster
- backspace removes the immediately preceding character; there's no need for it to delete the entire preceding grapheme cluster (indeed it's annoying to users if the entire grapheme cluster is deleted)

With these rules, the only editing operation that can require normalization is when the user types a composing character. These rules make it easy and natural for a user to edit text while preserving normalization. Windows XP (i.e.
Uniscribe) appears to implement these rules (not just for Thai but for combining characters in general), but lots of other software doesn't get this right.

I think some hints about this in the WD would be useful.

James
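As a rough illustration of the three editing rules listed above, here is a small Python sketch. It is not taken from any Thai editor or from Uniscribe. The function names are invented, and a grapheme cluster is approximated as one code point followed by any run of code points whose general category starts with "M", which is a simplifying assumption and much cruder than full Unicode text segmentation.

```python
import unicodedata


def is_mark(ch):
    """Assumption: treat any general category M* as a composing character."""
    return unicodedata.category(ch).startswith("M")


def can_place_caret(text, pos):
    """Rule 1: the caret may not sit between a base and a composing
    character, i.e. directly before a mark inside the text."""
    return pos <= 0 or pos >= len(text) or not is_mark(text[pos])


def delete_forward(text, pos):
    """Rule 2: delete removes the whole cluster following the caret."""
    if pos >= len(text):
        return text
    end = pos + 1
    while end < len(text) and is_mark(text[end]):
        end += 1  # swallow the marks attached to the deleted base
    return text[:pos] + text[end:]


def backspace(text, pos):
    """Rule 3: backspace removes only the single preceding code point.
    Returns the new text and the new caret position."""
    if pos <= 0:
        return text, pos
    return text[:pos - 1] + text[pos:], pos - 1


s = "cz\u0327"  # c + z + COMBINING CEDILLA
print(can_place_caret(s, 2))    # caret between z and cedilla is refused
print(delete_forward(s, 1))     # caret between c and z: z and cedilla go
print(backspace(s, 3))          # caret at end: only the cedilla goes
```

With delete removing the z together with its cedilla, the c before the caret is left unchanged, so the surprising c-cedilla result never arises.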
Received on Thursday, 24 April 2003 07:08:59 UTC