- From: Martin Duerst <duerst@w3.org>
- Date: Thu, 24 Apr 2003 08:53:37 -0400
- To: w3c-i18n-ig@w3.org
- Cc: www-i18n-comments@w3.org, James Clark <jjc@jclark.com>
Forwarded from www-i18n-comments. I have kept James and
www-i18n-comments in the loop to facilitate followup discussion.
I'll also put this mail on the agenda of next week's teleconf.

Regards,    Martin.

>Date: Thu, 24 Apr 2003 18:07:49 +0700
>From: James Clark <jjc@jclark.com>
>To: www-i18n-comments@w3.org
>Subject: Early uniform normalization: experience from Thailand
>X-Archived-At: http://www.w3.org/mid/3EA7C585.90600@jclark.com
>List-Id: <www-i18n-comments.w3.org>
>
>Thailand has been using a policy of what amounts to early uniform
>normalization for some years. See
>
>  http://linux.thai.net/thep/th-xim/
>
>for some background.
>
>I think there are some important aspects of how this has been implemented
>that are inconsistent with the "Responsibility for Normalization" section
>of the Character Model WD, and some important aspects that are not mentioned.
>
>One important aspect is the handling of suspect text. Let's consider a
>concrete example:
>
>(1) ko kai + sara u + mai ek
>(2) ko kai + mai ek + sara u
>
>(1) is a single grapheme cluster that will be displayed in a single cell.
>(1) is the NFC of (2). Thai aware editing software would never create
>(2). The interesting point is there is a WTT rule about how (2) should be
>displayed: it says the sara u should be displayed in a separate cell (e.g.
>below a dotted circle), i.e. (2) should not be displayed the same as
>(1). This is different from the normal Unicode behaviour as I understand:
>Unicode says (1) and (2) are equivalent and should display the same. In
>the context of a policy of early uniform normalization the Thai policy
>seems to me to make a lot of sense: the general principle is that
>non-normalized text should not be displayed like its normalized
>equivalent. This means that if early uniform normalization has gone wrong
>and unnormalized text has crept in, the user can easily see it (without
>having to resort to od!).
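
[Editorial illustration, not part of the original message.] The relationship between sequences (1) and (2) can be checked by inspection, without rewriting the text. The following is a minimal Python sketch using the standard unicodedata module; the helper names is_nfc and codepoints are made up for this example, and the code points are the standard Unicode assignments for ko kai, sara u and mai ek.

```python
import unicodedata

KO_KAI = "\u0E01"  # THAI CHARACTER KO KAI
SARA_U = "\u0E38"  # THAI CHARACTER SARA U
MAI_EK = "\u0E48"  # THAI CHARACTER MAI EK

seq1 = KO_KAI + SARA_U + MAI_EK  # sequence (1) from the message
seq2 = KO_KAI + MAI_EK + SARA_U  # sequence (2) from the message


def is_nfc(text):
    """Inspect only: report whether text is already in NFC."""
    return unicodedata.normalize("NFC", text) == text


def codepoints(text):
    """Hypothetical helper: list code points, a readable stand-in for od."""
    return " ".join("U+%04X" % ord(ch) for ch in text)


for label, seq in (("(1)", seq1), ("(2)", seq2)):
    print(label, codepoints(seq), "NFC" if is_nfc(seq) else "not NFC")

# Canonical reordering sorts the two marks by combining class,
# so normalizing (2) yields (1).
print(unicodedata.normalize("NFC", seq2) == seq1)
```

A component following the Thai policy described above would use a check of this kind to decide how to display the text, not to rewrite it silently.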
>
>This seems not quite the same as the recommendation of the character model
>draft: "A text-processing component that receives suspect text MUST NOT
>perform any normalization-sensitive operations unless it has first
>confirmed through inspection that the text is in normalized
>form". Suppose a text editor loaded a file containing (2) and displayed
>it per the WTT rules, and suppose the user positioned the caret after the
>sara u and hit backspace. I think the right thing for the text editor
>to do is to remove the sara u. But this seems to be in conflict with the
>rule in the WD since deleting the previous character is a
>normalization-sensitive operation.
>
>I think the second part of the rule is right: don't normalize the text
>(unless the user explicitly asks for it). I'm not exactly sure how to
>rephrase the first part. How about something like this?
>
>- If normalization is a prerequisite for an operation, then don't perform
>the operation unless it ... is in normalization form
>
>- If the operation is meaningful on both the normalized and
>unnormalized form then perform it on the text as it is: don't
>automatically normalize it first.
>
>- Display the unnormalized form so that it is distinguishable from the
>normalized form.
>
>There are also some subtleties in the normal UI behaviour of Thai text
>editors that work to support early uniform normalization. This is related
>to the c + z + cedilla example. In the context of a user interface, I
>think this is going to cause some puzzlement: if a text editor contains
>this string and the caret is between the c and the z and the user hits
>delete, I think it would be very surprising to a user if the result was a
>single c-cedilla, because this would mean that the character preceding the
>caret had changed. This puzzlement can be avoided by having the delete
>key delete the entire grapheme cluster (i.e., the z together with the
>cedilla).
>The way Thai editors work is as follows:
>
>- the caret can never be positioned in between a base and a composing
>character
>
>- delete removes the entire immediately following grapheme cluster
>
>- backspace removes the immediately preceding character; there's no need
>for it to delete the entire preceding grapheme cluster (indeed it's
>annoying to users if the entire grapheme cluster is deleted)
>
>With these rules, the only editing operation that can require
>normalization is when the user types a composing character. These rules
>make it easy and natural for a user to edit text while preserving
>normalization. Windows XP (i.e. Uniscribe) appears to implement these
>rules (not just for Thai but for combining characters in general), but
>lots of other software doesn't get this right. I think some hints about
>this in the WD would be useful.
>
>James
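
[Editorial illustration, not part of the original message.] The three editing rules in James's message, applied to the c + z + cedilla example, might be sketched in Python as below. This is only a sketch under stated assumptions: the function names are invented for this example, and a "composing character" is approximated as any character whose Unicode general category starts with M, which is cruder than real grapheme cluster segmentation.

```python
import unicodedata


def is_mark(ch):
    # Assumption: every general-category M* character counts as composing.
    return unicodedata.category(ch).startswith("M")


def caret_allowed(text, caret):
    """Rule 1: no caret position between a base and a composing character."""
    return not (0 < caret < len(text) and is_mark(text[caret]))


def delete_forward(text, caret):
    """Rule 2: delete removes the whole cluster that follows the caret."""
    if caret >= len(text):
        return text
    end = caret + 1
    while end < len(text) and is_mark(text[end]):
        end += 1
    return text[:caret] + text[end:]


def backspace(text, caret):
    """Rule 3: backspace removes only the single preceding character."""
    if caret == 0:
        return text, 0
    return text[:caret - 1] + text[caret:], caret - 1


text = "cz\u0327"  # c + z + COMBINING CEDILLA

# Deleting only the z leaves c + cedilla, which NFC composes into a
# single c-cedilla: the character before the caret appears to change.
naive = unicodedata.normalize("NFC", text[:1] + text[2:])
print(ascii(naive))

# Deleting the whole following cluster leaves the c untouched.
print(ascii(delete_forward(text, 1)))

# Backspace at the end of the text removes just the cedilla.
print(backspace(text, 3))
```

With these three operations, existing text can only lose a trailing composing character or a whole cluster, which matches the observation in the message that only typing a composing character can require normalization.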
Received on Thursday, 24 April 2003 09:06:57 UTC