Re: Unicode Normalization

Hi fantasai,

On Feb 4, 2009, at 4:13 PM, fantasai wrote:

> Robert J Burns wrote:
>> However Unicode has a SHOULD requirement that two canonically  
>> equivalent but codepoint differing strings match. Unicode's Chapter  
>> 3 (C6 norm) says:
>>> A process shall not assume that the interpretations of two  
>>> canonical-equivalent character sequences are distinct.
>
> That's a MUST requirement. SHALL == MUST, see RFC2119.

You're right, thanks for pointing that out. I don't often encounter
'shall' and incorrectly assumed it mapped to 'should'. However, my
other arguments remain, since this is quite convoluted prose in the
Unicode Standard that requires one to understand the reason
canonically equivalent characters exist in the first place (i.e., why
separate characters have been designated canonical equivalents). This
norm really then creates a requirement that UAs not make assumptions
that would undermine the proper treatment of canonically equivalent
character sequences. So it is still not a MUST that canonically
equivalent character sequences be treated properly (which I wish it
were), but a MUST that UAs not get in the way of such proper treatment.
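
To make the distinction concrete, here is a minimal sketch in Python
using the standard unicodedata module. The helper name
canonically_equal is hypothetical, and comparing after NFC
normalization is just one way a process could avoid treating
equivalent sequences as distinct; it is not something C6 itself
prescribes.

```python
import unicodedata

def canonically_equal(a: str, b: str) -> bool:
    # Hypothetical helper: compare two strings after normalizing both to NFC,
    # so canonically equivalent sequences compare as equal.
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

precomposed = "\u00e9"   # LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # 'e' followed by COMBINING ACUTE ACCENT

# A plain code point comparison treats the two sequences as different.
naive_match = precomposed == decomposed

# A normalization-aware comparison treats them as the same text.
canonical_match = canonically_equal(precomposed, decomposed)
```

A UA that matches strings with the naive comparison is making exactly
the assumption the conformance clause warns about.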

The performance issues raised by Henri could actually work in our
favor. That is, if we (avoiding the bike shed debates) simply pick
NFC as the W3C-endorsed normalization form for authoring, then we can
require that UAs normalize to NFC. Then the performance hits are the
responsibility of the authors themselves who go against
recommendations and produce NFD or non-normalized content. Confirming
that content is already in NFC upon parsing is not a performance hit
worth discussing. If any performance hit arises, it would be due to
needing to rearrange combining characters and replace them with their
canonical equivalents.
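
A rough sketch of that check-first approach, again in Python (the
function name ensure_nfc is hypothetical, and
unicodedata.is_normalized needs Python 3.8 or later). It only
illustrates the idea; it is not a claim about how any UA's parser
works.

```python
import unicodedata

def ensure_nfc(text: str) -> str:
    # Hypothetical helper: first confirm whether the text is already NFC.
    # Content authored in NFC is returned untouched.
    if unicodedata.is_normalized("NFC", text):
        return text
    # Only content that is not in NFC pays for reordering combining marks
    # and replacing sequences with their canonical equivalents.
    return unicodedata.normalize("NFC", text)

already_nfc = "caf\u00e9"        # precomposed e-acute, already NFC
authored_nfd = "cafe\u0301"      # 'e' + combining acute, not NFC

unchanged = ensure_nfc(already_nfc)
converted = ensure_nfc(authored_nfd)

# Two orderings of the same combining marks (acute above, dot below)
# end up as the same string once normalized.
order_one = ensure_nfc("a\u0301\u0323")
order_two = ensure_nfc("a\u0323\u0301")
```

Content that follows the recommendation takes the first branch and is
returned as-is; only NFD or non-normalized content reaches the second.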

Take care,
Rob

Received on Wednesday, 4 February 2009 22:35:41 UTC