
Re: [CSS21][css3-namespace][css3-page][css3-selectors][css3-content] Unicode Normalization

From: Robert J Burns <rob@robburns.com>
Date: Mon, 2 Feb 2009 21:15:37 -0600
Message-Id: <A56A94A3-2E0D-4946-866C-D262FCA7A85E@robburns.com>
To: public-i18n-core@w3.org
Regarding Andrew Cunningham's message[1], which I encountered while  
reading the archives:

Hi Andrew et al,

I think it's important to take a step back, look at normalization  
from a broader perspective, and understand its intent.  
First, characters within NFC/NFD-normalized strings are intended  
within Unicode to be equivalent characters. That is, the intent of  
Unicode is that:
  1)  any two canonically equivalent strings normalized to one of  
those forms (either one, it doesn't matter which) will match byte for byte;
  2)  authors not use two canonically equivalent characters (or  
character sequences) in a semantically distinct way.
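To make point (1) concrete, here's a quick sketch using Python's  
standard unicodedata module (my own illustration, not anything from  
the specs under discussion):

```python
import unicodedata

# 'e' with acute: one precomposed code point vs. base letter + combining mark
precomposed = "\u00E9"    # U+00E9 LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"    # U+0065 + U+0301 COMBINING ACUTE ACCENT

# The raw strings differ code point for code point...
assert precomposed != decomposed

# ...but once both are normalized to the same form (NFC or NFD,
# it doesn't matter which), they match byte for byte.
assert unicodedata.normalize("NFC", precomposed) == unicodedata.normalize("NFC", decomposed)
assert unicodedata.normalize("NFD", precomposed) == unicodedata.normalize("NFD", decomposed)
```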

The reason for NOT normalizing (aside from the performance gains, the  
just-being-lazy reason :-)) is to support cases where authors have  
mistakenly treated two canonically equivalent characters as  
semantically distinct (again, that is a mistake for an author to  
make). So the prevalence of Unicode UAs that do not normalize (to one  
form or the other, NFC or NFD) tends to increase the likelihood that  
authors will make that mistake. In addition, the prominence of poorly  
designed UIs for input systems also increases the likelihood that  
authors will make such a mistake.

So this gets back to the recurring issue of strict error handling. If  
the W3C can take a stand and insist on normalizing strings as much as  
possible, this will take care of one part of the author-error problem.  
The other issue, better GUI input methods, is not really something the  
W3C is in a position to address (at least to my knowledge). However, I  
think at the least the W3C could promote (in CSS and elsewhere):

  1) Requiring implementations to perform canonical normalization of  
non-character tokens on parsing (NFC is the general thrust now, so  
there's nothing much to be gained by trying to reverse that)
  2) Requiring implementations to produce NFC for newly produced  
content, especially for markup as opposed to content (which is  
consistent with NFC normalization on parsing, for editors that  
involve parsing)
  3) Possibly doing the same as items (1) and (2) even for content
  4) For characters with the property NFC_Quick_Check=No (and perhaps  
also the 102 characters with NFC_Quick_Check=Maybe whenever those are  
used in a non-NFC-normalized string):
     * Prohibit their use in markup (as opposed to content)
     * Discourage their use in content
  5) For characters with the property NFKC_Quick_Check=No (and  
perhaps also the 102 characters with NFKC_Quick_Check=Maybe whenever  
those are used in a non-NFKC-normalized string):
     * Prohibit their use in markup (as opposed to content)
     * Discourage their use in content
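As a rough illustration of the Quick_Check idea in items (4) and (5):  
Python (3.8+) exposes unicodedata.is_normalized(), which implements  
the quick-check optimization internally (this is my example, and the  
specific characters chosen are my own picks):

```python
import unicodedata

# U+0301 COMBINING ACUTE ACCENT has NFC_Quick_Check=Maybe: by itself it
# proves nothing, but following 'e' it makes the string non-NFC.
assert not unicodedata.is_normalized("NFC", "e\u0301")
assert unicodedata.is_normalized("NFC", "\u00E9")

# U+0344 COMBINING GREEK DIALYTIKA TONOS has NFC_Quick_Check=No: its
# decomposition (U+0308 U+0301) is never recomposed, so any string
# containing it is guaranteed not to be in NFC.
assert not unicodedata.is_normalized("NFC", "\u03B9\u0344")
```

A UA enforcing rule (4) could reject markup names that fail such a check.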

In general, I think these rules should be used throughout all W3C  
recommendations. I think we're still early enough in the Unicode  
adoption process to address the situation in a correct manner. In  
other words, adopting these rules now would cause little pain for  
existing content, but any delay will lead to more trouble for  
content producers.

Boris has raised the issue of where these problems arise from. It's  
helpful to understand that in addressing the issue of Unicode  
normalization too. Three competing goals of Unicode lead to the  
normalization confusion:

  1) the desire to have separate diacritical and other combining marks  
that can be combined in any way with any base character
  2) the desire to have more compressed encodings, where common  
composed characters might be represented by a single code point rather  
than multiple code points (there was also a strong tendency toward  
precomposed characters, because simple text rendering did not require  
OpenType-like property tables, and because of a preponderance of text  
systems that used precomposed characters)
  3) the desire to provide a code point allowing round-trip mapping of  
every other existing encoding into Unicode/UCS

Without the second and third goals, no normalization form C or KC  
would be possible: only base characters would have been assigned in  
the UCS. However, with those two goals, many (but not all)  
precomposed characters have been assigned within Unicode/UCS. Earlier  
Addison said:

> NFC does not guarantee that no combining marks appear in the text.  
> Applying NFC only means that any combining marks that can be  
> combined with their base characters are, in fact, combined.

That is not quite true either. Due to Unicode's normalization  
stabilization policy (and perhaps other reasons I'm not aware of),  
some combining marks that have equivalent precomposed forms do in  
fact remain in NFC. So folding issues remain even for normalized  
forms (though far fewer). For example, the sequence U+1D157 U+1D165  
is expressible as the precomposed U+1D15E, yet U+1D15E cannot appear  
in an NFC string; the sequence U+1D157 U+1D165 appears instead.
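This particular case is easy to check with Python's unicodedata  
(again, my own sketch): U+1D15E MUSICAL SYMBOL HALF NOTE is on  
Unicode's composition-exclusions list, so NFC never recomposes it.

```python
import unicodedata

# U+1D15E canonically decomposes to U+1D157 + U+1D165 and is a
# composition exclusion, so even NFC leaves it decomposed: the NFC
# form of the precomposed character is the two-code-point sequence.
half_note = "\U0001D15E"
assert unicodedata.normalize("NFC", half_note) == "\U0001D157\U0001D165"

# For this character, NFC and NFD agree.
assert unicodedata.normalize("NFC", half_note) == unicodedata.normalize("NFD", half_note)
```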

So normalization form C does not guarantee that everything is  
composed. However, either normalization form (NFC or NFD) does  
guarantee that canonically equivalent strings are represented by the  
same code points, so comparisons can be made between two or more  
strings. String length in terms of grapheme cluster boundaries is not  
dealt with, however, except by following the more complex grapheme  
cluster boundary algorithms[2]. These might be needed, for example, in  
validation where an input value must remain less than 4 characters  
long, where "character" probably means grapheme clusters rather than  
code points or octets.

Normalization also does not quite guarantee that strings that are  
rendered the same are represented by the same code points. Unicode had  
a (since withdrawn) Technical Report[3] that might have dealt better  
with that issue. However, the recommendations I listed above would  
help, combined with a recommendation to be careful about mixing  
characters from different scripts within any single markup  
specification: including any NAME, NMTOKEN, attribute value, or other  
markup (for example, don't introduce a semantically distinct element  
in the HTML namespace and name it 'Α' (U+0391) alongside 'A' (U+0041)).
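To underline why the cross-script caution matters: no normalization  
form, canonical or compatibility, folds such lookalikes together  
(a quick check of my own):

```python
import unicodedata

latin_a = "A"           # U+0041 LATIN CAPITAL LETTER A
greek_alpha = "\u0391"  # U+0391 GREEK CAPITAL LETTER ALPHA

# Visually near-identical in most fonts, yet distinct under every
# normalization form, so normalization cannot catch this confusion.
for form in ("NFC", "NFD", "NFKC", "NFKD"):
    assert unicodedata.normalize(form, latin_a) != unicodedata.normalize(form, greek_alpha)
```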

So I don't think NFC has any strong advantages over NFD (or the  
advantages are often over-estimated). However, it is important to pick  
one normalization form or the other, and to be aware (and make authors  
and implementors aware) of the limited benefits of normalization and  
of the issues that still remain even after normalization. I also think  
handling this at parsing would be the best way to go. Unicode has now  
become strict about new precomposed character assignments (which in  
itself introduces a Western bias, since it's minority languages that  
are left to be allocated and that will now NOT be assigned precomposed  
characters, for whatever limited benefits precomposed characters  
bring). So one could say that the preferred approach of Unicode is  
decomposed characters, while the preferred approach of the W3C appears  
to be NFC. Given those somewhat bikeshed-like disagreements, it is  
probably better to handle canonical normalization in the parser than  
to expect everyone to always stick with NFC in content creation.

Take care,
Rob

[1]: <http://lists.w3.org/Archives/Public/public-i18n-core/2009JanMar/0116.html>
[2]: See <http://www.unicode.org/reports/tr29/> and <http://www.unicode.org/reports/tr14/>
[3]: <http://www.unicode.org/reports/tr30/tr30-4.html>
Received on Tuesday, 3 February 2009 03:16:21 GMT
