
Re: Unicode Normalization thread should slow down; summary needed

From: Robert J Burns <rob@robburns.com>
Date: Tue, 10 Feb 2009 11:00:40 -0600
Cc: public-i18n-core@w3.org, W3C Style List <www-style@w3.org>
Message-Id: <9B302B51-7A64-4AE6-A417-305FD5DC5FC8@robburns.com>
To: Henri Sivonen <hsivonen@iki.fi>
Hi Henri,

On Feb 10, 2009, at 6:44 AM, Henri Sivonen wrote:

> On Feb 10, 2009, at 12:47, Robert J Burns wrote:
>> I originally liked the statement of the problem Henri composed and
>> added to the wiki page. However, the latest edits that remove L.
>> David Baron's problem statement actually make the piece almost
>> impossible to follow. I know what it's supposed to say and I have
>> trouble following it, so I think an uninitiated reader will not
>> have a clue what the issue is about.
> I didn't remove the original problem statement. I *moved* it to the  
> *linked* page that approaches the issue from the Unicode technical  
> background point of view as opposed from the point of view of what  
> authors should be able to do.
> I've added a concrete Vietnamese example and noted that the other  
> case needs a concrete example. I also added a quick explanation of  
> normalization forms using the Vietnamese letter as an example.
> (It seems that the Vietnamese input mode on Mac OS X normalizes to  
> NFC, by the way. In fact, I wouldn't be at all surprised if Mac OS X  
> already had solution #1 covered and this was just an issue of other  
> systems catching up.)

Having the example back helps dramatically. However, you've taken the  
issue and boiled it down to the solved portion, ignoring what the  
thrust of the thread was about. I'm happy that keyboard input is  
normalized. However, I knew that all along, so that wasn't even the  
discussion I thought I was having.

>> Even before these latest changes the page needed more clarification
>> added. For example, some of the solutions are difficult to
>> differentiate: #1 and #4 (originally #3), for instance.
> What's difficult to differentiate between #1 and #4? Solution #1 is  
> requiring input methods to normalize. Solution #4 is creating a new  
> encoding name utf-8-nfc for allowing authors to opt in to consumer- 
> side normalization on the encoding decoder layer.

Well, now they're #2 and #5 (since you've inserted a new #1 before the  
previous list). I gave them their summary descriptions from L. David  
Baron's original email, so it won't necessarily help to repeat them  
here, but just in case: "Require authoring applications to normalize  
text" and "Require all text in a consistent normalization on the level  
of document conformance".

>> In any event the latest changes have made the page seem completely  
>> unconnected to the discussions on the list serve.
> I gathered that the point of moving to the wiki was not to avoid  
> bringing it all to the list serve.

Yes, I gathered the same thing. But if someone unilaterally changes  
the entire page to focus on a completely separate, already-solved  
issue, that gets in the way of that approach.

While most keyboard input methods could be designed to emit only  
canonically ordered character sequences for identifiers, the problem  
is that characters reach a document by all sorts of means, not just  
keyboards: pasting, character palettes, and keyboard input among them.  
An identifier might begin its life as an innocent copy and paste from  
the document content by the identifier's original author. Subsequent  
authors may then try to match that identifier through keyboard or  
character-palette input, perhaps unsuccessfully due to differing  
compositions and orderings. So this is, in particular, a canonical  
normalization problem (though Henri has attempted, I'm afraid  
unsuccessfully, to restate it in terms of keyboard input alone).
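To make the copy-and-paste scenario concrete, here is a small illustration of my own (not from the thread), using the Vietnamese letter ế that Henri's wiki example mentions; the identifier names are hypothetical:

```python
import unicodedata

# An author copies "ế" from document content typed on one system:
# a single precomposed code point, U+1EBF (this is NFC).
anchor_from_paste = "\u1ebf"

# A second author types the "same" letter on a system that emits the
# decomposed form: e + combining circumflex + combining acute (NFD).
anchor_from_keyboard = "e\u0302\u0301"

# Code-point-wise (and byte-wise) the two strings differ...
assert anchor_from_paste != anchor_from_keyboard

# ...yet they are canonically equivalent: normalizing either one to
# the same form makes the comparison succeed.
assert (unicodedata.normalize("NFC", anchor_from_keyboard)
        == anchor_from_paste)
```

Without a normalization step somewhere in the pipeline, the second author's reference simply fails to match, and nothing on screen hints at why.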

My concern is not the prevalence of such problems. My main concern is  
that they are particularly nefarious and particularly difficult to  
track down. So while 1) adding a very weak normalization algorithm to  
parsers and 2) adding validation or well-formedness conditions to  
conformance checkers is some work for the makers of those tools, it  
has the potential to eliminate major headaches for authors who  
encounter these hard-to-diagnose problems. To a user or author, the  
order of characters with differing canonical combining classes is  
meaningless. However, for byte-wise comparison to work, they must be  
in the same order in both strings. This is central to the way Unicode  
works.
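Here is a sketch of my own showing what "meaningless order" means in practice. U+0323 COMBINING DOT BELOW has canonical combining class 220, while U+0307 COMBINING DOT ABOVE has class 230; either input order renders identically, and canonical ordering resolves them to one sequence:

```python
import unicodedata

# Two input orders for q + dot below + dot above. The marks have
# different combining classes (220 and 230), so their relative order
# carries no meaning for the reader.
a = "q\u0323\u0307"  # dot below typed first
b = "q\u0307\u0323"  # dot above typed first

# Byte-wise the strings differ...
assert a != b

# ...but canonical ordering (applied by any normalization form) sorts
# marks of differing classes into a single, comparable sequence.
assert unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

# The canonical order places the lower class (220) first:
assert unicodedata.normalize("NFD", b) == "q\u0323\u0307"
```

So any tool that compares strings byte-for-byte without normalizing first is relying on every producer having happened to emit the same order.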

The fantastic thing about Unicode is that it takes the thousands of  
languages in the world, and the various encodings that each  
represented only a small subset of those languages, and abstracts it  
all in a way that makes it relatively simple (or even possible) to  
implement a text system that works equally well in all of those  
thousands of languages simultaneously. However, it requires that  
Unicode processes deal with a few edge cases (e.g., language-dependent  
collation, grapheme extenders, bidirectional text, and normalization).  
The response from some vendors then appears to be: great, now my  
software is international, but why should I have to deal with these  
few nagging edge cases? Dealing with those few edge cases (basically  
the 21 named algorithms) is a drop in the sea compared to what would  
otherwise be involved in implementing separate text processing systems  
and separate encodings for hundreds or thousands of languages.

So the great flexibility of combining marks and other grapheme  
extenders in Unicode implies that the canonical equivalence of strings  
must be dealt with. Can an implementation skip correct normalization  
and gain a few milliseconds of bragging rights over the competition?  
Certainly. But that implementation has basically not implemented  
Unicode. An XML implementation could likewise accept only ASCII  
characters and probably outshine everything else in performance.  
Imagine the performance gains of dealing only with fixed-width octets  
instead of UTF-8; even grapheme cluster counts could be done  
byte-wise. But that too is not a Unicode implementation.

The great accomplishment of Unicode, its abstraction of all known  
writing systems, requires some extra processor use (over ASCII, for  
example). Using one script for many languages implies that byte-wise  
collation can no longer be used. Mixing multiple languages in one  
document, where some are written left-to-right and others  
right-to-left, implies that bidirectionality must be dealt with. But  
once these few abstractions are handled, we have a robust text system  
capable of expressing the writing of every written language ever  
conceived (as far as we know). Such a robust text system is an  
essential building block of almost everything else an application  
might do (few applications get away without some text) and an  
important building block for the World Wide Web.

I know some of you are involved in the creation of Unicode, so in  
those cases I'm probably preaching to the choir. However, there is a  
sentiment floating around these discussions that doing Unicode right  
just isn't worth the trouble: let someone else fix it; let authors  
deal with these problems. I just don't think that is appropriate.  
After spending a lot of time looking into Unicode issues, including  
normalization, and listening to differing points of view, I really  
don't see any solution to normalization other than making it a  
central part of text processing (speaking only of canonical  
non-singleton normalization). Such normalization is a central part of  
the Unicode processing model.

The hit in an HTML or XML parser of confirming each character as  
normalized upon parsing is a very minor performance concern (and even  
if it weren't minor, it is a part of Unicode processing). When XML  
parsing is celebrated over text/html parsing, most of the same people  
tell us "so what, parsing is surrounded by performance bottlenecks  
anyway, so parsing performance doesn't matter". Yet when confronted  
with the prospect of confirming that each character is not a member  
of a set of a couple hundred characters, we're told this would cause  
a performance hit so enormous it would be unacceptable to users. I  
find that completely non-credible. Due to the flexibility of Unicode  
grapheme extension, byte-wise string comparison simply isn't possible  
(without normalization, anyway). That suggests that, over the life of  
Unicode's development, a design decision was made to spend some small  
portion of a thousand-fold increase in processing power to facilitate  
more flexible text processing. We can't now say: well, that's great,  
but we'd rather use those milliseconds somewhere else. We're already  
committed to proper string comparison in terms of canonically ordered  
combining marks (and precomposed vs. decomposed characters).
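As a sketch of how cheap such parser-side confirmation can be (my own illustration, assuming Python 3.8+, where `unicodedata` exposes the Unicode quick-check directly):

```python
import unicodedata

def check_normalized(text: str) -> bool:
    """Confirm that text is already in NFC — the kind of single-pass,
    per-character check a parser or conformance checker could run.
    Uses the stdlib wrapper over the Unicode NFC quick-check."""
    return unicodedata.is_normalized("NFC", text)

# Precomposed é (U+00E9) is already NFC; decomposed e + combining
# acute (U+0301) is not.
assert check_normalized("caf\u00e9")
assert not check_normalized("cafe\u0301")
```

The check only has to notice characters drawn from the relatively small set with nonzero combining class or a composition exclusion status; for the overwhelmingly common all-ASCII or already-composed runs it degenerates to a table lookup per character.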

Of course, this shouldn't be the responsibility of parser  
implementations alone. Input systems should handle normalization  
(even broader normalization than non-singleton canonical  
normalization). Font implementors should be considering  
normalization. Normalization should take place early and often, at  
least in terms of non-singleton canonical normalization. I don't  
understand how the W3C could be considering a partial Unicode  
implementation as the basis for its

Take care,
Received on Tuesday, 10 February 2009 17:01:27 UTC
