- From: Robert J Burns <rob@robburns.com>
- Date: Tue, 10 Feb 2009 11:00:40 -0600
- To: Henri Sivonen <hsivonen@iki.fi>
- Cc: public-i18n-core@w3.org, W3C Style List <www-style@w3.org>
- Message-Id: <9B302B51-7A64-4AE6-A417-305FD5DC5FC8@robburns.com>
Hi Henri,

On Feb 10, 2009, at 6:44 AM, Henri Sivonen wrote:

> On Feb 10, 2009, at 12:47, Robert J Burns wrote:
>
>> I originally liked the statement of the problem Henri composed and
>> added to the wiki page. However, the latest edits that remove L.
>> David Baron's problem statement actually make the piece almost
>> impossible to follow. I know what it's supposed to say and I have
>> trouble following it, so I think an uninitiated reader will not
>> have a clue what the issue is about.
>
> I didn't remove the original problem statement. I *moved* it to the
> *linked* page that approaches the issue from the Unicode technical
> background point of view as opposed to the point of view of what
> authors should be able to do.
>
> I've added a concrete Vietnamese example and noted that the other
> case needs a concrete example. I also added a quick explanation of
> normalization forms using the Vietnamese letter as an example.
>
> (It seems that the Vietnamese input mode on Mac OS X normalizes to
> NFC, by the way. In fact, I wouldn't be at all surprised if Mac OS X
> already had solution #1 covered and this was just an issue of other
> systems catching up.)

Having the example back helps dramatically. However, you've taken the
issue and boiled it down to the solved portion, ignoring what the
thrust of the thread was about. I'm happy that keyboard input is
normalized. However, I knew that all along, so that wasn't even the
discussion I thought I was having.

>> Even before these latest changes the page needed more clarification
>> added. For example, some of the solutions are difficult to
>> differentiate: for example number #1 and #4 (originally #3).
>
> What's difficult to differentiate between #1 and #4? Solution #1 is
> requiring input methods to normalize. Solution #4 is creating a new
> encoding name utf-8-nfc for allowing authors to opt in to consumer-
> side normalization on the encoding decoder layer.

Well, now they're #2 and #5 (since you've inserted a new #1 before the
previous list). I gave them their summary descriptions from L. David
Baron's original email, so it won't necessarily help for me to repeat
them here, but just in case: "Require authoring applications to
normalize text" and "Require all text in a consistent normalization on
the level of document conformance".

>> In any event the latest changes have made the page seem completely
>> unconnected to the discussions on the listserv.
>
> I gathered that the point of moving to the wiki was not to avoid
> bringing it all to the listserv.

Yes, I gathered the same thing. But if someone unilaterally changes
the entire page to a completely separate and already solved issue,
that gets in the way of that approach.

While most keyboards might be able to be designed to limit the input
of identifiers to canonically ordered character sequences, the problem
is that characters might be input by all sorts of means (not just
keyboards), including pasting, the character palette, and keyboard
input. An identifier might begin its life as an innocent copy and
paste from the document content by the initial author of the
identifier. Other, subsequent authors may try to match the identifier
through keyboard input or character palette input (perhaps
unsuccessfully, due to differing compositions and orderings). So this
is in particular a canonical normalization problem (though Henri has
attempted, I'm afraid unsuccessfully, to restate it in terms of
keyboard input only). My concern is not the prevalence of such
problems.
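To make that scenario concrete, here is a minimal sketch (my own
illustration, not taken from the wiki page; it assumes Python's
standard unicodedata module, and the identifier values are made up) of
how the copy-and-paste author and the keyboard author end up with
identifiers that no byte-wise comparison will ever match:

```python
import unicodedata

# Author A copies an identifier out of document content where the
# Vietnamese letter ệ appears precomposed as U+1EC7.
id_from_copy_paste = "h\u1EC7"

# Author B later tries to match it; some input path produces the
# decomposed sequence e + COMBINING DOT BELOW (U+0323)
# + COMBINING CIRCUMFLEX ACCENT (U+0302).
id_from_keyboard = "he\u0323\u0302"

print(id_from_copy_paste == id_from_keyboard)            # False: bytes differ
print(unicodedata.normalize("NFC", id_from_copy_paste) ==
      unicodedata.normalize("NFC", id_from_keyboard))    # True: canonically equivalent
```

To the authors those are the same letter, so unless normalization
happens somewhere in the pipeline the match silently fails, and
nothing visible in the document points to why.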
My main concern is that such problems are particularly nefarious and
particularly difficult to track down. So while (1) adding a very weak
normalization algorithm to parsers and (2) adding validation or
well-formedness conditions to conformance checkers is some work for
the makers of those tools, it has the potential to eliminate major
headaches for authors who encounter these quite difficult to diagnose
problems.

For a user/author the order of characters of differing canonical
combining classes is meaningless. However, for byte-wise comparison to
work, they must be in the same order in both strings (a short sketch
of this appears below). This is central to the way Unicode works.

The fantastic thing about Unicode is that it takes the thousands of
languages in the world, and the various encodings used to represent
only a small subset of those languages, and abstracts it all in a way
that makes it relatively simple (or even possible) to implement a text
system that works equally well in all of those thousands of languages
simultaneously. However, it requires that Unicode processes deal with
a few edge cases (e.g., language-dependent collation, grapheme
extenders, bidirectional text, and normalization). The response from
some vendors then appears to be: great, now my software is
international, but why should I have to deal with these few nagging
edge cases? Dealing with those few edge cases (basically the 21 named
algorithms) is a drop in the sea compared to what would otherwise be
involved in implementing separate text processing systems and separate
encodings for hundreds or thousands of languages.

So the great flexibility of combining marks and other grapheme
extenders in Unicode implies that the canonical equivalence of strings
must be dealt with. Can an implementation get away with not
normalizing strings correctly and gain a few milliseconds of bragging
rights over the competition? Certainly. But that implementation has
basically not implemented Unicode. An XML implementation could also
accept only ASCII characters and probably outshine everything else in
performance. Imagine the performance gains of only having to deal with
fixed-width octets instead of UTF-8; even grapheme cluster counts
could then be done byte-wise. But that too is not a Unicode
implementation.

The great accomplishment of Unicode, its abstraction of all known
writing systems, requires some extra processor use (over ASCII, for
example). Using one script for many languages implies that byte-wise
collation can no longer be used. Mixing multiple languages in one
document, where some are written left-to-right and others
right-to-left, implies that issues of bidirectionality must be dealt
with. But once these few abstractions are dealt with, we have a robust
text system capable of expressing the writing of every written
language ever conceived (as far as we know). Such a robust text system
is an essential building block of almost everything else an
application might do (few applications get away without some text) and
an important building block for the World Wide Web.

I know some of you are involved in the creation of Unicode, so I'm
probably preaching to the choir in those cases. However, there is a
sentiment floating around these discussions that doing Unicode right
just isn't worth the trouble: let someone else fix it; let authors
deal with these problems. I just don't think that is appropriate.
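Here is the short sketch of the combining class point mentioned above
(again my own illustration, assuming Python's unicodedata module): two
code point orderings of the same marks that no user would ever
distinguish, yet which compare unequal until they are put into
canonical order.

```python
import unicodedata

# A dot below (combining class 220) and a circumflex (combining class
# 230) attached to the same base letter, entered in two different
# orders.
a = "e\u0323\u0302"   # dot below first
b = "e\u0302\u0323"   # circumflex first

print(a == b)                                    # False: code point order differs
print(unicodedata.normalize("NFC", a) ==
      unicodedata.normalize("NFC", b))           # True: canonical ordering resolves it
print([unicodedata.combining(ch) for ch in a])   # [0, 220, 230]
```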
After spending a lot of time looking into Unicode issues, including
normalization, and listening to differing points of view, I really
don't see another solution to normalization than making it a central
part of text processing (speaking only of canonical, non-singleton
normalization). Such normalization is a central part of the Unicode
processing model. The hit in an HTML or XML parser of confirming each
character as normalized upon parsing a document is a very minor
performance concern (and even if it weren't minor, it is a part of
Unicode processing).

When XML parsing is celebrated over text/html parsing, most of the
same people tell us "so what, parsing is surrounded by performance
bottlenecks anyway, so parsing performance doesn't matter". Yet when
confronted with the prospect of confirming that each character is not
a member of a set of a couple hundred characters, we're told that this
would cause a performance hit so enormous that it would be
unacceptable to users. I find that completely non-credible.

Due to the flexibility of Unicode grapheme extension, byte-wise string
comparison simply isn't possible (without normalization, anyway). That
suggests that over the life of Unicode's development a design decision
was made to use some small bit of the thousand-fold increase in
processing power to facilitate more flexible text processing. We can't
now say: well, that's great, but we'd rather use those milliseconds
somewhere else. We're already committed to proper string comparison in
terms of canonically ordered combining marks (and precomposed vs.
decomposed characters).

Of course this shouldn't only be the responsibility of parser
implementations. Input systems should handle normalization (even
broader normalization than non-singleton canonical normalization).
Font implementors should be considering normalization. Normalization
should take place early and often, at least in terms of non-singleton
canonical normalization. I don't understand how the W3C could be
considering a partial Unicode implementation as the basis for its
recommendations.
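For what it's worth, here is roughly what "confirming each character"
amounts to in practice, as a minimal sketch of my own (it assumes
Python 3.8+ for unicodedata.is_normalized, and the warning message is
hypothetical); a conformance checker or parser would do little more
than this per identifier or text run:

```python
import unicodedata

def warn_if_not_nfc(text: str, where: str) -> None:
    # Scan the text against the normalization properties. Text that is
    # already in NFC (typically the common case) passes straight
    # through; only decomposed or mis-ordered sequences are flagged.
    if not unicodedata.is_normalized("NFC", text):   # Python 3.8+
        print(f"warning: {where} is not in Unicode Normalization Form C")

warn_if_not_nfc("h\u1EC7", "class selector")          # precomposed: no warning
warn_if_not_nfc("he\u0323\u0302", "class selector")   # decomposed: flagged
```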
Take care,
Rob

Received on Tuesday, 10 February 2009 17:01:24 UTC