- From: Robert J Burns <rob@robburns.com>
- Date: Tue, 10 Feb 2009 11:00:40 -0600
- To: Henri Sivonen <hsivonen@iki.fi>
- Cc: public-i18n-core@w3.org, W3C Style List <www-style@w3.org>
- Message-Id: <9B302B51-7A64-4AE6-A417-305FD5DC5FC8@robburns.com>
Hi Henri,
On Feb 10, 2009, at 6:44 AM, Henri Sivonen wrote:
> On Feb 10, 2009, at 12:47, Robert J Burns wrote:
>
>> I originally liked the statement of the problem Henri composed and
>> added to the wiki page. However, the latest edits that remove L.
>> David Baron's problem statement actually make the piece almost
>> impossible to follow. I know what it's supposed to say and I have
>> trouble following it, so I think an uninitiated reader will not
>> have a clue what the issue is about.
>
> I didn't remove the original problem statement. I *moved* it to the
> *linked* page that approaches the issue from the Unicode technical
> background point of view, as opposed to the point of view of what
> authors should be able to do.
>
> I've added a concrete Vietnamese example and noted that the other
> case needs a concrete example. I also added a quick explanation of
> normalization forms using the Vietnamese letter as an example.
>
> (It seems that the Vietnamese input mode on Mac OS X normalizes to
> NFC, by the way. In fact, I wouldn't be at all surprised if Mac OS X
> already had solution #1 covered and this was just an issue of other
> systems catching up.)
Having the example back helps dramatically. However, you've taken the
issue and boiled it down to the solved portion, ignoring what the
thrust of the thread was about. I'm happy that keyboard input is
normalized. However, I knew that all along, so that wasn't even the
discussion I thought I was having.
>> Even before these latest changes the page needed more clarification
>> added. For example, some of the solutions are difficult to
>> differentiate: for example number #1 and #4 (originally #3).
>
> What's difficult to differentiate between #1 and #4? Solution #1 is
> requiring input methods to normalize. Solution #4 is creating a new
> encoding name utf-8-nfc for allowing authors to opt in to consumer-
> side normalization on the encoding decoder layer.
Well, now they're #2 and #5 (since you've inserted a new #1 before the
previous list). I gave them their summary descriptions from L. David
Baron's original email, so it won't necessarily help for me to repeat
them here, but just in case: "Require authoring applications to
normalize text" and "Require all text in a consistent normalization on
the level of document conformance".
>
>
>> In any event the latest changes have made the page seem completely
>> unconnected to the discussions on the list serve.
>
>
> I gathered that the point of moving to the wiki was not to avoid
> bringing it all to the list serve.
Yes, I gathered the same thing. But if someone unilaterally refocuses
the entire page on a completely separate and already-solved issue, then
that gets in the way of that approach.
While most keyboards might be designed to limit input to canonically
ordered character sequences, the problem is that the characters in an
identifier might be entered by all sorts of means, not just the
keyboard: pasting, the character palette, and ordinary keyboard input
among them. An identifier might begin its life as an innocent copy and
paste from the document content by the identifier's original author.
Subsequent authors may then try to match that identifier through
keyboard or character-palette input, perhaps unsuccessfully due to
differing compositions and orderings. So this is in particular a
canonical normalization problem (though Henri has attempted, I'm afraid
unsuccessfully, to restate it in terms of keyboard input only).
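To make that concrete, here is a minimal sketch in Python using the
standard unicodedata module (the particular Vietnamese letter is just my
illustration): the same letter can arrive precomposed from one input path
and decomposed from another, so a byte-wise comparison fails until both
sides are normalized.

```python
import unicodedata

# The Vietnamese letter ệ, as it might arrive from two different input paths.
typed  = "\u1EC7"          # precomposed: LATIN SMALL LETTER E WITH CIRCUMFLEX AND DOT BELOW
pasted = "e\u0323\u0302"   # decomposed: e + COMBINING DOT BELOW + COMBINING CIRCUMFLEX ACCENT

print(typed == pasted)     # False: the code point sequences (and bytes) differ

# Normalizing both sides to the same form makes the comparison behave
# the way the author expects.
print(unicodedata.normalize("NFC", typed) ==
      unicodedata.normalize("NFC", pasted))   # True: they are canonically equivalent
```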
My concern is not the prevalence of such problems. My main concern is
that such problems are particularly nefarious and particularly
difficult to track down. So while 1) adding a very weak normalization
algorithm to parsers and 2) adding validation or well-formedness
conditions to conformance checkers is some work for the makers of those
tools, it has the potential to eliminate major headaches for authors
who encounter these difficult-to-diagnose problems. For a user or
author, the order of characters of differing canonical combining
classes is meaningless. However, for byte-wise comparison to work, they
must be in the same order in both strings. This is central to the way
Unicode works.
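As a small illustration of that ordering point (again Python and
unicodedata; the base letter and marks are my own choice), two marks with
different canonical combining classes entered in opposite orders read
identically to the author but compare as different strings until
canonical reordering is applied:

```python
import unicodedata

a = "q\u0323\u0307"   # q + COMBINING DOT BELOW (ccc 220) + COMBINING DOT ABOVE (ccc 230)
b = "q\u0307\u0323"   # the same two marks entered in the opposite order

print(a == b)         # False: byte-wise the strings differ

# Canonical reordering (part of every normalization form) puts the marks
# into a defined order, so the comparison succeeds.
print(unicodedata.normalize("NFC", a) ==
      unicodedata.normalize("NFC", b))        # True
```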
The fantastic thing about Unicode is that it takes the thousands of
languages in the world, and the various encodings each used to
represent only a small subset of those languages, and abstracts it all
in a way that makes it relatively simple (or even possible) to
implement a text system that is simultaneously capable of working
equally well in all of those thousands of languages. However, it
requires that Unicode processes deal with a few edge cases (e.g.,
language-dependent collation, grapheme extenders, bidirectional text,
and normalization). The response from some vendors then appears to be:
great, now my software is international, but why should I have to deal
with these few nagging edge cases? Dealing with those few edge cases
(basically the 21 named algorithms) is a drop in the sea compared to
what would otherwise be involved in implementing separate text
processing systems and separate encodings for hundreds or thousands of
languages.
So the great flexibility of combining marks and other grapheme
extenders in Unicode implies that the canonical equivalence of strings
must be dealt with. Can an implementation get away with handling
normalization incorrectly and gain a few milliseconds of bragging
rights over the competition? Certainly. But that implementation has
basically not implemented Unicode. An XML implementation could also
accept only ASCII characters and probably outshine everything else in
performance. Imagine the performance gains of only having to deal with
fixed-width octets instead of UTF-8: even grapheme cluster counts could
be done byte-wise. But that too is not a Unicode implementation.
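For what it's worth, the gap between those counting levels is easy to
see; this small Python snippet (using my decomposed example letter again)
counts the same user-perceived character three different ways:

```python
import unicodedata

s = "e\u0323\u0302"   # one user-perceived character (ệ), in decomposed form

print(len(s.encode("utf-8")))   # 5  -- UTF-8 bytes
print(len(s))                   # 3  -- code points
# Crude grapheme count: code points that are not combining marks.
print(sum(1 for ch in s if unicodedata.combining(ch) == 0))   # 1
```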
The great accomplishment of Unicode, its abstraction of all known
writing systems, requires some extra processor use (over ASCII, for
example). Using one script for many languages implies that byte-wise
collation can no longer be used. Mixing multiple languages in one
document, where some are written left-to-right and others
right-to-left, implies that issues of bidirectionality must be dealt
with. But once these few abstractions are dealt with, we have a robust
text system capable of expressing the writing of every written language
ever conceived (as far as we know). Such a robust text system is an
essential building block of almost everything else an application might
do (few applications get away without some text) and an important
building block for the World Wide Web.
I know some of you are involved in the creation of Unicode and so I'm
probably preaching to the choir in those cases. However, there is a
sentiment floating around these discussions that doing Unicode right
just isn't worth the trouble. Let someone else fix it. Let authors
deal with these problems. I just don't think that is appropriate.
After spending a lot of time looking into Unicode issues, including
normalization, and listening to differing points of view, I really
don't see any solution to normalization other than making it a central
part of text processing (speaking only of canonical, non-singleton
normalization). Such normalization is a central part of the Unicode
processing model. The hit in an HTML or XML parser of confirming each
character as normalized upon parsing is a very minor performance
concern (and even if it weren't minor, it is a part of Unicode
processing). When XML parsing is celebrated over text/html parsing,
most of the same people tell us "so what, parsing is surrounded by
performance bottlenecks anyway, so parsing performance doesn't matter".
Yet when confronted with the prospect of confirming that each character
is not a member of a set of a couple hundred characters, we're told
that this would cause a performance hit so enormous it would be
unacceptable to users. I find that completely non-credible. Due to the
flexibility of Unicode grapheme extension, byte-wise string comparison
simply isn't possible (without normalization anyway). That suggests
that over the life of Unicode's development a design decision was made
to use some small bit of the thousand-fold increase in processing power
to facilitate more flexible text processing. We can't now say: well,
that's great, but we'd rather use those milliseconds somewhere else.
We're already committed to proper string comparison in terms of
canonically ordered combining marks (and precomposed vs. decomposed
characters).
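To give a sense of how light such a check can be, here is a rough sketch
in Python using the standard unicodedata module (my own illustration, not
a proposed spec algorithm): Python 3.8+ exposes unicodedata.is_normalized,
which implements a fast check of this kind, and a parser or conformance
checker could additionally report where a document first departs from NFC.

```python
import unicodedata

def is_nfc(text: str) -> bool:
    """Is the decoded text already in Normalization Form C?"""
    if hasattr(unicodedata, "is_normalized"):          # Python 3.8+ quick check
        return unicodedata.is_normalized("NFC", text)
    return unicodedata.normalize("NFC", text) == text  # fallback: full comparison

def first_unnormalized_index(text: str) -> int:
    """For error reporting: index where the text first diverges from its
    NFC form, or -1 if it is already normalized."""
    if is_nfc(text):
        return -1
    nfc = unicodedata.normalize("NFC", text)
    for i, (a, b) in enumerate(zip(text, nfc)):
        if a != b:
            return i
    return min(len(text), len(nfc))
```

Either way the cost is a single pass over characters, the overwhelming
majority of which are unaffected, which is the point being made above.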
Of course this shouldn't only be the responsibility of parser
implementations. Input systems should handle normalization (even
broader normalization than non-singleton canonical normalization).
Font implementors should be considering normalization. Normalization should
take place early and often: at least in terms of non-singleton
canonical normalization. I don't understand how the W3C could be
considering a partial Unicode implementation as the basis for its
recommendations.
Take care,
Rob