From: L. David Baron <dbaron@dbaron.org>
Date: Fri, 6 Feb 2009 14:58:32 -0800
To: www-style@w3.org, public-i18n-core@w3.org
I think these threads on Unicode Normalization have gotten a little
out of hand; the messages are coming too quickly for all but a small
number of people to keep up with them.
Since this is a topic that needs to be discussed with other
groups, I think what we really need to do is prepare a summary of
the reasons for wanting normalization, the possible solutions, and
their known advantages and disadvantages. We can then share this
with other groups (HTML, ECMA) and other relevant content producers
and consumers. (A huge thread with lots of back-and-forth isn't
useful to share.)
Here is a start at such a summary, but it's far from complete since
I haven't had the time to read anywhere near all the messages in the
thread, or, for that matter, to incorporate information from many of
the ones I did read. It might be worth putting this on a wiki
somewhere and trying to complete it (but I don't think it should get
to more than, say, twice as long as it is now; anything very
substantial should be linked).
-David
Statement of Problem:
Many things that people perceive as "a character" can be
represented in multiple ways in Unicode. To take a simple
example, a small "a" with an acute accent can be represented as
either:
U+00E1 LATIN SMALL LETTER A WITH ACUTE
or as the sequence:
U+0061 LATIN SMALL LETTER A
U+0301 COMBINING ACUTE ACCENT
The tools used to input text vary as to which of these forms
they produce, depending on the software involved (operating
systems, input methods, editors) and perhaps on how the user
enters the text. There may be more of this variation in some
languages than others (LINK TO EVIDENCE NEEDED, IF TRUE).
Unicode normalization is the process of converting to a form in
which these differences are not present. NFC normalization is a
set of rules for converting strings containing characters such as
those above to the most-combined (composed) form (e.g., U+00E1
above), and NFD normalization is a set of rules for converting
everything to the most-separated (decomposed) form (e.g., U+0061
U+0301 above). (NFKC and NFKD are analogous normalization forms
that eliminate even more differences, including some that are
perceivable; normalization to them does not appear to be under
consideration.)
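For concreteness, here is how the example above behaves when
normalized with the standard String.prototype.normalize() method
(which postdates this message, but implements exactly these
forms); an illustrative TypeScript sketch, not part of any
proposal:

    const composed = "\u00E1";     // U+00E1 LATIN SMALL LETTER A WITH ACUTE
    const decomposed = "a\u0301";  // U+0061 followed by U+0301 COMBINING ACUTE ACCENT

    console.log(composed === decomposed);                   // false: raw code point comparison
    console.log(composed === decomposed.normalize("NFC"));  // true:  NFC composes the pair
    console.log(composed.normalize("NFD") === decomposed);  // true:  NFD decomposes U+00E1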
Various Web technologies depend on string matching. For example,
CSS selectors allow matching of author-chosen classes and IDs, and
the document.getElementById() method allows retrieving an element
by its ID. When authors use strings in their own language, those
strings should match whenever the author perceives them to be the
same, whether or not different tools were used to produce, e.g.,
the markup and the style sheet. This author expectation is
not met when the string match fails because of differences in
Unicode normalization.
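A minimal sketch of that failure as engines behave today, with no
normalization anywhere on the path (the ID comparison is by code
point):

    document.body.innerHTML = '<p id="\u00E1">text</p>';
    console.log(document.getElementById("\u00E1"));   // finds the <p> element
    console.log(document.getElementById("a\u0301"));  // null: the author sees
                                                      // the same ID, but the
                                                      // code points differ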
Possible solutions:
(1) State that authors producing content for the Web should use
tools that always use one normalization. The preferred
normalization would need to be defined (NFC appears to be
preferred by a majority). Authors who do not follow this
recommendation risk the problems described above; a sketch of
the kind of check a tool could apply follows this item.
Advantages:
It does not require changes to Web standards or software that
consumes Web documents. (CITATION NEEDED)
(MORE HERE)
Disadvantages:
Lots of possible points of failure. (CITATION NEEDED)
Doesn't substantively improve the problematic situation
described above. (CITATION NEEDED)
(MORE HERE)
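The check a conforming tool could apply under option (1) is
simple; here is a sketch (isNFC is a hypothetical helper, not a
proposed API):

    function isNFC(text: string): boolean {
      // A string is already in NFC iff normalizing it is a no-op.
      return text === text.normalize("NFC");
    }

A tool could run such a check over IDs, class names, and selectors
at save time and warn, or simply renormalize the whole file.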
(2) Require a normalization pass during the parsing of text-based
Web content formats (perhaps after character encoding conversion
but before parsing), but do not perform any further normalization.
The preferred normalization would need to be defined (NFC appears
to be the majority preference). A sketch of where this pass would
sit follows this item.
Advantages:
Requires changes to software and specifications at a very
small number of places. (CITATION NEEDED)
(MORE HERE)
Disadvantages:
(MORE HERE)
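A sketch of where the pass in option (2) would sit; parse() and
decodeAndParse() are hypothetical names, and the point is the
single, early normalization step:

    declare function parse(text: string): unknown;  // stand-in for any Web-format parser

    function decodeAndParse(bytes: Uint8Array, encoding: string) {
      // 1. character encoding conversion
      const text = new TextDecoder(encoding).decode(bytes);
      // 2. one normalization pass, before the parser ever sees the text
      return parse(text.normalize("NFC"));
    }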
(3) Require that all data structures representing Web content be
in a consistent normalization. (This may be a superset of (2),
though not necessarily a strict one: it depends on whether the
parsing rules for any Web languages would produce different
results if normalization were done before parsing.) The
preferred normalization would need to be defined (NFC appears
to be the majority preference). A sketch of what this implies
follows this item.
Advantages:
(MORE HERE)
Disadvantages:
Requires changes to specifications and software at many
points. (CITATION NEEDED)
(MORE HERE)
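To illustrate what option (3) implies: every API through which
strings enter a content data structure would need the same
treatment. A hypothetical wrapper, for illustration only:

    function setAttributeNFC(el: Element, name: string, value: string) {
      // Under option (3), every such entry point normalizes on the way
      // in, so the stored data structures stay in one consistent form.
      el.setAttribute(name.normalize("NFC"), value.normalize("NFC"));
    }

The same treatment would apply to parsers, DOM mutation methods,
CSSOM setters, and so on, which is where the many points of
change come from.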
(4) Require that all string comparisons done by implementations of
Web technology report that strings that normalize to the same
thing compare as equal. A preferred normalization would not need
to be defined. A sketch of such a comparison follows this item.
Advantages:
Allows the text to persist, without modification, in whatever
normalization the author preferred to produce it in.
(CITATION NEEDED)
(MORE HERE)
Disadvantages:
Performance of comparisons. (CITATION NEEDED)
Requires changes to specifications and software at many
points. (CITATION NEEDED)
(MORE HERE)
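For concreteness, a normalization-insensitive comparison under
option (4) might look like this sketch (normEquals is a
hypothetical name); the fast path reduces, but does not
eliminate, the performance cost noted above:

    function normEquals(a: string, b: string): boolean {
      if (a === b) return true;  // fast path: code-point-identical strings
      return a.normalize("NFC") === b.normalize("NFC");
    }

Implementations could also use the quick-check properties defined
in UAX #15 to skip normalization of strings already known to be
normalized.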
--
L. David Baron http://dbaron.org/
Mozilla Corporation http://www.mozilla.com/