Unicode Normalization thread should slow down; summary needed

I think these threads on Unicode Normalization have gotten a little
out of hand; the messages are coming too quickly for all but a small
number of people to keep up with them.

Since this is a topic that needs to be discussed with other
groups, I think what we really need to do is prepare a summary of
the reasons for wanting normalization, the possible solutions, and
their known advantages and disadvantages.  We can then share this
with other groups (HTML, ECMA) and other relevant content producers
and consumers.  (A huge thread with lots of back-and-forth isn't
useful to share.)


Here is a start at such a summary, but it's far from complete since
I haven't had the time to read anywhere near all the messages in the
thread, or, for that matter, to incorporate information from many of
the ones I did read.  It might be worth putting this on a wiki
somewhere and trying to complete it (but I don't think it should get
to more than, say, twice as long as it is now; anything very
substantial should be linked).

-David


Statement of Problem:

  Many things that people perceive as "a character" can be
  represented in multiple ways in Unicode.  To take a simple
  example, a small "a" with an acute accent can be represented as
  either:
    U+00E1  LATIN SMALL LETTER A WITH ACUTE
  or as the sequence:
    U+0061  LATIN SMALL LETTER A
    U+0301  COMBINING ACUTE ACCENT
  Which of these forms is produced when a user enters text depends
  on the tools involved (operating system, input method, editor) and
  perhaps on how the user enters the text.  There may be more of
  this variation in some languages than in others (LINK TO EVIDENCE
  NEEDED, IF TRUE).
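
  For concreteness (this example is added for this summary rather
  than taken from the thread), the two representations above are
  different code point sequences, so a plain string comparison, for
  example in ECMAScript, treats them as different:

    // Two spellings of what a user perceives as one character, "á":
    const composed = "\u00E1";          // U+00E1 (precomposed)
    const decomposed = "\u0061\u0301";  // U+0061 + U+0301 (decomposed)

    // false: strings are compared code point by code point,
    // not by perceived character:
    console.log(composed === decomposed);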

  Unicode normalization is the process of converting to a form in
  which these differences are not present.  NFC normalization is a
  set of rules for converting strings containing characters such as
  those above to the most-combined (composed) form (e.g., U+00E1
  above), and NFD normalization is a set of rules for converting
  everything to the most-separated (decomposed) form (e.g., U+0061
  U+0301 above).  (NFKC and NFKD are analogous normalization forms
  that eliminate even more differences, including some that are
  perceivable; normalization to them does not appear to be under
  consideration.)
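
  As an illustration of what these forms do (the snippet is added
  for this summary and uses the String.prototype.normalize() method
  available in current ECMAScript implementations), normalizing both
  strings to the same form, either NFC or NFD, makes them compare
  equal:

    const composed = "\u00E1";          // á as U+00E1
    const decomposed = "\u0061\u0301";  // á as U+0061 + U+0301

    console.log(decomposed.normalize("NFC") === composed);  // true
    console.log(composed.normalize("NFD") === decomposed);  // true
    console.log(composed.normalize("NFC")
                === decomposed.normalize("NFC"));           // true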

  Various Web technologies depend on string matching.  For example,
  CSS selectors allow matching of author-chosen classes and IDs, and
  the document.getElementById() method allows retrieving an element
  by its ID.  When authors use strings in their own language, those
  strings should match whenever the author perceives them to be the
  same, whether or not different tools were used to produce,
  e.g., the markup and the style sheet.  This author expectation is
  not met when the string match fails because of differences in
  Unicode normalization.
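
  A minimal sketch of how such a failure can look (the markup and
  script below are hypothetical, assuming one tool saved the markup
  in NFC and another saved the script in NFD):

    // Markup, saved by a tool that emits NFC:
    //   <p id="día">...</p>      (the "í" is U+00ED)
    //
    // Script, saved by a tool that emits NFD ("i" + U+0301):
    const el = document.getElementById("di\u0301a");
    // null: getElementById() matches code points exactly, so the
    // visually identical NFD ID finds nothing.
    console.log(el);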


Possible solutions:

  (1) State that authors producing content for the Web should use
  tools that always produce a single normalization form.  The
  preferred normalization would need to be defined (NFC appears to
  be preferred by a majority).  Authors who do not follow this
  recommendation risk the problems described above.  (A sketch of
  such an authoring-side step is given at the end of this option.)

    Advantages:
      It does not require changes to Web standards or software that
      consumes Web documents. (CITATION NEEDED)

      (MORE HERE)

    Disadvantages:
      Many possible points of failure, since every tool in the
      authoring chain would have to cooperate.  (CITATION NEEDED)

      Doesn't substantively improve the problematic situation
      described above.  (CITATION NEEDED)

      (MORE HERE)
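
    Sketch (illustration only; not part of the proposal): one way an
    authoring-side tool or publish step could enforce a single form.
    The file names and the choice of NFC are assumptions made for
    this example (TypeScript, using Node's fs module):

      import { readFileSync, writeFileSync } from "fs";

      // Hypothetical publish step: rewrite each source file in NFC
      // so that markup, style sheets, and scripts agree on one form.
      for (const file of ["index.html", "style.css", "script.js"]) {
        const text = readFileSync(file, "utf8");
        writeFileSync(file, text.normalize("NFC"), "utf8");
      }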

  (2) Require a normalization pass when text-based Web content
  formats are read (perhaps after character encoding conversion but
  before parsing), but do not perform any further normalization.
  The preferred normalization would need to be defined (NFC appears
  to be the majority preference).  (A sketch of such an input
  pipeline is given at the end of this option.)

    Advantages:
      Requires changes to software and specifications at a very
      small number of places.  (CITATION NEEDED)

      (MORE HERE)

    Disadvantages:

      (MORE HERE)
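
    Sketch (illustration only; not part of the proposal): the
    consumer-side input pipeline this option describes, with
    normalization applied once between decoding and parsing.
    parseMarkup() is a stand-in for the real parser:

      const parseMarkup = (source: string) => source;  // stand-in parser

      // Decode bytes, normalize once, then parse as today.
      function readDocument(rawBytes: Uint8Array, encoding: string) {
        const text = new TextDecoder(encoding).decode(rawBytes);
        const normalized = text.normalize("NFC");
        return parseMarkup(normalized);
      }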

  (3) Require that all data structures representing Web content be
  in a consistent normalization.  (This may be a superset of (2),
  though not necessarily, since the parsing rules of some Web
  languages might produce different results depending on whether
  normalization was done before parsing.)  The preferred
  normalization would need to be defined (NFC appears to be the
  majority preference).  (A sketch of normalizing at a data
  structure boundary is given at the end of this option.)

    Advantages:

      (MORE HERE)

    Disadvantages:
      Requires changes to specifications and software at many
      points.  (CITATION NEEDED)

      (MORE HERE)
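
    Sketch (illustration only; not part of the proposal): a data
    structure that normalizes strings on the way in, so everything
    it stores, and therefore everything it compares, is already in
    one form.  The class and its API are invented for this example:

      class NormalizedAttributeMap {
        private attrs = new Map<string, string>();
        set(name: string, value: string): void {
          this.attrs.set(name.normalize("NFC"), value.normalize("NFC"));
        }
        get(name: string): string | undefined {
          return this.attrs.get(name.normalize("NFC"));
        }
      }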

  (4) Require that all string comparisons done by implementations of
  Web technology treat strings that normalize to the same thing as
  equal.  A preferred normalization would not need to be defined.
  (A sketch of such a comparison is given at the end of this
  option.)

    Advantages:
      Allows the text to persist, without modification, in whatever
      normalization the author preferred to produce it in.
      (CITATION NEEDED)

      (MORE HERE)

    Disadvantages:
      Performance of comparisons: each comparison may need to
      normalize, or check the normalization of, both operands.
      (CITATION NEEDED)

      Requires changes to specifications and software at many
      points.  (CITATION NEEDED)

      (MORE HERE)
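
    Sketch (illustration only; not part of the proposal): the kind
    of comparison this option would require everywhere a Web
    technology matches strings.  A real implementation would likely
    fast-path strings that are already normalized:

      function normalizedEquals(a: string, b: string): boolean {
        // Any single form works here; NFC is used only for the example.
        return a.normalize("NFC") === b.normalize("NFC");
      }

      console.log(normalizedEquals("\u00E1", "\u0061\u0301"));  // true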

-- 
L. David Baron                                 http://dbaron.org/
Mozilla Corporation                       http://www.mozilla.com/
