Normalization at user option

Dear XML Editors and XML Core WG,

The XML specification contains the following:

match 
               (Of strings or names:) Two strings or names being compared
               must be identical. Characters with multiple possible
               representations in ISO/IEC 10646 (e.g. characters with both
               precomposed and base+diacritic forms) match only if they
               have the same representation in both strings. At user option,
               processors may normalize such characters to some canonical
               form. No case folding is performed.


There has been some discussion about when exactly this 'normalization
at user option' may occur, in particular whether it occurs before matching
or after matching. The I18N WG has discussed this issue at today's
teleconference, and proposes the following:
[http://www.w3.org/International/Group/2000/03/telecon4/minutes.html]
[http://lists.w3.org/Archives/Member/w3c-i18n-ig/2000Mar/0074.html]

- Change "At user option, processors may normalize such characters to
          some canonical form." to say that this can occur before or
          after matching, e.g.: "At user option, processors may normalize
          such characters to some canonical form before or after matching."

  Rationale: This makes XML conformant with the Unicode standard, which
  says that an application cannot be expected to distinguish these two
  (or more) representations. This is also in line with the policy of
  Early Uniform Normalization, in the following way: Because XML
  processors may or may not normalize, the only way to guarantee
  uniform behavior on all processors is to normalize early (this
  argument goes back to a discussion with Tim Bray at WWW8 in Brisbane).
  [Please note that allowing normalization *before* matching is the
  point here; whether normalization after matching is allowed or not
  is not that important. Please also note that we do not think
  the average XML processor should actually do normalization.]
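
  To illustrate the difference that normalizing *before* matching makes,
  here is a minimal sketch in Python. It is only an illustration, not part
  of the proposed wording: the function name names_match and the sample
  name are hypothetical, and Normalization Form C is assumed as the
  canonical form.

```python
import unicodedata

def names_match(a: str, b: str, normalize: bool = False) -> bool:
    """Compare two names code point by code point.

    With normalize=True, both names are first brought to Normalization
    Form C (an assumed choice of canonical form), i.e. normalization
    happens *before* matching.
    """
    if normalize:
        a = unicodedata.normalize("NFC", a)
        b = unicodedata.normalize("NFC", b)
    return a == b

# The same name in two representations:
precomposed = "\u00e9l\u00e9ment"     # e-acute as the single code point U+00E9
decomposed = "e\u0301le\u0301ment"    # 'e' followed by U+0301 COMBINING ACUTE ACCENT

print(names_match(precomposed, decomposed))                  # False
print(names_match(precomposed, decomposed, normalize=True))  # True
```

  A processor that does not normalize reports a mismatch, while one that
  normalizes before matching reports a match. Data that is already
  normalized behaves the same on both.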


While we/you are at it, we also propose the following changes:

- Change 'some canonical form' to 'some canonical form (preferably
  Normalization Form C of Unicode Technical Report #15)'
  [http://www.unicode.org/unicode/reports/tr15/].

  Rationale: This does not change the range of allowed behaviour.
  At the time of writing of the XML specification, no really suitable
  form was known, but now, there is wide agreement on Normalization
  Form C.

- Change '(e.g. characters with both precomposed and base+diacritic forms)'
  to '(i.e. canonical equivalents according to [Unicode]; e.g. characters
  with both precomposed and base+diacritic forms)'.

  Rationale: That was clearly what was intended. Clarifying it avoids potential
  misunderstandings, such as reading the text to cover equivalences other than
  canonical ones. (Covering only a subset of the canonical equivalences would
  not be much of a problem.)
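
  The distinction between canonical equivalents and other equivalences
  can be sketched as follows. This is again only an illustration: the
  helper canonically_equivalent is a hypothetical name, and it assumes
  that comparing Normalization Form C results is an acceptable test for
  canonical equivalence.

```python
import unicodedata

def canonically_equivalent(a: str, b: str) -> bool:
    """True if both strings have the same Normalization Form C.

    NFC applies only canonical mappings, so compatibility equivalents
    (such as ligatures) are left distinct.
    """
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

# Precomposed vs. base+diacritic: canonical equivalents.
print(canonically_equivalent("\u00c5", "A\u030a"))   # True

# ANGSTROM SIGN (U+212B) vs. LATIN CAPITAL LETTER A WITH RING ABOVE
# (U+00C5): also canonical equivalents.
print(canonically_equivalent("\u00c5", "\u212b"))    # True

# The 'fi' ligature (U+FB01) vs. the letters "fi": only a compatibility
# equivalence, so Form C keeps them distinct.
print(canonically_equivalent("\ufb01", "fi"))        # False
```

  Only the first two pairs fall under "canonical equivalents according to
  [Unicode]"; the ligature case is the kind of equivalence the clarified
  wording is meant to exclude.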


Regards,   Martin.





#-#-#  Martin J. Du"rst, I18N Activity Lead, World Wide Web Consortium
#-#-#  mailto:duerst@w3.org   http://www.w3.org/People/D%C3%BCrst

Received on Wednesday, 15 March 2000 03:18:44 UTC