- From: Martin J. Duerst <duerst@w3.org>
- Date: Wed, 15 Mar 2000 17:19:45 +0900
- To: xml-editor@w3.org
- Cc: w3c-i18n-ig@w3.org, w3c-xml-core-wg@w3.org
Dear XML Editors and XML Core WG,
The XML specification contains the following:
match
(Of strings or names:) Two strings or names being compared
must be identical. Characters with multiple possible
representations in ISO/IEC 10646 (e.g. characters with both
precomposed and base+diacritic forms) match only if they
have the same representation in both strings. At user option,
processors may normalize such characters to some canonical
form. No case folding is performed.
There has been some discussion about when exactly this 'normalization
at user option' may occur, in particular whether it occurs before matching
or after matching. The I18N WG has discussed this issue at today's
teleconference, and proposes the following:
[http://www.w3.org/International/Group/2000/03/telecon4/minutes.html]
[http://lists.w3.org/Archives/Member/w3c-i18n-ig/2000Mar/0074.html]
- Change "At user option, processors may normalize such characters to
some canonical form." to say that this can occur before or
after matching, e.g.: "At user option, processors may normalize
such characters to some canonical form before or after matching."
Rationale: This makes XML conformant with the Unicode standard, which
says that an application cannot be expected to distinguish these two
(or more) representations. This is also in line with the policy of
Early Uniform Normalization, in the following way: Because XML
processors may or may not normalize, the only way to guarantee
uniform behavior on all processors is to normalize early (this
argument goes back to a discussion with Tim Bray at WWW8 in Brisbane).
[Please note that allowing normalization *before* matching is the
point here; whether normalization after matching is allowed or not
is not that important. Please also note that we do not think
the average XML processor should actually do normalization.]
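To illustrate the difference that normalizing before matching makes, here
is a minimal sketch in Python. It is not taken from the XML specification:
the function name names_match and the flag normalize_first are hypothetical
and only stand in for the 'user option' in the definition, and the choice
of Python's standard unicodedata module is an assumption made for the
example.

```python
import unicodedata


def names_match(a: str, b: str, normalize_first: bool = False) -> bool:
    """Compare two names code point by code point.

    normalize_first is a hypothetical switch standing in for the
    'user option' in the XML definition of "match".
    """
    if normalize_first:
        # Normalize both sides to the same canonical form before comparing.
        a = unicodedata.normalize("NFC", a)
        b = unicodedata.normalize("NFC", b)
    return a == b


precomposed = "caf\u00e9"   # ends in U+00E9 (precomposed e with acute)
decomposed = "cafe\u0301"   # ends in e + U+0301 (base + combining acute)

print(names_match(precomposed, decomposed))                        # False
print(names_match(precomposed, decomposed, normalize_first=True))  # True
```

Without the option, the two spellings do not match, which is the strict
behavior the current text describes. With normalization applied before
matching, they do.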
While we/you are at it, we also propose the following changes:
- Change 'some canonical form' to 'some canonical form (preferably
Normalization Form C of Unicode Technical Report #15)'
[http://www.unicode.org/unicode/reports/tr15/].
Rationale: This does not change the range of allowed behaviour.
When the XML specification was written, no really suitable
form was known, but there is now wide agreement on Normalization
Form C.
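As a rough illustration of what Normalization Form C does, the following
sketch uses Python's standard unicodedata module (an assumption made for
the example; the helper name to_nfc is hypothetical). It shows NFC
composing a base+diacritic sequence, mapping a singleton equivalent to its
preferred character, and leaving already normalized text unchanged.

```python
import unicodedata


def to_nfc(text: str) -> str:
    """Return text in Normalization Form C (canonical composition)."""
    return unicodedata.normalize("NFC", text)


# Base letter + combining acute composes to the precomposed character.
assert to_nfc("e\u0301") == "\u00e9"

# U+212B ANGSTROM SIGN is canonically equivalent to U+00C5 and maps to it.
assert to_nfc("\u212b") == "\u00c5"

# Applying NFC a second time changes nothing (the form is stable).
once = to_nfc("A\u030a")
assert to_nfc(once) == once

# NFD is the decomposed counterpart: it splits the precomposed character.
assert unicodedata.normalize("NFD", "\u00e9") == "e\u0301"
```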
- Change '(e.g. characters with both precomposed and base+diacritic forms)'
to '(i.e. canonical equivalents according to [Unicode]; e.g. characters
with both precomposed and base+diacritic forms)'.
Rationale: That was clearly what was intended. Clarifying it avoids potential
misunderstandings, such as reading the text to include equivalences other
than canonical ones. (Including only a subset of the canonical equivalences
would not be much of a problem.)
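The distinction between canonical and other (compatibility) equivalences
can be sketched as follows. This is only an illustration under stated
assumptions: it uses Python's standard unicodedata module, and the two
helper names are hypothetical. Two strings are treated as canonically
equivalent if their NFD forms are identical, and as compatibility
equivalent if their NFKD forms are identical.

```python
import unicodedata


def canonically_equivalent(a: str, b: str) -> bool:
    """True if a and b have the same canonical decomposition (NFD)."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


def compatibility_equivalent(a: str, b: str) -> bool:
    """True if a and b have the same compatibility decomposition (NFKD)."""
    return unicodedata.normalize("NFKD", a) == unicodedata.normalize("NFKD", b)


# Precomposed vs. base+diacritic: a canonical equivalence.
print(canonically_equivalent("\u00e9", "e\u0301"))    # True

# The "fi" ligature U+FB01 vs. the letters f, i: compatibility only.
print(canonically_equivalent("\ufb01", "fi"))         # False
print(compatibility_equivalent("\ufb01", "fi"))       # True
```

Under the proposed wording, only the first kind of pair falls within the
scope of the definition of "match".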
Regards, Martin.
#-#-# Martin J. Du"rst, I18N Activity Lead, World Wide Web Consortium
#-#-# mailto:duerst@w3.org http://www.w3.org/People/D%C3%BCrst
Received on Wednesday, 15 March 2000 03:18:44 UTC