- From: Martin J. Duerst <duerst@w3.org>
- Date: Wed, 15 Mar 2000 17:19:45 +0900
- To: xml-editor@w3.org
- Cc: w3c-i18n-ig@w3.org, w3c-xml-core-wg@w3.org
Dear XML Editors and XML Core WG,

The XML specification contains the following:

    match
    (Of strings or names:) Two strings or names being compared must be identical. Characters with multiple possible representations in ISO/IEC 10646 (e.g. characters with both precomposed and base+diacritic forms) match only if they have the same representation in both strings. At user option, processors may normalize such characters to some canonical form. No case folding is performed.

There has been some discussion about when exactly this 'normalization at user option' may occur, in particular whether it occurs before matching or after matching.

The I18N WG has discussed this issue at today's teleconference, and proposes the following:
[http://www.w3.org/International/Group/2000/03/telecon4/minutes.html]
[http://lists.w3.org/Archives/Member/w3c-i18n-ig/2000Mar/0074.html]

- Change "At user option, processors may normalize such characters to some canonical form." to say that this can occur before or after matching, e.g.: "At user option, processors may normalize such characters to some canonical form before or after matching."

  Rationale: This makes XML conformant with the Unicode standard, which says that an application cannot be expected to distinguish these two (or more) representations. This is also in line with the policy of Early Uniform Normalization, in the following way: Because XML processors are allowed either to normalize or not, the only way to guarantee uniform behavior on all processors is to normalize early (this argument goes back to a discussion with Tim Bray at WWW8 in Brisbane).

  [Please note that allowing normalization *before* matching is the point here; whether normalization after matching is allowed or not is not that important. Please also note that we do not think the average XML processor should actually do normalization.]
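To illustrate the difference that normalizing before matching makes, here is a minimal sketch in Python. The function name `names_match` and its `normalize_first` flag are hypothetical and not taken from the XML specification or any XML processor, and Normalization Form C is assumed as the canonical form purely for the sake of the example.

```python
import unicodedata


def names_match(a: str, b: str, normalize_first: bool = False) -> bool:
    """Compare two names code point by code point.

    With normalize_first=False this mirrors the strict rule: the strings
    must have the same representation. With normalize_first=True both
    strings are brought to NFC before comparison (an assumed choice of
    canonical form). No case folding is performed in either mode.
    """
    if normalize_first:
        a = unicodedata.normalize("NFC", a)
        b = unicodedata.normalize("NFC", b)
    return a == b


precomposed = "caf\u00e9"   # e-acute as the single code point U+00E9
decomposed = "cafe\u0301"   # 'e' followed by U+0301 COMBINING ACUTE ACCENT

print(names_match(precomposed, decomposed))                        # False
print(names_match(precomposed, decomposed, normalize_first=True))  # True
```

A processor that normalizes before matching and one that does not would disagree on this pair of names, which is why only data that has already been normalized gives the same result on both.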
While we/you are at it, we also propose the following changes:

- Change 'some canonical form' to 'some canonical form (preferably Normalization Form C of Unicode Technical Report #15)' [http://www.unicode.org/unicode/reports/tr15/].

  Rationale: This does not change the range of allowed behaviour. At the time of writing of the XML specification, no really suitable form was known, but now there is wide agreement on Normalization Form C.

- Change '(e.g. characters with both precomposed and base+diacritic forms)' to '(i.e. canonical equivalents according to [Unicode]; e.g. characters with both precomposed and base+diacritic forms)'.

  Rationale: That was clearly what was intended. Clarifying it avoids potential misunderstandings, such as reading the text to include equivalences other than canonical ones. (Including only a subset of the canonical equivalences wouldn't be that much of a problem.)

Regards, Martin.

#-#-# Martin J. Du"rst, I18N Activity Lead, World Wide Web Consortium
#-#-# mailto:duerst@w3.org http://www.w3.org/People/D%C3%BCrst
Received on Wednesday, 15 March 2000 03:18:44 UTC