- From: Robert J Burns <rob@robburns.com>
- Date: Wed, 4 Feb 2009 15:07:59 -0600
- To: "Anne van Kesteren" <annevk@opera.com>
- Cc: "Aryeh Gregor" <Simetrical+w3c@gmail.com>, public-i18n-core@w3.org, jonathan@jfkew.plus.com, "W3C Style List" <www-style@w3.org>
Hi Anne,

On Feb 4, 2009, at 11:01 AM, Anne van Kesteren wrote:

> On Tue, 03 Feb 2009 19:38:32 +0100, Robert J Burns <rob@robburns.com> wrote:
>> Anne wrote:
>>> (As far as I can tell XML is Unicode Normalization agnostic. It merely
>>> recommends authors to do a certain thing. We can certainly recommend
>>> authors to do a certain thing in HTML and CSS too...)
>>
>> XML is not Unicode agnostic.
>
> I did not say that.

Some more explanation of what you meant would be helpful. Are you saying that XML is Unicode normalization agnostic in the sense that it doesn't matter which normalization form (NFC or NFD) is used when comparing strings? If that's what you meant, then we agree. If you meant that XML is Unicode normalization agnostic in that it doesn't care (or know?) whether two canonically equivalent strings are a match, then I disagree. Unicode is fairly clear that two canonically equivalent strings are equivalent even if their code points differ.

>> Unicode is a normative reference in terms of text handling. So an
>> XML UA is by definition also a Unicode UA. That means that an
>> implementation needs to have some reason for comparing two byte-wise
>> unequal though canonically equivalent strings and determining they do
>> not match. I haven't heard anyone here say why an XML processor needs
>> to support (and therefore promote) such errors.
>
> The XML grammar is expressed in Unicode codepoints so comparison
> also happens on that level.

However, Unicode has a SHOULD-level requirement that two canonically equivalent but code-point-differing strings match. Unicode's Chapter 3 (conformance clause C6) says:

> A process shall not assume that the interpretations of two
> canonical-equivalent character sequences are distinct.

If the strings are not interpreted as distinct, then they are interpreted as the same: as equivalent strings, even though they have different code points (I admit Unicode's wording here is poor, but this is how I understand the recommendation).

The next question is why Unicode leaves this as a SHOULD and not a MUST. My reading is that it leaves open the possibility that a UA might have a reason to treat the strings as distinct and therefore might have good reason to ignore the SHOULD recommendation. I have a hard time thinking of what kind of UA would want to ignore it, and when one UA ignores it, it may create a need for other UAs to ignore it as well. More to the point, I can think of no reason that XML (or CSS or HTML), as a consumer of Unicode, would want to ignore this recommendation.

You have said that a use case might be that authors want to use canonically equivalent characters as semantically distinct. Do you have any real-world examples of this? And even if we could find some, why would XML (or CSS or text/HTML or JavaScript) want to facilitate such poor practice?

Take for example the following three strings (NFD, NFC and non-normalized):

〈this string〉
〈this string〉
〈this string〉

There are no combining marks here. The angle brackets each represent canonically equivalent characters, which UAs should not interpret as distinct (but may do so if there's some reason to). Ideally we wouldn't have this problem in Unicode, but we do.
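To make the matching question concrete, here is a minimal TypeScript sketch (an illustration added here, not part of the original mail). It assumes the angle brackets in the example are U+2329/U+232A, whose canonical decompositions are the singletons U+3008/U+3009, and it relies on ECMAScript's String.prototype.normalize():

    // Two visually identical strings built from canonically equivalent
    // angle brackets (assumed code points, for illustration only).
    const a: string = "\u2329this string\u232A"; // LEFT/RIGHT-POINTING ANGLE BRACKET
    const b: string = "\u3008this string\u3009"; // CJK LEFT/RIGHT ANGLE BRACKET

    console.log(a === b);                                   // false: code points differ
    console.log(a.normalize("NFC") === b.normalize("NFC")); // true: canonically equivalent
    console.log(a.normalize("NFD") === b.normalize("NFD")); // true under NFD as well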
In many ways this is more nefarious than case sensitivity, since (unless there's a buggy font) the glyphs should be identical, so there's no way for an author to track down the problem (in contrast to the non-canonical compatibility characters, where at least a glyph difference should point authors to the problem). To me it is an error to use canonically equivalent characters as semantically distinct characters.

I agree with Henri that input systems should address this. However, there are absolutely zero specifications that I am aware of providing guidance to input systems on this. The Unicode Standard doesn't address input systems. I think it should, and that such norms should cover the canonical and compatibility equivalent characters, but until that happens there is nothing saying what input systems should do (or MUST do). So saying this is a bug in the input systems that should be addressed there is not really tenable. Even if new input systems addressed the situation, there would still be Unicode 1-6 UAs out there producing non-normalized content. Moreover, even when an input system fixes the situation internally, there will still be one UA that outputs NFC and another that outputs NFD, so the only way to meet the need of internally matching equivalent Unicode strings is to perform parser (or on-the-fly) normalization.

Again, it would have been better if Unicode had never admitted canonical and compatibility equivalent characters (the reasons given for doing so are dubious, and they now occupy 5,000 code points out of 65,000 in what has become a tightly packed and therefore precious Basic Multilingual Plane).

Finally, Henri compared this to case sensitivity. Indeed, I think we should think of this as something like case sensitivity, but in a situation where XML (and even Unicode) had failed to be clear about how to interpret case-differing strings. Imagine the XML specification had failed to address case sensitivity, and some UAs then went with case-insensitive string comparisons while others went with case-sensitive ones. That would be something we would need to address by clarifying XML. I think it is the same with canonical equivalence: it needs to be addressed with the same precision with which XML dealt with case sensitivity. However, that has not been done. Simply saying that the input systems will take care of it seems absurd to me (especially when such guidance doesn't yet come from Unicode or the W3C).
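As a concrete sketch of what parser (or on-the-fly) normalization could look like, here is a short TypeScript illustration (again added here, not from the original mail; the helper names are made up, and it relies on ECMAScript's String.prototype.normalize()):

    // Hypothetical helpers: normalize each name once at parse time so that
    // later comparisons are plain code-point equality.
    function internName(raw: string): string {
      return raw.normalize("NFC");
    }

    function namesMatch(a: string, b: string): boolean {
      return internName(a) === internName(b);
    }

    // NFC output from one editor matches NFD output from another:
    console.log(namesMatch("caf\u00E9", "cafe\u0301")); // true

Take care,
Rob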
Received on Wednesday, 4 February 2009 21:08:42 UTC