- From: Andreas Strotmann <strotmann@rrz.uni-koeln.de>
- Date: Tue, 02 May 2006 10:49:15 -0600
- To: juanrgonzaleza@canonicalscience.com
- CC: www-math@w3.org
juanrgonzaleza@canonicalscience.com wrote:
> Neil Soiffer wrote:
>
>>> Are you sure that Unicode *precomposed characters* are
>>> just a rendering technique?
>>>
>> Yes. They are defined to be equivalent to their
>> decomposed equivalent. Since you are very good at
>> looking at standards when prompted, I'll let you track
>> down the reference as to why they are part of Unicode.
>>
>
> I thought that o + combining-diaeresis and ö were two different things in
> Unicode even when both are rendered equal. Of course, both are defined to
> be "canonically equivalent" via "canonical decomposition" but are not
> defined to be "equivalent".
>

Another case of a misunderstanding, in this case of the word "canonical".

Many Unicode characters, for example the Vietnamese ones that carry more than one combining diacritic on a single base character, have many different decompositions, all equivalent. Within this equivalence class of decompositions of a Unicode character, exactly one is designated as "canonical". "Canonical equivalence" is thus a way to make the term "equivalence" even stronger in its mathematical sense, because there are mathematical equivalence classes that do not have a single canonical member from which the whole class can be generated. In other words, o + combining-diaeresis and ö are not just equivalent, they are canonically equivalent.

As far as I remember (and I have followed the Unicode/ISO 10646 discussion since before the two merged), quite a few characters were added to the standard precisely to provide canonical equivalence classes for all known uses of characters in the world's languages that Unicode covers.

You can get a better sense of how strong the concept of canonical equivalence is intended to be in the Unicode standard by looking at one of its intended applications: the recommendation for an application that provides digital signatures for a Unicode text is to transform the text into its canonical form before calculating the cryptographic hash for it. The reason for this recommendation is that any Unicode-compliant application is allowed to replace any Unicode character sequence by any canonically equivalent sequence, so the clear-text message the recipient sees may be composed of a different sequence of Unicode characters than the one the sender wrote. This would break any cryptographic checksum algorithm, were it not for the fact that both the sender and the recipient can calculate it over the *canonical* form of their versions of the text, which consists of exactly the same sequence of characters at both ends.

In a nutshell: canonical equivalence is not, as you seem to think, a weaker form of equivalence, but a stronger one ("stronger" in a mathematically well-defined way).

 - Andreas

PS: Have you read the Unicode standard? It is truly fascinating reading, every single word of it. I remember seeing this discussed there at considerable length, with real multi-part character examples, including a whole bag of mutually equivalent decompositions of a single character, to illustrate the need not just for equivalence but for the stronger canonical equivalence.
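PPS: To make the two points above concrete, here is a minimal Python sketch. It is my own illustration, not something taken from the Unicode standard; it uses the standard unicodedata and hashlib modules, and the choice of NFC as the canonical form is simply one of the canonical normalization forms an application might pick. It shows, first, that "o" + COMBINING DIAERESIS and the precomposed "ö" are distinct code point sequences that normalize to the same canonical form, and second, that hashing the canonical form gives the same digest no matter which equivalent spelling each end of the exchange happens to hold.

    import unicodedata
    import hashlib

    decomposed = "o\u0308"    # LATIN SMALL LETTER O + COMBINING DIAERESIS
    precomposed = "\u00F6"    # LATIN SMALL LETTER O WITH DIAERESIS (precomposed)

    # The two spellings are different code point sequences...
    assert decomposed != precomposed

    # ...but normalizing either one to a canonical form makes them identical.
    assert unicodedata.normalize("NFC", decomposed) == precomposed
    assert unicodedata.normalize("NFD", precomposed) == decomposed

    def signature_hash(text):
        # Hash the canonical (here: NFC) form, so that canonically
        # equivalent spellings of the same text produce the same digest
        # even if an application has re-composed or re-decomposed it.
        canonical = unicodedata.normalize("NFC", text)
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

    # Sender and recipient agree on the digest despite holding
    # different (but canonically equivalent) character sequences.
    assert signature_hash(decomposed) == signature_hash(precomposed)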
Received on Tuesday, 2 May 2006 16:49:34 UTC