- From: Andreas Strotmann <strotmann@rrz.uni-koeln.de>
- Date: Tue, 02 May 2006 10:49:15 -0600
- To: juanrgonzaleza@canonicalscience.com
- CC: www-math@w3.org
juanrgonzaleza@canonicalscience.com wrote:
> Neil Soiffer wrote:
>
>
>>> Are you sure that Unicode *precomposed characters* are
>>> just a rendering technique?
>>>
>> Yes. They are defined to be equivalent to their
>> decomposed equivalent. Since you are very good at
>> looking at standards when prompted, I'll let you track
>> down the reference as to why they are part of Unicode.
>>
>
> I thought that o + combining-diaeresis and ö were two different things in
> Unicode even when both are rendered equal. Of course, both are defined to
> be "canonically equivalent" via "canonical decomposition" but are not
> defined to be "equivalent".
>
Another case of a misunderstanding, in this case of the word
"canonical". Many Unicode characters, for example Vietnamese letters
that carry more than one combining diacritic on a base character, have
many different decompositions, all of them equivalent. Within this
equivalence class of decompositions of a Unicode character, exactly one
is designated as "canonical".
"Canonical equivalence" is just a way to make the term "equivalence"
even stronger in its mathematical sense, because there are mathematical
equivalence classes that do not have a single canonical member that can
generate the whole equivalence class. In other words, o +
combining-diaeresis and ö are not just equivalent, they are even
canonically equivalent. As far as I remember (and I have followed the
Unicode/ISO10646 discussion since before the two merged), quite a few
characters in Unicode were added to the standard precisely to enable
canonical equivalence classes for all known uses of characters in the
world's languages which are covered by Unicode.
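For a concrete feel for how these equivalence classes behave, here is a
minimal sketch in Python; the unicodedata module implements the standard
normalization forms, and the particular characters (ö and the Vietnamese
ệ) are simply my choice of illustration:

import unicodedata

# "ö" precomposed vs. "o" + COMBINING DIAERESIS: canonically equivalent.
precomposed = "\u00F6"     # ö
decomposed = "o\u0308"     # o followed by U+0308 COMBINING DIAERESIS
assert unicodedata.normalize("NFD", precomposed) == decomposed
assert unicodedata.normalize("NFC", decomposed) == precomposed

# A Vietnamese letter with two combining marks: U+1EC7 (ệ).
# Several distinct code-point sequences are canonically equivalent;
# they all normalize to the same canonical decomposition.
variants = [
    "\u1EC7",           # precomposed e with circumflex and dot below
    "e\u0302\u0323",    # e + circumflex + dot below
    "e\u0323\u0302",    # e + dot below + circumflex
    "\u00EA\u0323",     # ê + dot below
    "\u1EB9\u0302",     # ẹ + circumflex
]
canonical_forms = {unicodedata.normalize("NFD", v) for v in variants}
assert len(canonical_forms) == 1   # one canonical member per class
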
You can get a better sense of how strong the concept of canonical
equivalence is intended to be in the Unicode standard by looking at one
of its intended applications: the recommendation for an application
that provides digital signatures for a Unicode text is to transform it
into its canonical form before calculating the cryptographic hash code
for that text. The reason for this recommendation is that any
Unicode-compliant application is allowed to replace any Unicode
character sequence with any canonically equivalent sequence, so the
clear-text message the recipient sees may be composed of a different
sequence of Unicode characters than the one the sender wrote. This
would break any cryptographic checksum, were it not for the fact that
both the sender and the recipient can calculate it over the *canonical*
form of their versions of the text, which consists of exactly the same
sequence of code points at both ends.
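As a sketch of what that recommendation amounts to in code (again
Python; SHA-256 and the NFC normalization form are just illustrative
choices on my part, not what any particular signing profile mandates):

import hashlib
import unicodedata

def canonical_digest(text):
    # Hash the canonical form, not whatever code-point sequence arrived.
    canonical = unicodedata.normalize("NFC", text)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

sent = "caf\u00E9"          # precomposed é, as the sender typed it
received = "cafe\u0301"     # e + combining acute, substituted in transit

# Digests of the raw code-point sequences differ...
assert hashlib.sha256(sent.encode("utf-8")).hexdigest() != \
       hashlib.sha256(received.encode("utf-8")).hexdigest()
# ...but digests of the canonical form agree at both ends.
assert canonical_digest(sent) == canonical_digest(received)
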
In a nutshell: canonical equivalence is not, as you seem to think, a
weaker form of equivalence, but instead a stronger form of equivalence
("stronger" in a mathematically well-defined way).
- Andreas
PS: Have you read the Unicode standard? It's truly fascinating reading,
every single word of it. I remember seeing this discussed there at
considerable length, with real multi-part character examples,
including a whole bag of mutually equivalent decompositions of a single
character, to illustrate the need not just for equivalence but for the
stronger canonical equivalence.
Received on Tuesday, 2 May 2006 16:49:34 UTC