Re: mover vs latin chars with diacriticals (also MathML support)

juanrgonzaleza@canonicalscience.com wrote:
> Neil Soiffer wrote:
>
>   
>>> Are you sure that Unicode *precomposed characters* are
>>> just a rendering technique?
>>>       
>> Yes.  They are defined to be equivalent to their
>> decomposed equivalent.  Since you are very good at
>> looking at standards when prompted, I'll let you track
>> down the reference as to why they are part of Unicode.
>>     
>
> I thought that o + combining-diaeresis and ö were two different things in
> Unicode even when both are rendered equal. Of course, both are defined to
> be "canonically equivalent" via "canonical decomposition" but are not
> defined to be "equivalent".
>   
Another case of a misunderstanding, this time of the word 
"canonical".  Many Unicode characters, for example those used in 
Vietnamese, which carry more than one combining diacritic on a base 
character, have several different decompositions, all equivalent to one 
another. Within this equivalence class of decompositions of a Unicode 
character, exactly one is designated as "canonical".
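
To see this concretely, here is a small sketch in Python (the choice 
of language is mine; unicodedata is its standard Unicode module, and 
the particular Vietnamese letter is just one illustration):

    import unicodedata

    # Five spellings of Vietnamese U+1EC7 (e with circumflex and dot below):
    forms = [
        "\u1ec7",          # fully precomposed
        "\u00ea\u0323",    # e-with-circumflex + combining dot below
        "\u1eb9\u0302",    # e-with-dot-below + combining circumflex
        "e\u0302\u0323",   # e + circumflex + dot below
        "e\u0323\u0302",   # e + dot below + circumflex
    ]
    # All five are canonically equivalent: they collapse to a single
    # canonical decomposition (NFD) and a single composed form (NFC).
    assert len({unicodedata.normalize("NFD", f) for f in forms}) == 1
    assert len({unicodedata.normalize("NFC", f) for f in forms}) == 1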

"Canonical equivalence" is just a way to make the term "equivalence" 
even stronger in its mathematical sense, because there are mathematical 
equivalence classes that do not have a single canonical member that can 
generate the whole equivalence class. In other words,  o + 
combining-diaeresis and ö are not just equivalent, they are even 
canonically equivalent.  As far as I remember (and I have followed the 
Unicode/ISO10646 discussion since before the two merged), quite a few 
characters in Unicode were added to the standard precisely to enable 
canonical equivalence classes for all known uses of characters in the 
world's languages which are covered by Unicode.
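
In code, the point about o + combining-diaeresis and ö looks like this 
(again only a Python sketch with the standard unicodedata module):

    import unicodedata

    precomposed = "\u00f6"    # ö  (LATIN SMALL LETTER O WITH DIAERESIS)
    decomposed  = "o\u0308"   # o + COMBINING DIAERESIS

    assert precomposed != decomposed  # different code point sequences...
    # ...but canonically equivalent: each normalizes to the other.
    assert unicodedata.normalize("NFD", precomposed) == decomposed
    assert unicodedata.normalize("NFC", decomposed) == precomposed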

You can get a better sense of how strong the concept of canonical 
equivalence is intended to be in the Unicode standard by looking at one 
of its intended applications: an application that provides digital 
signatures for a Unicode text is advised to transform the text into its 
canonical form before calculating the cryptographic hash over it. The 
reason for this recommendation is that any Unicode-compliant 
application is allowed to replace any character sequence with a 
canonically equivalent one, so the clear-text message the recipient 
sees may be composed of a different sequence of Unicode characters than 
the one the sender wrote. This would break any cryptographic checksum, 
were it not for the fact that both the sender and the recipient can 
calculate it from the *canonical* form of their versions of the text, 
which is composed of exactly the same sequence of code points at both 
ends.
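
A minimal sketch of that recommendation (the choice of Python, of 
SHA-256, and of NFC as the canonical form are all mine, purely for 
illustration; the standard does not tie the idea to a particular hash):

    import hashlib
    import unicodedata

    def digest_for_signing(text):
        # Hash the canonical (here: NFC) form, so that any canonically
        # equivalent spelling of the text yields the same digest.
        canonical = unicodedata.normalize("NFC", text)
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

    sent     = "Z\u00fcrich"   # precomposed u-with-diaeresis
    received = "Zu\u0308rich"  # decomposed u + combining diaeresis

    assert sent != received    # an intermediary may have re-encoded it...
    assert digest_for_signing(sent) == digest_for_signing(received)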

In a nutshell: canonical equivalence is not, as you seem to think, a 
weaker form of equivalence, but a stronger one ("stronger" in a 
mathematically well-defined sense).

 - Andreas

PS: Have you read the Unicode standard?  It's truly fascinating 
reading, every single word of it. I remember seeing this discussed 
there at considerable length, with real multi-part character examples, 
including a whole bag of mutually equivalent decompositions of the same 
character, to illustrate the need not just for equivalence but for the 
stronger canonical equivalence.
