unicode normalization AI response from Galt, Stuart A on 2008-11-24 (public-webcgm-wg@w3.org from November 2008)

From: Galt, Stuart A <stuart.a.galt@boeing.com>
Date: Mon, 24 Nov 2008 15:24:36 -0800
To: "WebCGM WG" <public-webcgm-wg@w3.org>
Message-ID: <C8D2620C74DE75488C5FDFBDB9475D6F0CC7EBC5@XCH-NW-7V1.nw.nos.boeing.com>

Hello all,

I was given the action item to "look at Unicode normalization and try to figure out what it means"...

To maintain compatibility with existing standards, Unicode contains characters that are equivalent to other characters or sequences of characters. Unicode normalization is a transformation that changes these equivalent characters or sequences into the same representation. Unicode defines two kinds of equivalence: canonical and compatibility. Canonical equivalence is where the characters (or sequences) represent the same abstract character and, when displayed, have the same visual appearance and behavior. Compatibility equivalence is weaker: the characters represent the same abstract character but may have different appearances. Examples of compatibility equivalence given by unicode.org [1] are font variants and fractions (1/2 vs ½).
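To make the two kinds of equivalence concrete, here is a minimal sketch using Python's standard unicodedata module. The helper names canonically_equal and compatibility_equal are mine for illustration only; they are not from TR15 or from any WebCGM document.

```python
import unicodedata

# Two ways to encode "é": one precomposed code point, or "e" plus a
# combining acute accent. These are canonically equivalent.
composed = "\u00e9"
decomposed = "e\u0301"

# The "fi" ligature is only compatibility-equivalent to the letters "fi".
ligature = "\ufb01"


def canonically_equal(a, b):
    """Compare two strings after canonical decomposition (NFD)."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


def compatibility_equal(a, b):
    """Compare two strings after compatibility decomposition (NFKD)."""
    return unicodedata.normalize("NFKD", a) == unicodedata.normalize("NFKD", b)
```

A plain comparison of composed and decomposed fails, while canonically_equal treats them as the same. The ligature only matches "fi" under compatibility_equal, which is the weaker relation described above.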

W3C talks about character normalization in Character Model for the World Wide Web 1.0: Normalization [2]. In chapter 3.1 it makes a case for the importance of character normalization and argues that it is an important consideration when doing "string matching" operations. It also becomes important when the data interpreter is not in complete control of the input it will be receiving.

In addition to the kind of normalization, the document talks about when the normalization occurs. Early normalization is done at data creation time, so all the interpreters can assume that "it is all taken care of already". This method is suitable for small and/or closed environments. Late normalization is where the interpreter itself needs to normalize the data it receives.
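As a rough illustration of the early/late distinction, the sketch below separates the producer side from the interpreter side. The function names are hypothetical, and NFC is used only as a placeholder default; nothing here reflects a decision about which form to use.

```python
import unicodedata


def normalize_on_write(text, form="NFC"):
    """Early normalization: the producer normalizes once, at creation time."""
    return unicodedata.normalize(form, text)


def normalize_on_read(text, form="NFC"):
    """Late normalization: the interpreter normalizes whatever it receives."""
    return unicodedata.normalize(form, text)


def is_already_normalized(text, form="NFC"):
    """Check whether text is unchanged by normalization to the given form."""
    return unicodedata.normalize(form, text) == text
```

With early normalization an interpreter could, at most, call is_already_normalized as a sanity check. With late normalization it has to call normalize_on_read on every string it intends to compare.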

I now kind of know what normalization is, and have found that there are libraries in all the usual languages (C/C++, Java, Perl, etc.) that can normalize text. If I understand our use case correctly, it is the mapping of font names in a companion file to system font names. I have no idea what our exposure is to characters that should be normalized, nor how much overhead it would be to use a unicode_normalized_compare instead of a normal string comparison. And if we did decide to use normalized comparisons, I do not know which of the four normalization forms (NFC, NFD, NFKC, NFKD) would be best to use for font names.
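For the font-name use case, a normalized comparison might look something like the sketch below. This is only an assumption about how such a lookup could be written: normalized_equal and find_system_font are hypothetical names, the font names in the usage note are made up, and the default of NFC is a placeholder rather than a recommendation.

```python
import unicodedata

# The four normalization forms defined by Unicode TR15.
FORMS = ("NFC", "NFD", "NFKC", "NFKD")


def normalized_equal(a, b, form="NFC"):
    """Compare two strings after normalizing both to the same form."""
    if form not in FORMS:
        raise ValueError("unknown normalization form: " + form)
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


def find_system_font(requested, system_fonts, form="NFC"):
    """Return the first system font name matching the requested name.

    Both sides are normalized before comparison. Returns None when
    nothing matches.
    """
    for name in system_fonts:
        if normalized_equal(requested, name, form):
            return name
    return None
```

For example, a companion file asking for "Caf\u00e9 Sans" would match a system font reported as "Cafe\u0301 Sans" under any of the four forms. A name written with a full-width letter, such as "\uff21rial", would match "Arial" only under the compatibility forms (NFKC or NFKD), which is one way the choice of form could matter for font names.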

I have the following questions:

* Are we going to normalize all text? Just the font mapping? Other?
* What form will we use? I suspect normalization form C, but we might want some help picking.
* Do we all have to normalize the same way? If not, the above question does not matter.
* What is the cost benefit (cost of not normalizing vs cost of implementation)?
* Are people still awake?

Also, while looking up information I found a Java applet "normalizer" [3]. Okay, I admit that I am easily amused... I also read an interesting paper on normalization presented by Cliff Schmidt at XML Europe 2003 [4].

References:

[1] http://unicode.org/reports/tr15/

[2] http://www.w3.org/TR/charmod-norm/

[3] http://www.unicode.org/unicode/reports/tr15/Normalizer.html

[4] http://dret.net/biblio/reference/sch03a

--
Stuart Galt
SGML Resource Group
stuart.a.galt@boeing.com
(206) 544-3656

Received on Tuesday, 25 November 2008 00:27:15 UTC