Re: No Character Normalization?

Paul Hoffman / IMC wrote:

> But this is a gross oversimplification of how users might enter
> non-canonicalized characters in a document. An easy example from
> plane zero is U+00BC (VULGAR FRACTION ONE QUARTER). Microsoft Word
> (and other programs) will insert this into a document as its
> uncanonicalized form; Word will even do it behind your back unless
> you turn off Word's default "helpful" auto-correction feature. U+00BC
> canonicalizes into U+0031 followed by U+2044 followed by U+0034.

That is a compatibility decomposition, not a canonical one; it is useful
for specialized purposes, but not relevant here. The Normalization Form C
of U+00BC is simply U+00BC.
Indeed, every Latin-1 document without exception is already in Normalization
Form C, as is every other document in any charset that contains no combining
marks.
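
For anyone who wants to check this, here is a minimal sketch using Python's
standard unicodedata module. The variable names are only illustrative, and
the results depend on the Unicode data shipped with the interpreter.

```python
import unicodedata

quarter = "\u00bc"  # VULGAR FRACTION ONE QUARTER

# Canonical forms leave the character alone.
nfc = unicodedata.normalize("NFC", quarter)
nfd = unicodedata.normalize("NFD", quarter)

# Only the compatibility forms expand it to DIGIT ONE, FRACTION SLASH, DIGIT FOUR.
nfkc = unicodedata.normalize("NFKC", quarter)

# Check each code point in the Latin-1 range (U+0000..U+00FF) individually.
latin1_all_nfc = all(
    unicodedata.normalize("NFC", chr(cp)) == chr(cp) for cp in range(0x100)
)

print([f"U+{ord(c):04X}" for c in nfc])
print([f"U+{ord(c):04X}" for c in nfkc])
print(latin1_all_nfc)
```

Because none of these characters is a combining mark, checking them one at a
time is enough to cover any Latin-1 string.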

A document in ISO 5426 or ITU T.61 (ISO-IR-103) is not already normalized,
and if naively converted to Unicode will need to have the full normalization
algorithm run on it.
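
As a rough illustration, assume a hypothetical converter has already mapped
each byte to a code point and placed each combining mark after its base
letter, as Unicode requires. The result is valid Unicode but not yet in
Normalization Form C. The sample string is invented for the example.

```python
import unicodedata

# Assumed output of a naive conversion: 'e' followed by COMBINING ACUTE ACCENT.
naive = "cafe\u0301"

normalized = unicodedata.normalize("NFC", naive)

# The naive string differs from its NFC form, so it was not normalized.
already_nfc = (naive == normalized)

print(len(naive), len(normalized))
print(already_nfc)
```

Here the base letter and the combining mark compose into the single
character U+00E9.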

> There are dozens of other common cases of easily-entered
> non-canonical forms, and thousands of less common cases that could
> still be found without much effort.

Can you cite examples?

-- 

Schlingt dreifach einen Kreis um dies! || John Cowan <jcowan@reutershealth.com>
Schliesst euer Aug vor heiliger Schau,  || http://www.reutershealth.com
Denn er genoss vom Honig-Tau,           || http://www.ccil.org/~cowan
Und trank die Milch vom Paradies.            -- Coleridge (tr. Politzer)

Received on Monday, 26 June 2000 16:06:28 UTC