Re: No Character Normalization? from John Cowan on 2000-06-26 (w3c-ietf-xmldsig@w3.org from April to June 2000)

From: John Cowan <jcowan@reutershealth.com>
Date: Mon, 26 Jun 2000 16:05:49 -0400
To: Paul Hoffman / IMC <phoffman@imc.org>, "w3c-ietf-xmldsig@w3.org" <w3c-ietf-xmldsig@w3.org>
Message-ID: <3957B79D.5773C469@reutershealth.com>

Paul Hoffman / IMC wrote:

> But this is a gross oversimplification of how users might enter
> non-canonicalized characters in a document. An easy example from
> plane zero is U+00BC (VULGAR FRACTION ONE QUARTER). Microsoft Word
> (and other programs) will insert this into a document as its
> uncanonicalized form; Word will even do it behind your back unless
> you turn off Word's default "helpful" auto-correction feature. U+00BC
> canonicalizes into U+0031 followed by U+2044 followed by U+0034.

That is a compatibility decomposition, useful for specialized purposes,
but not relevant here. The Normalization Form C of U+00BC is simply U+00BC.
Indeed, every Latin-1 document without exception is already in Normalization
Form C, as is every other document in any charset that contains no combining
marks.
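
The distinction can be checked directly. What follows is only a sketch, assuming Python's standard unicodedata module as a stand-in for whatever normalizer is actually used; the variable names are illustrative and not part of any specification.

```python
import unicodedata

quarter = "\u00bc"  # VULGAR FRACTION ONE QUARTER

# Normalization Form C (canonical) leaves the character untouched.
nfc = unicodedata.normalize("NFC", quarter)

# Form KC (compatibility) expands it to DIGIT ONE, FRACTION SLASH, DIGIT FOUR.
nfkc = unicodedata.normalize("NFKC", quarter)

# A string holding every Latin-1 code point once is unchanged by Form C.
latin1 = "".join(chr(cp) for cp in range(0x100))
latin1_is_nfc = unicodedata.normalize("NFC", latin1) == latin1
```

Only the compatibility form rewrites U+00BC; Form C returns it as it was given.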

A document in ISO 5426 or ITU T.61 (ISO-IR-103) is not already normalized,
and if naively converted to Unicode will need to have the full normalization
algorithm run on it.
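
A rough sketch of why, again assuming Python's unicodedata module. The token list and the two-entry MARKS table are hypothetical stand-ins for a real ISO 5426 or T.61 decoder; the only property assumed about those charsets is that a non-spacing diacritic is sent before the base letter it modifies.

```python
import unicodedata

# Hypothetical two-entry table; real charset tables are much larger.
MARKS = {"ACUTE": "\u0301", "DIAERESIS": "\u0308"}

def naive_convert(tokens):
    """Map tokens to Unicode, moving each diacritic after its base letter."""
    out = []
    pending = []  # combining marks waiting for their base letter
    for tok in tokens:
        if tok in MARKS:
            pending.append(MARKS[tok])
        else:
            out.append(tok + "".join(pending))
            pending.clear()
    out.extend(pending)  # keep any dangling marks rather than drop them
    return "".join(out)

# "cafe" with an acute accent sent before the final letter.
naive = naive_convert(["c", "a", "f", "ACUTE", "e"])

# The naive result holds a combining mark, so Form C still has work to do.
normalized = unicodedata.normalize("NFC", naive)
```

Here the naive result ends in U+0065 U+0301, and Form C composes that pair into U+00E9.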

> There are dozens of other common cases of easily-entered
> non-canonical forms, and thousands of less common cases that could
> still be found without much effort.

Can you cite examples?

-- 

Schlingt dreifach einen Kreis um dies! || John Cowan <jcowan@reutershealth.com>
Schliesst euer Aug vor heiliger Schau,  || http://www.reutershealth.com
Denn er genoss vom Honig-Tau,           || http://www.ccil.org/~cowan
Und trank die Milch vom Paradies.            -- Coleridge (tr. Politzer)

Received on Monday, 26 June 2000 16:06:28 UTC