- From: John Cowan <jcowan@reutershealth.com>
- Date: Mon, 26 Jun 2000 16:05:49 -0400
- To: Paul Hoffman / IMC <phoffman@imc.org>, "w3c-ietf-xmldsig@w3.org" <w3c-ietf-xmldsig@w3.org>
Paul Hoffman / IMC wrote:

> But this is a gross oversimplification of how users might enter
> non-canonicalized characters in a document. An easy example from
> plane zero is U+00BC (VULGAR FRACTION ONE QUARTER). Microsoft Word
> (and other programs) will insert this into a document as its
> uncanonicalized form; Word will even do it behind your back unless
> you turn off Word's default "helpful" auto-correction feature. U+00BC
> canonicalizes into U+0031 followed by U+2044 followed by U+0034.

That is a compatibility decomposition, useful for specialized purposes,
but not relevant here. The Normalization Form C of U+00BC is simply
U+00BC.

Indeed, every Latin-1 document without exception is already in
Normalization Form C, as is every other document in any charset that
contains no combining marks. A document in ISO 5426 or ITU T.61
(ISO-IR-103) is not already normalized, and if naively converted to
Unicode will need to have the full normalization algorithm run on it.

> There are dozens of other common cases of easily-entered
> non-canonical forms, and thousands of less common cases that could
> still be found without much effort.

Can you cite examples?

-- 
Schlingt dreifach einen Kreis um dies! || John Cowan <jcowan@reutershealth.com>
Schliesst euer Aug vor heiliger Schau, || http://www.reutershealth.com
Denn er genoss vom Honig-Tau,          || http://www.ccil.org/~cowan
Und trank die Milch vom Paradies.      -- Coleridge (tr. Politzer)
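The distinction between canonical and compatibility decomposition discussed in this message can be checked with a short sketch. It assumes Python's standard unicodedata module as a stand-in for whatever normalizer is actually in use; the helper names codepoints and is_nfc are hypothetical, introduced only for illustration.

```python
import unicodedata

ONE_QUARTER = "\u00bc"  # U+00BC VULGAR FRACTION ONE QUARTER


def codepoints(s):
    """Render a string as a space-separated list of U+XXXX code points."""
    return " ".join(f"U+{ord(c):04X}" for c in s)


def is_nfc(s):
    """Return True if s is unchanged by Normalization Form C."""
    return unicodedata.normalize("NFC", s) == s


# Canonical normalization (NFC) leaves U+00BC untouched.
nfc_form = unicodedata.normalize("NFC", ONE_QUARTER)

# Only the compatibility decomposition (NFKD) expands it to
# DIGIT ONE, FRACTION SLASH, DIGIT FOUR.
nfkd_form = unicodedata.normalize("NFKD", ONE_QUARTER)

# Every Latin-1 code point, decoded to Unicode, is already in NFC.
latin1_text = bytes(range(256)).decode("latin-1")

# A base letter followed by a combining mark is not in NFC:
# "e" + COMBINING ACUTE ACCENT composes to U+00E9.
decomposed = "e\u0301"

if __name__ == "__main__":
    print("NFC :", codepoints(nfc_form))
    print("NFKD:", codepoints(nfkd_form))
    print("Latin-1 already NFC:", is_nfc(latin1_text))
    print("e + U+0301 already NFC:", is_nfc(decomposed))
```

The last check stands in for the ISO 5426 and T.61 case only loosely: it shows that text carrying separate combining marks needs normalization, but it does not model the conversion from those charsets.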
Received on Monday, 26 June 2000 16:06:28 UTC