Re: No Character Normalization? from Paul Hoffman / IMC on 2000-06-26 (w3c-ietf-xmldsig@w3.org from April to June 2000)

From: Paul Hoffman / IMC <phoffman@imc.org>
Date: Mon, 26 Jun 2000 12:18:46 -0700
To: w3c-ietf-xmldsig@w3.org
Message-Id: <p04320322b57d5b37d8fa@[165.227.249.13]>

At 1:35 PM -0400 6/26/00, John Cowan wrote:
>Kevin Regan wrote:
>
>>  If it is the usual case that documents are created in the normalized
>>  form, then it does not seem like a big issue.  What would happen
>>  in the case of an editor or application written in Java (Unicode)?
>
>Most people do not have the capability of keyboarding separate accent
>marks anyhow (their keyboards generate the normalized forms).

But this is a gross oversimplification of how users might enter 
non-canonicalized characters in a document. An easy example from 
plane zero is U+00BC (VULGAR FRACTION ONE QUARTER). Microsoft Word 
(and other programs) will insert this into a document as its 
uncanonicalized form; Word will even do it behind your back unless 
you turn off Word's default "helpful" auto-correction feature. U+00BC 
decomposes, via its compatibility mapping (as applied by NFKD and 
NFKC), into U+0031 followed by U+2044 followed by U+0034.
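
A minimal sketch for checking this example, assuming Python's standard 
unicodedata module (not part of the original message; the helper name 
codepoints is hypothetical). It shows that the expansion of U+00BC to 
U+0031 U+2044 U+0034 comes from the compatibility forms, while the 
canonical forms leave U+00BC unchanged:

```python
import unicodedata

QUARTER = "\u00bc"  # VULGAR FRACTION ONE QUARTER


def codepoints(s):
    """Render a string as space-separated U+XXXX code points (hypothetical helper)."""
    return " ".join(f"U+{ord(c):04X}" for c in s)


# Compare the four Unicode normalization forms on the same input.
for form in ("NFC", "NFD", "NFKC", "NFKD"):
    print(form, codepoints(unicodedata.normalize(form, QUARTER)))

# The raw decomposition field carries a <fraction> tag, which marks it
# as a compatibility mapping rather than a canonical one.
print(unicodedata.decomposition(QUARTER))

# For contrast, a separately keyed accent composes under NFC.
print(codepoints(unicodedata.normalize("NFC", "e\u0301")))
```

A signature computed over the bytes of U+00BC will not match one 
computed over the three-character sequence, so which form (if any) an 
application applies before signing matters.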

There are dozens of other common cases of easily-entered 
non-canonical forms, and thousands of less common cases that could 
still be found without much effort.

--Paul Hoffman, Director
--Internet Mail Consortium

Received on Monday, 26 June 2000 15:18:57 UTC