
Re: No Character Normalization?

From: Paul Hoffman / IMC <phoffman@imc.org>
Date: Mon, 26 Jun 2000 12:18:46 -0700
Message-Id: <p04320322b57d5b37d8fa@[165.227.249.13]>
To: w3c-ietf-xmldsig@w3.org

At 1:35 PM -0400 6/26/00, John Cowan wrote:
>Kevin Regan wrote:
>
>>  If it is the usual case that documents are created in the normalized
>>  form, then it does not seem like a big issue.  What would happen
>>  in the case of an editor or application written in Java (Unicode)?
>
>Most people do not have the capability of keyboarding separate accent
>marks anyhow (their keyboards generate the normalized forms).

But this is a gross oversimplification of how users might enter 
non-canonicalized characters in a document. An easy example from 
plane zero is U+00BC (VULGAR FRACTION ONE QUARTER). Microsoft Word 
(and other programs) will insert this into a document as its 
uncanonicalized form; Word will even do it behind your back unless 
you turn off Word's default "helpful" auto-correction feature. U+00BC 
canonicalizes into U+0031 followed by U+2044 followed by U+0034.
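
As a rough illustration of the U+00BC example, here is a minimal sketch using
Python's standard unicodedata module (my choice of tool, not something named in
this thread). One caveat: in the Unicode character data this mapping is a
compatibility decomposition, so it shows up under the NFKD/NFKC forms, while
NFC/NFD leave U+00BC as it is. The helper name code_points is hypothetical.

```python
import unicodedata

ONE_QUARTER = "\u00bc"  # VULGAR FRACTION ONE QUARTER


def code_points(text):
    """Return the code points of text as a list of 'U+XXXX' strings."""
    return [f"U+{ord(ch):04X}" for ch in text]


# Compatibility decomposition expands the single character into three:
# DIGIT ONE, FRACTION SLASH, DIGIT FOUR.
decomposed = unicodedata.normalize("NFKD", ONE_QUARTER)
print(code_points(decomposed))  # ['U+0031', 'U+2044', 'U+0034']

# Canonical normalization alone leaves the character unchanged.
composed = unicodedata.normalize("NFC", ONE_QUARTER)
print(code_points(composed))  # ['U+00BC']
```

Which normalization form a signature scheme picks therefore matters: two
documents that differ only in this character would compare equal under NFKC
but not under NFC.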

There are dozens of other common cases of easily-entered
non-canonical forms, and thousands of less common cases that could
still be found without much effort.

--Paul Hoffman, Director
--Internet Mail Consortium
Received on Monday, 26 June 2000 15:18:57 GMT
