Re: No Character Normalization?

At 1:35 PM -0400 6/26/00, John Cowan wrote:
>Kevin Regan wrote:
>
>>  If it is the usual case that documents are created in the normalized
>>  form, then it does not seem like a big issue.  What would happen
>>  in the case of an editor or application written in Java (Unicode)?
>
>Most people do not have the capability of keyboarding separate accent
>marks anyhow (their keyboards generate the normalized forms).

But this is a gross oversimplification of how users might enter 
non-canonicalized characters in a document. An easy example from 
plane zero is U+00BC (VULGAR FRACTION ONE QUARTER). Microsoft Word 
(and other programs) will insert this into a document as its 
uncanonicalized form; Word will even do it behind your back unless 
you turn off Word's default "helpful" auto-correction feature. U+00BC 
canonicalizes into U+0031 followed by U+2044 followed by U+0034.
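
For anyone who wants to check the mapping themselves, here is a minimal
sketch using Python's standard unicodedata module. The helper name
code_points is made up for illustration. One caveat, hedged: in the
Unicode normalization forms as I understand them, this particular
expansion is a compatibility decomposition, so it shows up under
NFKC/NFKD, while NFC/NFD leave U+00BC unchanged.

```python
import unicodedata

ONE_QUARTER = "\u00bc"  # VULGAR FRACTION ONE QUARTER


def code_points(text):
    """Hypothetical helper: render a string as space-separated U+XXXX values."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)


# Print what each of the four normalization forms does to U+00BC.
for form in ("NFC", "NFD", "NFKC", "NFKD"):
    normalized = unicodedata.normalize(form, ONE_QUARTER)
    print(form, code_points(normalized))
```

The two compatibility forms should print U+0031 U+2044 U+0034, matching
the sequence given above.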

There are dozens of other common cases of easily-entered 
non-canonical forms, and thousands of less common cases that could 
still be found without much effort.

--Paul Hoffman, Director
--Internet Mail Consortium

Received on Monday, 26 June 2000 15:18:57 UTC