Re: No Character Normalization?

Kevin Regan wrote:

> If it is the usual case that documents are created in the normalized
> form, then it does not seem like a big issue.  What would happen
> in the case of an editor or application written in Java (Unicode)?

Most people cannot keyboard separate accent marks anyhow: their keyboards
generate the precomposed (normalized) forms.

> Another concern is whether a document can become "de-normalized" during
> transmission.  My previous question was not specific enough. I understand
> that documents can be converted to other character formats. However, I'm
> wondering if a document can leave one application in a normalized form, go
> through various character encodings, and enter another application
> with the characters no longer normalized (e.g.  A Java application to Java
> application might go from Unicode, to UTF-8 for transmission, and then
> back to Unicode in the other application).

No, that would not change the normalization status.  UTF-8 is just another
serialization of the same code points.  Likewise, any passage through an
8-bit character set (other than the bibliographic oddballs I mentioned), or
a legacy double-byte set such as Shift_JIS, would leave the text normalized.
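
A rough Python sketch of the round-trip point, for illustration only.  The
sample strings and the helper name is_nfc are assumptions made for this
example, and NFC (precomposed form) stands in for "normalized":

```python
import unicodedata


def is_nfc(text):
    """Return True if text is already in NFC (hypothetical helper)."""
    return unicodedata.normalize("NFC", text) == text


# Sample data (an assumption for this sketch): e-acute as one code point.
composed = "caf\u00e9"
# The same word with a plain "e" followed by COMBINING ACUTE ACCENT.
decomposed = "cafe\u0301"

# Unicode -> UTF-8 -> Unicode returns exactly the same code points,
# so normalized text stays normalized ...
via_utf8 = composed.encode("utf-8").decode("utf-8")

# ... and non-normalized text stays non-normalized.
decomposed_via_utf8 = decomposed.encode("utf-8").decode("utf-8")

# An 8-bit set such as ISO 8859-1 has no combining accents at all,
# so whatever comes back out of it is in precomposed form.
via_latin1 = composed.encode("latin-1").decode("latin-1")
```

Here is_nfc(via_utf8) and is_nfc(via_latin1) both hold, while
is_nfc(decomposed_via_utf8) does not: transcoding neither adds nor removes
normalization.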

> Finally, you mention that the detection of a non-normalized document
> would aid in the discovery of forgery.  My question is: should similar
> documents with different character models be equivalent?

The trouble is: equivalent to whom?  Normalized and non-normalized XML
documents are typically not equivalent to XML parsers, so treating them
as equivalent for signature purposes is dangerous.
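
A small Python sketch of why this matters, for illustration only.  The
element name, the sample text, and the use of SHA-256 as a stand-in for
whatever digest a signature scheme would use are all assumptions of this
example, not part of any signature proposal:

```python
import hashlib
import unicodedata
import xml.etree.ElementTree as ET

# Two documents that look identical when rendered (sample data, assumed).
composed_doc = "<name>caf\u00e9</name>"      # precomposed e-acute
decomposed_doc = "<name>cafe\u0301</name>"   # "e" + combining acute accent

# The parser hands back the code points as written; it does not normalize.
composed_text = ET.fromstring(composed_doc).text
decomposed_text = ET.fromstring(decomposed_doc).text
parser_sees_same_text = composed_text == decomposed_text

# Only after explicit normalization do the two contents compare equal.
same_after_nfc = (
    unicodedata.normalize("NFC", decomposed_text) == composed_text
)


def digest(document):
    """Hash the UTF-8 bytes of a document (stand-in for a signature digest)."""
    return hashlib.sha256(document.encode("utf-8")).hexdigest()


digests_match = digest(composed_doc) == digest(decomposed_doc)
```

In this sketch parser_sees_same_text and digests_match are both False while
same_after_nfc is True, so a signature that treated the two documents as
equivalent would be vouching for content the parser sees as different.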

-- 

Schlingt dreifach einen Kreis um dies! || John Cowan <jcowan@reutershealth.com>
Schliesst euer Aug vor heiliger Schau,  || http://www.reutershealth.com
Denn er genoss vom Honig-Tau,           || http://www.ccil.org/~cowan
Und trank die Milch vom Paradies.            -- Coleridge (tr. Politzer)

Received on Monday, 26 June 2000 13:36:41 UTC