Re: Character canonicalization and XML encoding declarations

From:  Ed Simon <ed.simon@entrust.com>
Resent-Date:  Thu, 9 Sep 1999 15:10:32 -0400 (EDT)
Resent-Message-Id:  <199909091910.PAA23533@www19.w3.org>
Message-ID:  <01E1D01C12D7D211AFC70090273D20B105E74B@sothmxs06.entrust.com>
To:  "'w3c-ietf-xmldsig@w3.org'" <w3c-ietf-xmldsig@w3.org>
Date:  Thu, 9 Sep 1999 15:05:25 -0400 

>Intro:
>The prolog of an XML instance contains an (optional) encoding declaration
>that specifies the XML instance's character encoding.
>For example,
>
>   <?xml version="1.0" encoding="ISO-8859-1"?>
>   <doc>Hello World</doc>
>
>XML documents that do not have an encoding declaration must be encoded
>as either UTF-8 or UTF-16, according to section 4.3.3 of
>"http://www.w3.org/TR/1998/REC-xml-19980210".  A parser can tell the
>difference by checking if there are any Byte Order Marks in the instance.
>(This seems to imply that to be sure of the character encoding, the
>parser must go through the entire XML instance; might this be
>problematic in some applications?)

Actually, the Byte Order Mark is supposed to be the first character
and is part of the encoding, not part of the document...
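
To make the detection concrete, here is a minimal sketch in Python
(the function name detect_encoding is mine, not from the spec) of the
kind of first-bytes check section 4.3.3 permits:

   # Illustrative sketch (not from the REC): decide between UTF-8 and
   # UTF-16 for a document with no encoding declaration by examining
   # only the first few bytes.
   def detect_encoding(data: bytes) -> str:
       if data.startswith(b"\xfe\xff"):      # UTF-16 big-endian BOM
           return "UTF-16BE"
       if data.startswith(b"\xff\xfe"):      # UTF-16 little-endian BOM
           return "UTF-16LE"
       if data.startswith(b"\xef\xbb\xbf"):  # optional UTF-8 BOM
           return "UTF-8"
       return "UTF-8"  # no BOM and no declaration: must be UTF-8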

>Here are a couple of problem scenarios...
>
>Problem scenario 1:
>
>Suppose one wants to canonicalize the character encoding to UTF-8 for
>the following XML instance:
>
>   <?xml version="1.0" encoding="ISO-8859-1"?>
>   <doc>Hello World</doc>
>
>Do we agree the result should be
>
>   <?xml version="1.0" encoding="UTF-8"?>
>   <doc>Hello World</doc>
>
>   (note change is to the encoding declaration)
>
>OR
>
>   <?xml version="1.0" encoding="ISO-8859-1"?>
>   <doc>Hello World</doc>
>
>   (note no change to encoding declaration)
>
>
>I vote for changing the encoding declaration.  Everyone agree?

Not really...  You could go either way here...  I would think that in
many applications you would not want the signature to cover the prolog
encoding attribute, so it should be omitted: either by leaving it out,
which is fine for UTF-8 and UTF-16 (the only two encodings for which
support is mandatory), or by using canonicalizations/transformations
that omit it.  But that kind of begs the question, since we are
defining a canonicalization.  If you did do a signature covering it,
it must be important; likely the document is stored/transmitted in the
specified encoding, so you need the encoding to understand it, and it
should not be changed.  It may seem odd that the signature and
verification processes would re-encode 'encoding="ISO-8859-1"' into
UTF-8 before hashing, but that's the only way to actually sign the
original encoding attribute.  Maybe there is some case where changing
this attribute would change the meaning for an application, a change
you could make undetected if the original attribute were not signed
but rather smashed to UTF-8 and 'encoding="UTF-8"' signed instead.
Because you could go either way, in my mind the deciding factor is
simplicity.  For a minimal canonicalization that does not understand
XML syntax, it seems like too much of a pain to figure out whether you
have a prolog and actually change this attribute.
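
To make the two options concrete, here is a rough sketch in Python
(the helper names are mine, and SHA-1 merely stands in for whatever
digest the signature uses).  Both transcode to UTF-8 before hashing;
only the first rewrites the declaration:

   import hashlib
   import re

   # Option 1 (illustrative): transcode to UTF-8 and rewrite the
   # declaration to match the new encoding.
   def canon_rewrite_declaration(data: bytes, enc: str) -> bytes:
       text = data.decode(enc)
       text = re.sub(r'encoding="[^"]*"', 'encoding="UTF-8"', text, count=1)
       return text.encode("utf-8")

   # Option 2 (illustrative): transcode to UTF-8 but leave the
   # declaration alone, so the original encoding attribute is part of
   # the bytes that get hashed and signed.
   def canon_keep_declaration(data: bytes, enc: str) -> bytes:
       return data.decode(enc).encode("utf-8")

   doc = ('<?xml version="1.0" encoding="ISO-8859-1"?>\n'
          '<doc>Hello World</doc>\n').encode("iso-8859-1")
   for canon in (canon_rewrite_declaration, canon_keep_declaration):
       digest = hashlib.sha1(canon(doc, "iso-8859-1")).hexdigest()
       print(canon.__name__, digest)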

>Problem scenario 2:
>
>Suppose one has the following XML instance:
>
>   <?xml version="1.0" encoding="UCS-4"?>
>   <doc>
>
>   <stuff-that-maps-directly-to-utf8>
>   Nothing but US-ASCII with codes less than 128.
>   </stuff-that-maps-directly-to-utf8>
>
>   <stuff-that-requires-more-complicated-conversion-to-utf8>
>   {Assume there are UCS-2, UCS-4,
>   or other multi-byte encoded characters here.}
>   </stuff-that-requires-more-complicated-conversion-to-utf8>
>
>   </doc>
>
>And suppose we want to sign one of <doc>'s two child elements.  Do we
>require that the extraction mechanism indicate the character encoding
>of the content it is giving us?  I say yes, because we need to know
>how to convert the content to UTF-8.

I agree with you here, but I don't see how a system could ever avoid
indicating the character encoding.  Once extracted, a piece of XML no
longer has the initial <? and the like by which we can figure out its
encoding and character set.  A sequence of bytes in memory can be
ambiguous.  You would want an explicit indication of the encoding
accompanying the extract for any use I can think of.
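
To illustrate, a tiny Python sketch: the same byte sequence is legal
in more than one encoding but decodes to different characters, so an
extract with no encoding label is not self-describing:

   raw = b"\xc3\xa9"
   print(raw.decode("utf-8"))       # one character:  U+00E9
   print(raw.decode("iso-8859-1"))  # two characters: U+00C3 U+00A9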

>One might ask whether, for security (not technical) reasons,
>the character encoding of the original XML instance needs to be
>signed.  If a signature does not capture the original character
>encoding, and if one cannot unambiguously determine the character
>encoding from solely the resultant UTF-8, then does the meaning
>of what was signed become ambiguous?  I say no, because the
>meaning of the UTF-8 is not ambiguous.  Does everyone agree?
>Does this mean there is no requirement to capture the original
>character encoding in the signature?

If you are defining a canonicalization that normalizes encoding to
UTF-8, it should not introduce any explicit indication of the original
character encoding.  Likewise, a canonicalization that normalizes line
endings should not introduce an explicit indication of what the
un-canonicalized line ends were.  If you make explicit note in the
output of everything you are canonicalizing, you might as well have
used the null canonicalization.
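
The line-ending analogy in code, as a minimal sketch of a
lossy-by-design canonicalization: two inputs that differ only in line
ends come out byte-for-byte identical, and nothing in the output
records which form you started from:

   import hashlib

   # Normalize CRLF and bare CR to LF; deliberately record nothing
   # about what the original line ends were.
   def canon_line_ends(data: bytes) -> bytes:
       return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")

   crlf = b"line one\r\nline two\r\n"
   lf = b"line one\nline two\n"
   assert canon_line_ends(crlf) == canon_line_ends(lf)
   print(hashlib.sha1(canon_line_ends(lf)).hexdigest())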

>Regards, Ed

Thanks,
Donald
