Character canonicalization and XML encoding declarations

Intro:
The prolog of an XML instance contains an (optional) encoding declaration
that specifies the XML instance's character encoding.
For example,

   <?xml version="1.0" encoding="ISO-8859-1"?>
   <doc>Hello World</doc>

XML documents that do not have an encoding declaration must be encoded
as either UTF-8 or UTF-16, according to section 4.3.3 of
"http://www.w3.org/TR/1998/REC-xml-19980210".  A parser can tell the
difference by checking for a Byte Order Mark: that section requires a
UTF-16 entity to begin with one.  (Since the mark has to come first, the
parser should be able to settle the character encoding from the opening
bytes without going through the entire XML instance.)
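
As a rough illustration of the detection step, the Python sketch below
looks only at the first bytes of an instance: a Byte Order Mark if one is
present, otherwise the encoding declaration, otherwise the UTF-8 default.
The name `sniff_encoding` and the 200-byte window are my own assumptions.
It handles only ASCII-compatible declarations and ignores UCS-4 and
EBCDIC, so it is not a complete autodetection routine.

```python
import codecs
import re

# Encoding pseudo-attribute of an XML declaration at the very start of the data.
# Assumption: the declaration is written in an ASCII-compatible encoding.
_DECL = re.compile(
    rb'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def sniff_encoding(data: bytes) -> str:
    """Guess the encoding of an XML instance from its opening bytes only."""
    # A UTF-16 Byte Order Mark, in either byte order, comes first if present.
    # (UCS-4 marks are not distinguished here; this is only a sketch.)
    if data.startswith((codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE)):
        return "UTF-16"
    if data.startswith(codecs.BOM_UTF8):
        return "UTF-8"
    # No mark: fall back to the encoding declaration, if there is one.
    match = _DECL.match(data[:200])
    if match:
        return match.group(1).decode("ascii")
    # No mark and no declaration: the instance has to be UTF-8.
    return "UTF-8"
```

Nothing past the prolog is read, so the cost does not grow with the size
of the instance.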

Here are a couple of problem scenarios.

Problem scenario 1:

Suppose one wants to canonicalize the character encoding to UTF-8 for
the following XML instance:


   <?xml version="1.0" encoding="ISO-8859-1"?>
   <doc>Hello World</doc>

Do we agree the result should be

   <?xml version="1.0" encoding="UTF-8"?>
   <doc>Hello World</doc>

   (note the change to the encoding declaration)

OR

   <?xml version="1.0" encoding="ISO-8859-1"?>
   <doc>Hello World</doc>

   (note no change to encoding declaration)


I vote for changing the encoding declaration.  Everyone agree?
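
To make the first option concrete, here is a minimal Python sketch that
transcodes an instance to UTF-8 and rewrites the encoding declaration in
the same step. The function name `to_utf8` and the regular expression are
illustrative assumptions, not part of any proposed canonicalization
algorithm, and the caller is assumed to know the source encoding already.

```python
import re

# Encoding pseudo-attribute of an XML declaration at the start of the text.
# Group 1 is everything up to the quote, group 2 is the quote character.
_ENC = re.compile(
    r'^(<\?xml[^>]*?encoding\s*=\s*)(["\'])[A-Za-z][A-Za-z0-9._-]*\2'
)


def to_utf8(data: bytes, source_encoding: str) -> bytes:
    """Re-encode an XML instance as UTF-8 and update its declaration."""
    text = data.decode(source_encoding)
    # Rewrite the declared encoding so it matches the bytes we emit.
    text = _ENC.sub(r'\g<1>\g<2>UTF-8\g<2>', text, count=1)
    return text.encode("utf-8")
```

If the declaration were left saying ISO-8859-1, a parser would read the
UTF-8 bytes of any non-ASCII character as two or more ISO-8859-1
characters, which is the main argument for changing it.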



Problem scenario 2:

Suppose one has the following XML instance:

   <?xml version="1.0" encoding="UCS-4"?>
   <doc>

   <stuff-that-maps-directly-to-utf8>
   Nothing but US-ASCII with codes less than 128.
   </stuff-that-maps-directly-to-utf8>

   <stuff-that-requires-more-complicated-conversion-to-utf8>
   {Assume there are characters here with codes of 128 or more,
   which need multi-byte sequences in UTF-8.}
   </stuff-that-requires-more-complicated-conversion-to-utf8>

   </doc>

And suppose we want to sign one of <doc>'s two child elements.  Do we
require that the extraction mechanism indicate the character encoding
of the content it is giving us?  I say yes, because we need to know
how to convert the content to UTF-8.
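
A small Python sketch of why the extraction mechanism has to report the
encoding: the same bytes give different UTF-8 depending on the label that
comes with them. The `Extracted` type and `canonical_bytes` function are
hypothetical names, and Python's "utf-32-be" codec stands in for
big-endian UCS-4.

```python
from typing import NamedTuple


class Extracted(NamedTuple):
    """Hypothetical result of an extraction step: raw bytes plus their encoding."""
    content: bytes
    encoding: str


def canonical_bytes(item: Extracted) -> bytes:
    """Convert extracted content to UTF-8 using the encoding it was labelled with."""
    return item.content.decode(item.encoding).encode("utf-8")


# Four bytes per character in the source, one byte per character in UTF-8.
ascii_part = Extracted("Hello".encode("utf-32-be"), "utf-32-be")
utf8_ascii = canonical_bytes(ascii_part)

# The single byte 0xA4 is a different character under two different labels,
# so the converter cannot guess the encoding from the bytes alone.
as_latin1 = canonical_bytes(Extracted(b"\xa4", "iso-8859-1"))
as_latin9 = canonical_bytes(Extracted(b"\xa4", "iso-8859-15"))
```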

One might ask whether, for security (not technical) reasons,
the character encoding of the original XML instance needs to be
signed.  If a signature does not capture the original character
encoding, and if one cannot unambiguously determine that encoding
solely from the resulting UTF-8, then does the meaning of what was
signed become ambiguous?  I say no, because the meaning of the UTF-8
is not ambiguous.  Does everyone agree?  Does this mean there is no
requirement to capture the original character encoding in the
signature?
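
For what it is worth, a quick Python check of the claim: the same
characters arriving in two different encodings canonicalize to identical
UTF-8, so a digest over the UTF-8 carries no trace of the original
encoding. SHA-256 is only a stand-in for whatever digest the signature
would use, and `digest_of_canonical` is a hypothetical helper.

```python
import hashlib


def digest_of_canonical(data: bytes, source_encoding: str) -> str:
    """Digest the UTF-8 form of the content, not the bytes as received."""
    utf8 = data.decode(source_encoding).encode("utf-8")
    return hashlib.sha256(utf8).hexdigest()


text = "<doc>Hello World</doc>"

# Same characters, two different encodings on the wire.
from_latin1 = digest_of_canonical(text.encode("iso-8859-1"), "iso-8859-1")
from_utf16 = digest_of_canonical(text.encode("utf-16"), "utf-16")

# The digests agree, so a verifier cannot tell which encoding was signed.
same_digest = from_latin1 == from_utf16
```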

Regards, Ed

Received on Thursday, 9 September 1999 15:10:25 UTC