- From: Ed Simon <ed.simon@entrust.com>
- Date: Thu, 9 Sep 1999 15:05:25 -0400
- To: "'w3c-ietf-xmldsig@w3.org'" <w3c-ietf-xmldsig@w3.org>
Intro:

The prolog of an XML instance contains an (optional) encoding declaration that specifies the XML instance's character encoding. For example,

  <?xml version="1.0" encoding="ISO-8859-1"?>
  <doc>Hello World</doc>

According to section 4.3.3 of the XML 1.0 Recommendation (http://www.w3.org/TR/1998/REC-xml-19980210), XML documents that do not have an encoding declaration must be encoded as either UTF-8 or UTF-16. A parser can tell the difference by checking whether there are any Byte Order Marks in the instance. (This seems to imply that, to be sure of the character encoding, the parser must go through the entire XML instance; might this be problematic in some applications?)

Here are a couple of problem scenarios.

Problem scenario 1:

Suppose one wants to canonicalize the character encoding to UTF-8 for the following XML instance:

  <?xml version="1.0" encoding="ISO-8859-1"?>
  <doc>Hello World</doc>

Do we agree the result should be

  <?xml version="1.0" encoding="UTF-8"?>
  <doc>Hello World</doc>

(note the change to the encoding declaration)

OR

  <?xml version="1.0" encoding="ISO-8859-1"?>
  <doc>Hello World</doc>

(note no change to the encoding declaration)?

I vote for changing the encoding declaration. Does everyone agree?

Problem scenario 2:

Suppose one has the following XML instance:

  <?xml version="1.0" encoding="UCS-4"?>
  <doc>
    <stuff-that-maps-directly-to-utf8>
      Nothing but US-ASCII with codes less than 128.
    </stuff-that-maps-directly-to-utf8>
    <stuff-that-requires-more-complicated-conversion-to-utf8>
      {Assume there are UCS-2, UCS-4, or other multi-byte character encoding here.}
    </stuff-that-requires-more-complicated-conversion-to-utf8>
  </doc>

And suppose we want to sign one of <doc>'s two child elements. Do we require that the extraction mechanism indicate the character encoding of the content it is giving us? I say yes, because we need to know how to convert the content to UTF-8.

One might ask whether, for security (not technical) reasons, the character encoding of the original XML instance needs to be signed.
If a signature does not capture the original character encoding, and if one cannot unambiguously determine the character encoding solely from the resulting UTF-8, then does the meaning of what was signed become ambiguous? I say no, because the meaning of the UTF-8 is not ambiguous. Does everyone agree? Does this mean there is no requirement to capture the original character encoding in the signature?

Regards,
Ed
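
For illustration, problem scenario 1 might be sketched in Python roughly as follows. This is only a sketch under stated assumptions: the helper name to_utf8 and the regular expression are hypothetical (they come from no XML library), the caller is assumed to already know the source encoding, as argued for the extraction mechanism in scenario 2, and the pattern handles only the simple form of the declaration with no whitespace around "=".

```python
import re

# Hypothetical pattern (not from any XML library): finds the encoding
# pseudo-attribute of an XML declaration at the very start of the text.
# Simplification: no whitespace is allowed around "=".
_ENCODING_DECL = re.compile(
    r'''^(<\?xml[^>]*?encoding=)(["'])[A-Za-z][A-Za-z0-9._-]*\2'''
)


def to_utf8(raw: bytes, source_encoding: str) -> bytes:
    """Transcode an XML instance to UTF-8 and update its encoding declaration.

    source_encoding must be supplied by the caller (for example, read from
    the declaration or reported by the extraction mechanism); this sketch
    makes no attempt to detect it.
    """
    # Decode the octets into characters using the known source encoding.
    text = raw.decode(source_encoding)
    # Replace only the declared encoding name, keeping the original quotes.
    text = _ENCODING_DECL.sub(r"\g<1>\g<2>UTF-8\g<2>", text, count=1)
    # Re-encode the same characters as UTF-8.
    return text.encode("utf-8")
```

Under these assumptions, feeding the same characters in as ISO-8859-1 octets or as UTF-16 octets produces identical UTF-8 output with the declaration changed to encoding="UTF-8". That is the sense in which the resulting UTF-8 can be unambiguous even though the original encoding is no longer recoverable from it.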
Received on Thursday, 9 September 1999 15:10:25 UTC