- From: MURATA Makoto <mmurata@trl.ibm.co.jp>
- Date: Thu, 24 Jan 2002 16:02:42 +0900 (JST)
- To: www-xml-blueberry-comments@w3.org
- cc: mmurata@trl.ibm.co.jp
In the current draft of XML 1.1, many characters affected by character normalization are allowed as name characters. I find that normalization applied to XML documents may lead to strange results when such characters are used as tag names, entity names, or identifiers. I do not know if this is a problem. I am just reporting what could happen.

Hereafter, I use \uXXXX to represent the Unicode character at code point XXXX. In every example below, normalization replaces the sequence \u0041\u0301 with the single character \u00c1. The following examples can be converted to UTF-8 by using

  native2ascii -reverse -encoding utf-8

Case 1: Well-formed -> non-well-formed

First, some well-formed documents become non-well-formed.

Example 1:

  <?xml version="1.0"?>
  <test \u00c1="" \u0041\u0301=""/>

Well, this example may be too artificial. But the next one is more realistic.

Example 2:

  <?xml version="1.0"?>
  <!DOCTYPE test SYSTEM "welltoill2.dtd">
  <test \u00c1=""/>

where welltoill2.dtd is:

  <!ATTLIST test \u0041\u0301 CDATA "default">

Case 2: Non-well-formed -> well-formed

Next, some non-well-formed documents become well-formed.

Example 3:

  <?xml version="1.0"?>
  <\u00c1></\u0041\u0301>

Example 4:

  <?xml version="1.0"?>
  <!DOCTYPE t [
  <!ENTITY \u00c1 "">
  ]>
  <t> &\u0041\u0301; </t>

Both examples are artificial. But if the entity declaration is moved to an external DTD subset, Example 4 becomes more realistic.

Case 3: Valid -> invalid

Next, some valid documents become invalid.

Example 5:

  <?xml version="1.0"?>
  <!DOCTYPE test [
  <!ELEMENT test (p*)>
  <!ELEMENT p EMPTY>
  <!ATTLIST p id ID #REQUIRED>
  ]>
  <test>
  <p id="\u00c1"/>
  <p id="\u0041\u0301"/>
  </test>

Again, this example is artificial. But if we create an external parsed entity containing only the second <p>, it becomes more realistic.

Case 4: Invalid -> valid

Next, some invalid documents become valid.

Example 6:

  <?xml version="1.0"?>
  <!DOCTYPE \u00c1 [
  <!ELEMENT \u00c1 EMPTY>
  ]>
  <\u0041\u0301/>

Again, the DTD can be made external.

Example 7:

  <?xml version="1.0"?>
  <!DOCTYPE t SYSTEM "invalidtovalid2.dtd">
  <t> &\u0041\u0301; </t>

where invalidtovalid2.dtd is:

  <!ENTITY \u00c1 "">

Case 5: Well-formed -> well-formed with a different infoset

Finally, some well-formed documents remain well-formed, but the infoset becomes different.

Example 8:

  <?xml version="1.0"?>
  <!DOCTYPE t [
  <!ENTITY \u00c1 "first">
  <!ENTITY \u0041\u0301 "second">
  ]>
  <t> &\u0041\u0301; </t>

I know that many W3C specs assume early normalization. One could thus argue that we do not have to worry about the examples shown above. However, it is not hard to imagine that some (many?) XML documents are not normalized. I am just worried (but do not know) whether the name characters of XML 1.1 make the situation worse.

Cheers,

Makoto
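
The well-formedness cases above (Examples 1 and 3) can be checked with a short script. This is only a minimal sketch under stated assumptions: it takes "normalization" to mean NFC, it uses Python's standard unicodedata and xml.etree.ElementTree modules (a non-validating XML 1.0 parser, so the validity cases are not covered), and the helper names is_well_formed and nfc are hypothetical, chosen for illustration.

```python
import unicodedata
import xml.etree.ElementTree as ET


def is_well_formed(text: str) -> bool:
    """Return True if the parser accepts `text` as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


def nfc(text: str) -> str:
    """Apply Unicode Normalization Form C (assumed form of normalization)."""
    return unicodedata.normalize("NFC", text)


# Example 1: two attributes whose names differ only before normalization.
EXAMPLE_1 = '<?xml version="1.0"?>\n<test \u00c1="" \u0041\u0301=""/>'

# Example 3: start tag and end tag whose names differ only before normalization.
EXAMPLE_3 = '<?xml version="1.0"?>\n<\u00c1></\u0041\u0301>'

for label, doc in (("Example 1", EXAMPLE_1), ("Example 3", EXAMPLE_3)):
    print(label,
          "| well-formed as written:", is_well_formed(doc),
          "| well-formed after NFC:", is_well_formed(nfc(doc)))
```

Comparing the two columns of the printed result for each example shows whether normalization alone changes the well-formedness verdict for that document.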
Received on Thursday, 24 January 2002 02:07:46 UTC