- From: John Boyer <boyerj@ca.ibm.com>
- Date: Wed, 13 Feb 2008 15:29:22 -0800
- To: xml-editor@w3.org
- Message-ID: <OF2169DD41.2C4E8FD0-ON882573EE.007F9DC4-882573EE.008107AC@ca.ibm.com>
Reading the status of the document, one would believe that the erratum E9 change to the characters allowed in tag names and attribute names is the only substantive change. But E11 seems to increase the number of available characters in actual content by moving from Unicode 3.x to Unicode 5.

Some have commented that they believed the sentence "XML processors MUST accept the UTF-8 and UTF-16 encodings of Unicode 3.1" meant that encodings for characters not in Unicode 3.1 were not allowed. I don't read it that harshly, but I can see how they would claim that characters not in Unicode 3.1 should be avoided in content because XML processors are not required to support them, so interop cannot be guaranteed.

Now the sentence is changing so that Unicode 3.1 is effectively being replaced with Unicode 5. Wouldn't it be easier to nip this in the bud now by converting a UTF-8 encoding into the corresponding 32-bit value, regardless of whether or not it maps to something in Unicode K (where K>=5)? Then, you could say which of those 32-bit values are illegal (e.g. the permanently undefined Unicode characters), and which should be avoided (e.g. the compatibility characters).

John M. Boyer, Ph.D.
Senior Technical Staff Member
Lotus Forms Architect and Researcher
Chair, W3C Forms Working Group
Workplace, Portal and Collaboration Software
IBM Victoria Software Lab
E-Mail: boyerj@ca.ibm.com
Blog: http://www.ibm.com/developerworks/blogs/page/JohnBoyer
Blog RSS feed: http://www.ibm.com/developerworks/blogs/rss/JohnBoyer?flavor=rssdw
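
As a rough illustration of the proposal above, the following Python sketch decodes UTF-8 into plain integer code points without consulting any particular Unicode version, and only then applies a policy to the numeric values. The names `classify`, `classify_utf8`, and `AVOID_RANGES` are hypothetical, and the "avoid" range (the CJK Compatibility Ideographs block) is only a stand-in assumption for whatever set of compatibility characters a specification might actually choose. It is not text from the XML Recommendation or its errata.

```python
# Assumption: a single illustrative "should be avoided" range
# (CJK Compatibility Ideographs). A real policy would list its own ranges.
AVOID_RANGES = [(0xF900, 0xFAFF)]


def is_noncharacter(cp):
    """True for the permanently reserved Unicode noncharacters:
    U+FDD0..U+FDEF and the last two code points of every plane."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def classify(cp):
    """Classify a numeric code point by value alone, with no lookup
    in a version-specific character database."""
    if cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF or is_noncharacter(cp):
        return "illegal"
    if any(lo <= cp <= hi for lo, hi in AVOID_RANGES):
        return "avoid"
    return "ok"


def classify_utf8(data):
    """Decode UTF-8 bytes to code points, then classify each one.
    'surrogatepass' lets encoded surrogates through the decoder so that
    the policy step, not the decoder, is what rejects them."""
    text = data.decode("utf-8", errors="surrogatepass")
    return [(ord(ch), classify(ord(ch))) for ch in text]
```

Under this arrangement a code point that is unassigned today, such as U+50000, simply classifies as "ok", so assigning it in a later Unicode version would not change what a processor has to accept.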
Received on Wednesday, 13 February 2008 23:29:48 UTC