XML 1.0 5th Ed. PER: Unicode upgrade from John Boyer on 2008-02-13 (xml-editor@w3.org from January to March 2008)

From: John Boyer <boyerj@ca.ibm.com>
Date: Wed, 13 Feb 2008 15:29:22 -0800
To: xml-editor@w3.org
Message-ID: <OF2169DD41.2C4E8FD0-ON882573EE.007F9DC4-882573EE.008107AC@ca.ibm.com>

Reading the status of the document, one would believe that the erratum E9 
change to the characters allowed in tag names and attribute names is the 
only substantive change.

But E11 seems to increase the number of available characters in actual 
content by increasing from Unicode 3.x to Unicode 5.

Some have commented that they believed the sentence "XML processors MUST 
accept the UTF-8 and UTF-16 encodings of Unicode 3.1" meant that encodings 
for characters not in Unicode 3.1 were not allowed.  I don't read it that 
harshly, but I can see how they would claim that characters not in Unicode 
3.1 should be avoided in content because XML processors are not required 
to support them, so interop cannot be guaranteed.

Now the sentence is changing so that Unicode 3.1 is effectively being 
replaced with Unicode 5.  Wouldn't it be easier to nip this in the bud now 
by converting a UTF-8 encoding into the corresponding 32-bit value, 
regardless of whether or not it maps to something in Unicode K (where 
K>=5)?  Then, you could say which of those 32-bit values are illegal (e.g. 
the permanently undefined Unicode characters), and which should be avoided 
(e.g. the compatibility characters).
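
To make the suggestion concrete, here is a rough Python sketch, not 
proposed spec text. It decodes UTF-8 to plain integer values and classifies 
each one by value alone, without asking whether any Unicode version has 
assigned it. The names classify, classify_utf8 and AVOID_RANGES are 
hypothetical, the "avoid" list is a single illustrative block rather than 
a complete list of compatibility characters, and the sketch ignores the 
other restrictions in XML's Char production (e.g. control characters).

```python
# Hypothetical and deliberately incomplete "avoid" list: only the
# CJK Compatibility Ideographs block, as one example of compatibility
# characters. A real list would be much longer.
AVOID_RANGES = ((0xF900, 0xFAFF),)


def is_noncharacter(cp: int) -> bool:
    """True for the permanently reserved Unicode noncharacters:
    U+FDD0..U+FDEF and the last two code points of every plane."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def classify(cp: int) -> str:
    """Classify a code point by numeric value only (assumed policy)."""
    if cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF or is_noncharacter(cp):
        return "illegal"
    if any(lo <= cp <= hi for lo, hi in AVOID_RANGES):
        return "avoid"
    return "allowed"


def classify_utf8(data: bytes) -> list[tuple[int, str]]:
    """Decode UTF-8 bytes and classify every resulting code point.
    Python's decoder does not check whether a code point is assigned,
    so the result does not depend on a Unicode version."""
    return [(ord(ch), classify(ord(ch))) for ch in data.decode("utf-8")]
```

For example, classify_utf8(b"\xf0\x9f\x98\x80") yields U+1F600 as 
"allowed" even though that code point was unassigned in Unicode 5, while 
classify(0xFDD0) yields "illegal".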

John M. Boyer, Ph.D.
Senior Technical Staff Member
Lotus Forms Architect and Researcher
Chair, W3C Forms Working Group
Workplace, Portal and Collaboration Software
IBM Victoria Software Lab
E-Mail: boyerj@ca.ibm.com 

Blog: http://www.ibm.com/developerworks/blogs/page/JohnBoyer
Blog RSS feed: http://www.ibm.com/developerworks/blogs/rss/JohnBoyer?flavor=rssdw

Received on Wednesday, 13 February 2008 23:29:48 UTC