Eleventh hour check on XML 1.1 names

I wrote some code to do a last-minute check of the XML 1.1 name rules
before the draft gets published again.  We've taken into account
recommendations from Unicode and W3C i18n, and hope we're done with it.
Comments are therefore urgently solicited on anything that people think
is a show-stopper.

There are three relevant sets of name characters (neglecting here the
difference between name-characters proper and name-start characters):
XML 1.0, the "recommended" set in XML 1.1 Appendix B and the "permitted"
set in XML 1.1 production 4.  Production 4 has changed since the April
draft, so don't try to check my results against that draft.

The "recommended" set is essentially the Unicode 3.0 rules applied to
the Unicode 3.2 tables, with the following four exceptions:

	ideographic characters with canonical decompositions
	characters with compatibility decompositions
	combining characters for symbols
	interlinear annotation characters
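
For anyone who wants to poke at these four exceptions, here is a minimal
sketch in Python.  It is not the code I ran; it assumes that Python's
unicodedata.ucd_3_2_0 object is an adequate stand-in for the Unicode 3.2
tables, and the block ranges and the exception_reason() helper are
hypothetical names chosen for illustration.

```python
import unicodedata

# Unicode 3.2 view of the character database shipped with CPython.
ucd32 = unicodedata.ucd_3_2_0

# Assumed block ranges for this sketch (inclusive code point bounds).
COMPAT_IDEOGRAPH_BLOCKS = [(0xF900, 0xFAFF), (0x2F800, 0x2FA1F)]
SYMBOL_COMBINING_BLOCK = (0x20D0, 0x20FF)
INTERLINEAR_ANNOTATION = (0xFFF9, 0xFFFB)


def in_range(cp, bounds):
    """True if cp lies inside the inclusive (first, last) pair."""
    return bounds[0] <= cp <= bounds[1]


def exception_reason(cp):
    """Return which of the four exceptions applies to cp, or None."""
    decomp = ucd32.decomposition(chr(cp))
    # Compatibility mappings carry a <tag> prefix; canonical ones do not.
    if decomp.startswith("<"):
        return "compatibility decomposition"
    if decomp and any(in_range(cp, b) for b in COMPAT_IDEOGRAPH_BLOCKS):
        return "ideographic with canonical decomposition"
    if in_range(cp, SYMBOL_COMBINING_BLOCK):
        return "combining character for symbols"
    if in_range(cp, INTERLINEAR_ANNOTATION):
        return "interlinear annotation"
    return None


for cp in (0x1E9A, 0xF900, 0x20D0, 0xFFF9, 0x0041):
    print("U+%04X" % cp, exception_reason(cp))
```

The sketch only reports why a character would be excluded; it does not
decide whether the character would otherwise qualify as a name character.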

We guarantee that the XML 1.0 set is a subset of the XML 1.1 "permitted"
set (modulo normalization concerns).  To a first approximation, the 1.0
set should be a subset of the 1.1 "recommended" set.  Furthermore, the
"recommended" rules applied to the Unicode 3.2 repertoire should be,
to a first approximation, a subset of the 1.1 "permitted" set.

Needless to say, neither of these first approximations is quite true.
This posting details the deviations.
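
The subset checks themselves amount to set differences over code point
ranges.  A minimal sketch of that idea, in Python, follows; the expand()
and deviations() helpers are hypothetical, and the two range lists are
toy data for illustration only, not the real XML 1.0 or XML 1.1 tables.

```python
def expand(ranges):
    """Turn inclusive (first, last) code point pairs into a set of ints."""
    return {cp for first, last in ranges for cp in range(first, last + 1)}


def deviations(inner, outer):
    """List code points in `inner` that `outer` lacks, as collapsed runs."""
    missing = sorted(expand(inner) - expand(outer))
    runs = []
    for cp in missing:
        if runs and cp == runs[-1][1] + 1:
            runs[-1][1] = cp          # extend the current run
        else:
            runs.append([cp, cp])     # start a new run
    return [
        "U+%04X" % a if a == b else "U+%04X..U+%04X" % (a, b)
        for a, b in runs
    ]


# Toy ranges only; the real sets are far larger.
toy_inner = [(0x0386, 0x038A), (0x03D0, 0x03D6)]
toy_outer = [(0x0386, 0x0386), (0x0388, 0x038A),
             (0x03D0, 0x03D0), (0x03D3, 0x03D4)]

print(deviations(toy_inner, toy_outer))
```

An empty result means the first set really is a subset of the second;
anything else is a deviation of the kind listed in this posting.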

The following characters were explicitly permitted by XML 1.0 but are
not in the "recommended" 1.1 set:

U+0387 GREEK ANO TELEIA
U+03D1 U+03D2 U+03D5 U+03D6 U+03F0 U+03F1 U+03F2 Greek letter-like symbols
U+0675..U+0678 Arabic "high hamza" letters
U+06DE ARABIC START OF RUB EL HIZB
U+0E33 U+0EB3 THAI/LAO (SARA) AM
U+0F77 U+0F79 TIBETAN VOWEL SIGN VOCALIC RR/LL
U+1E9A LATIN SMALL LETTER A WITH RIGHT HALF RING
U+212E ESTIMATED SYMBOL

If anyone thinks that any of the above characters desperately need to
be in the "recommended" set, please speak up now!

The following characters are not currently permitted by XML 1.1, but
would be in the "recommended" set if they were permitted:

U+200E U+200F LEFT-TO-RIGHT/RIGHT-TO-LEFT MARK
U+202A..U+202E more bidi controls
U+203F U+2040 UNDER/CHARACTER TIE
U+2060..U+2063 new Cf characters
U+206A..U+206F deprecated Cf characters
The U source ideographs.
Hebrew characters with points in the FB1D..FB4E range.
BMP variation selectors.
Halves of double combining marks.
U+FE73 ARABIC TAIL FRAGMENT
U+FEFF ZWNBSP

If anyone thinks these characters desperately need to be in the
XML 1.1 "permitted" set, speak up now!
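
If you want to see what kind of characters these are before speaking up,
the following Python sketch prints the Unicode 3.2 General Category of a
few of the code points listed above.  It assumes Python's
unicodedata.ucd_3_2_0 object matches the Unicode 3.2 tables; the
categories() helper is a hypothetical name, and this is not the code I
used for the check.

```python
import unicodedata

# Unicode 3.2 view of the character database shipped with CPython.
ucd32 = unicodedata.ucd_3_2_0


def categories(code_points):
    """Map each code point to its Unicode 3.2 General Category."""
    return {"U+%04X" % cp: ucd32.category(chr(cp)) for cp in code_points}


# A sample of the single code points named in the list above.
listed = [0x200E, 0x200F, 0x202A, 0x203F, 0x2040, 0x2060, 0x206A, 0xFEFF]

for label, cat in categories(listed).items():
    print(label, cat)
```

Cf is the format-control category; Pc is connector punctuation.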

Thanks.

-- 
John Cowan                              <jcowan@reutershealth.com>
http://www.ccil.org/~cowan              http://www.reutershealth.com
                Charles li reis, nostre emperesdre magnes,
                Set anz totz pleinz ad ested in Espagnes.

Received on Monday, 12 August 2002 20:41:37 UTC