- From: Yuri Delendik <yury_exi@yahoo.com>
- Date: Wed, 24 Sep 2008 20:22:44 -0700 (PDT)
- To: public-exi-comments@w3.org
Hello,

From 7.1.10.1 Restricted Character Sets:

"... If the restricted character set for a datatype contains at least 255 characters or contains non-BMP characters, the character set of the datatype is not restricted and can be omitted from further consideration..."

Appendix E, Deriving Character Sets from XML Schema Regular Expressions, explains how to build character sets. It enumerates the character groups that, when contained in a regular expression atom, cause the character set of the whole expression to be defined as the entire set of XML characters. One of the exceptions is the multi-character escape "\d". By the XSD definition it is equivalent to the category escape "\p{Nd}". But according to the UnicodeData.txt data file of Unicode 5.0.0, this category contains 290 characters (230 BMP and 60 non-BMP).

The exception for "\d" (and "\p{Nd}") is incorrect: after all processing, the expression "\d" is unsuitable for datatype encoding with a restricted character set, since the set has more than 255 characters and contains non-BMP characters.

Here are the totals from UnicodeData.txt:

Category   BMP  non-BMP  Total chars  Excl. in EXI
\p{Cc}      65        0           65
\p{Cf}      33      105          138  ?
\p{Co}       2        4            6  X
\p{Cs}       6        0            6
\p{Ll}    1102      532         1634  X
\p{Lm}     167        0          167
\p{Lo}    6009     1954         7963  X
\p{Lt}      31        0           31
\p{Lu}     836      484         1320  X
\p{Mc}     167        8          175  ?
\p{Me}      10        0           10
\p{Mn}     602      278          880  X
\p{Nd}     230       60          290  ?
\p{Nl}      51      159          210  ?
\p{No}     252       84          336  ?
\p{Pc}      10        0           10
\p{Pd}      18        0           18
\p{Pe}      65        0           65
\p{Pf}       9        0            9
\p{Pi}      11        0           11
\p{Po}     260       18          278  ?
\p{Ps}      66        0           66
\p{Sc}      41        0           41
\p{Sk}      99        0           99
\p{Sm}     904       10          914  X
\p{So}    2350      608         2958  X
\p{Zl}       1        0            1
\p{Zp}       1        0            1
\p{Zs}      18        0           18

Regards,
Yuri Delendik
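
The per-category totals and the rule quoted from 7.1.10.1 can be cross-checked with a short script. What follows is only a minimal sketch under stated assumptions: it uses Python's standard unicodedata module rather than the Unicode 5.0.0 UnicodeData.txt file, so it reflects a newer Unicode version and counts every code point inside a range individually. Its numbers will therefore not match the table exactly. The helper names category_counts and is_restricted are hypothetical, chosen for illustration.

```python
import sys
import unicodedata
from collections import Counter

BMP_LIMIT = 0x10000     # first code point outside the Basic Multilingual Plane
MAX_RESTRICTED = 255    # threshold quoted from section 7.1.10.1


def category_counts():
    """Count assigned code points per general category, split BMP / non-BMP.

    Assumption: counts come from the Unicode version bundled with Python,
    not from Unicode 5.0.0.
    """
    bmp, non_bmp = Counter(), Counter()
    for cp in range(sys.maxunicode + 1):
        cat = unicodedata.category(chr(cp))
        if cat == "Cn":  # skip unassigned code points
            continue
        if cp < BMP_LIMIT:
            bmp[cat] += 1
        else:
            non_bmp[cat] += 1
    return bmp, non_bmp


def is_restricted(bmp_count, non_bmp_count):
    """Apply the quoted rule: a set stays restricted only if it has
    fewer than 255 characters and no non-BMP characters."""
    total = bmp_count + non_bmp_count
    return non_bmp_count == 0 and total < MAX_RESTRICTED


if __name__ == "__main__":
    bmp, non_bmp = category_counts()
    for cat in sorted(set(bmp) | set(non_bmp)):
        b, n = bmp[cat], non_bmp[cat]
        flag = "restricted" if is_restricted(b, n) else "not restricted"
        print(f"\\p{{{cat}}} {b:>7} {n:>7} {b + n:>7}  {flag}")
```

Under these assumptions the Nd row should come out as "not restricted", which is the point made above about "\d".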
Received on Thursday, 25 September 2008 03:23:25 UTC