- From: Taki Kamiya <tkamiya@us.fujitsu.com>
- Date: Wed, 29 Oct 2008 17:11:58 -0700
- To: "'Yuri Delendik'" <yury_exi@yahoo.com>, <public-exi-comments@w3.org>
Hi Yuri,

Thank you for pointing this out. It was overlooked in the spec, and we appreciate that you caught it.

We will mention in the spec that only the BMP characters in each category are included in the set of characters used in the restricted character set computation. This should keep '\d' relevant, because the category "Nd" contains only 230 BMP characters.

Shown below is the number of characters (both total and BMP) contained in each category, derived from version 5.0.0 of Unicode. Based on this, the category names that cause the computation to stop should now consist of the following:

  'L'[ulo]?, 'M'[n]?, 'N', 'P'[o]?, 'S'[mo]? or 'C'[o]?

Thanks!

      65 characters in Cc (65 BMP chars)
     138 characters in Cf (33 BMP chars)
  137468 characters in Co (6400 BMP chars) x
    2048 characters in Cs (2048 BMP chars) x
    1634 characters in Ll (1102 BMP chars) x
     167 characters in Lm (167 BMP chars)
   89344 characters in Lo (44681 BMP chars) x
      31 characters in Lt (31 BMP chars)
    1320 characters in Lu (836 BMP chars) x
     175 characters in Mc (167 BMP chars)
      10 characters in Me (10 BMP chars)
     880 characters in Mn (602 BMP chars) x
     290 characters in Nd (230 BMP chars)
     210 characters in Nl (51 BMP chars)
     336 characters in No (252 BMP chars)
      10 characters in Pc (10 BMP chars)
      18 characters in Pd (18 BMP chars)
      65 characters in Pe (65 BMP chars)
       9 characters in Pf (9 BMP chars)
      11 characters in Pi (11 BMP chars)
     278 characters in Po (260 BMP chars) x
      66 characters in Ps (66 BMP chars)
      41 characters in Sc (41 BMP chars)
      99 characters in Sk (99 BMP chars)
     914 characters in Sm (904 BMP chars) x
    2958 characters in So (2350 BMP chars) x
       1 characters in Zl (1 BMP chars)
       1 characters in Zp (1 BMP chars)
      18 characters in Zs (18 BMP chars)

-taki

-----Original Message-----
From: public-exi-comments-request@w3.org [mailto:public-exi-comments-request@w3.org] On Behalf Of Yuri Delendik
Sent: Wednesday, September 24, 2008 8:23 PM
To: public-exi-comments@w3.org
Subject: [EXI] Unicode 5.0.0 and Regular Expressions charset

Hello,

From 7.1.10.1 Restricted Character Sets:

"... If the restricted character set for a datatype contains at least 255 characters or contains non-BMP characters, the character set of the datatype is not restricted and can be omitted from further consideration..."

Appendix E, Deriving Character Sets from XML Schema Regular Expressions, explains how to build character sets. It enumerates the character groups that, if contained in a regular expression atom, cause the charset of the whole expression to be defined as the entire set of XML characters. One of the exceptions is the multi-character escape "\d". By the XSD definition it is equivalent to the category escape "\p{Nd}". But according to the UnicodeData.txt data file of Unicode 5.0.0, this category contains 290 characters (230 BMP and 60 non-BMP).

The exception for "\d" (and "\p{Nd}") is incorrect: after all processing, the expression "\d" becomes unsuitable for datatype encoding using a restricted character set, since the set has more than 255 characters and contains non-BMP characters.

Here are the totals from UnicodeData.txt:

  Category    BMP   non-BMP   Total chars   Excl. in EXI
  \p{Cc}       65         0            65
  \p{Cf}       33       105           138   ?
  \p{Co}        2         4             6   X
  \p{Cs}        6         0             6
  \p{Ll}     1102       532          1634   X
  \p{Lm}      167         0           167
  \p{Lo}     6009      1954          7963   X
  \p{Lt}       31         0            31
  \p{Lu}      836       484          1320   X
  \p{Mc}      167         8           175   ?
  \p{Me}       10         0            10
  \p{Mn}      602       278           880   X
  \p{Nd}      230        60           290   ?
  \p{Nl}       51       159           210   ?
  \p{No}      252        84           336   ?
  \p{Pc}       10                      10
  \p{Pd}       18         0            18
  \p{Pe}       65         0            65
  \p{Pf}        9         0             9
  \p{Pi}       11         0            11
  \p{Po}      260        18           278   ?
  \p{Ps}       66         0            66
  \p{Sc}       41         0            41
  \p{Sk}       99         0            99
  \p{Sm}      904        10           914   X
  \p{So}     2350       608          2958   X
  \p{Zl}        1         0             1
  \p{Zp}        1         0             1
  \p{Zs}       18         0            18

Regards,
Yuri Delendik
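
For readers who want to recompute per-category totals like those in this thread, the following is a minimal Python sketch, not part of either message. It assumes the standard `unicodedata` module, which reflects the Unicode version bundled with the interpreter rather than 5.0.0, so most counts will differ from the figures quoted here. The helper names `count_by_category` and `stops_computation` are hypothetical. The "at least 255 BMP characters" test is one reading of the rule proposed in the reply.

```python
import sys
import unicodedata
from collections import Counter

BMP_LIMIT = 0x10000  # code points below this value lie in the BMP
THRESHOLD = 255      # assumed from the quoted text: "at least 255 characters"


def count_by_category():
    """Count code points per general category, in total and in the BMP only."""
    total = Counter()
    bmp = Counter()
    for cp in range(sys.maxunicode + 1):
        cat = unicodedata.category(chr(cp))
        total[cat] += 1
        if cp < BMP_LIMIT:
            bmp[cat] += 1
    return total, bmp


def stops_computation(bmp_count):
    """Assumed rule: a category with at least THRESHOLD BMP characters is too large."""
    return bmp_count >= THRESHOLD


if __name__ == "__main__":
    total, bmp = count_by_category()
    print("Unicode version:", unicodedata.unidata_version)
    for cat in sorted(total):
        if cat == "Cn":  # skip unassigned code points
            continue
        mark = " x" if stops_computation(bmp[cat]) else ""
        print(f"{total[cat]:7d} characters in {cat} ({bmp[cat]} BMP chars){mark}")
```

Because `unicodedata` expands ranged entries such as private-use areas and surrogates, its Co and Cs counts correspond to the expanded figures in the reply (137468 and 2048), not to the raw UnicodeData.txt line counts in the original message.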
Received on Thursday, 30 October 2008 00:12:42 UTC