[EXI] Unicode 5.0.0 and Regular Expressions charset

Hello,

From 7.1.10.1 Restricted Character Sets:
"... If the restricted character set for a datatype contains at least 255 characters or contains non-BMP characters, the character set of the datatype is not restricted and can be omitted from further consideration..."

Appendix E, "Deriving Character Sets from XML Schema Regular Expressions", explains how to build character sets. It enumerates character groups that, when they appear in a regular expression atom, cause the character set of the whole expression to be defined as the entire set of XML characters. One of the exceptions is the multi-character escape "\d". By the XSD definition it is equivalent to the category escape "\p{Nd}". But according to the UnicodeData.txt data file of Unicode 5.0.0, this category contains 290 characters (230 BMP and 60 non-BMP).

The exception for "\d" (and "\p{Nd}") is incorrect: after all processing, the expression "\d" turns out to be unsuitable for datatype encoding with a restricted character set, since the set has more than 255 characters and contains non-BMP characters.
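
For illustration, here is a minimal Python sketch of the check described above. It counts the code points of a general category and applies the rule quoted from 7.1.10.1. The function names category_counts and is_restricted are my own, not from the specification. Note that Python's unicodedata module uses whatever Unicode version ships with the interpreter (see unicodedata.unidata_version), not 5.0.0, so the counts it prints will differ from the 230/60 figures above, though the conclusion should be the same.

```python
import sys
import unicodedata


def category_counts(category):
    """Return (bmp, non_bmp) counts of code points in a general category."""
    bmp = non_bmp = 0
    for cp in range(sys.maxunicode + 1):
        if unicodedata.category(chr(cp)) == category:
            if cp <= 0xFFFF:
                bmp += 1
            else:
                non_bmp += 1
    return bmp, non_bmp


def is_restricted(bmp, non_bmp):
    """Apply the rule quoted from 7.1.10.1: a set with at least 255
    characters, or with any non-BMP character, is not restricted."""
    return (bmp + non_bmp) < 255 and non_bmp == 0


bmp, non_bmp = category_counts("Nd")
print(unicodedata.unidata_version, bmp, non_bmp, is_restricted(bmp, non_bmp))
```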

Here are the totals from UnicodeData.txt:
Category     BMP   non-BMP   Total chars   Excl. in EXI
\p{Cc}        65         0            65
\p{Cf}        33       105           138            ?
\p{Co}         2         4             6      X
\p{Cs}         6         0             6
\p{Ll}      1102       532          1634      X
\p{Lm}       167         0           167
\p{Lo}      6009      1954          7963      X
\p{Lt}        31         0            31
\p{Lu}       836       484          1320      X
\p{Mc}       167         8           175            ?
\p{Me}        10         0            10
\p{Mn}       602       278           880      X
\p{Nd}       230        60           290            ?
\p{Nl}        51       159           210            ?
\p{No}       252        84           336            ?
\p{Pc}        10         0            10
\p{Pd}        18         0            18
\p{Pe}        65         0            65
\p{Pf}         9         0             9
\p{Pi}        11         0            11
\p{Po}       260        18           278            ?
\p{Ps}        66         0            66
\p{Sc}        41         0            41
\p{Sk}        99         0            99
\p{Sm}       904        10           914      X
\p{So}      2350       608          2958      X
\p{Zl}         1         0             1
\p{Zp}         1         0             1
\p{Zs}        18         0            18
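
A tally like the one above can be sketched in Python as follows. This is only an assumption about the method: it counts one entry per line of UnicodeData.txt, so a range given as a "First"/"Last" pair contributes two entries rather than the full range (which would be consistent with the small totals for \p{Co} and \p{Cs}). The function name tally and the four sample lines are for illustration; in practice the lines would be read from the real data file.

```python
from collections import defaultdict


def tally(lines):
    """Count UnicodeData.txt entries per general category.

    Returns {category: [bmp_count, non_bmp_count]}. Each line counts
    as one entry; First/Last range pairs are not expanded.
    """
    counts = defaultdict(lambda: [0, 0])
    for line in lines:
        fields = line.strip().split(";")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        code_point = int(fields[0], 16)  # field 0: code point in hex
        category = fields[2]             # field 2: general category
        counts[category][0 if code_point <= 0xFFFF else 1] += 1
    return dict(counts)


# Sample lines in UnicodeData.txt format (not the full file).
sample = [
    "0030;DIGIT ZERO;Nd;0;EN;;0;0;0;N;;;;;",
    "1D7CE;MATHEMATICAL BOLD DIGIT ZERO;Nd;0;EN;<font> 0030;0;0;0;N;;;;;",
    "E000;<Private Use, First>;Co;0;L;;;;;N;;;;;",
    "F8FF;<Private Use, Last>;Co;0;L;;;;;N;;;;;",
]

for category, (bmp, non_bmp) in sorted(tally(sample).items()):
    print(category, bmp, non_bmp, bmp + non_bmp)
```

To run it against the real file, something like tally(open("UnicodeData.txt", encoding="ascii")) should work, assuming the file is in the current directory.
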
Regards,
Yuri Delendik

Received on Thursday, 25 September 2008 03:23:25 UTC