RE: [EXI] Unicode 5.0.0 and Regular Expressions charset

Hi Yuri,

Thank you for pointing this out. It was overlooked in the spec, and
we appreciate that you caught it.

We will state in the spec that only the BMP characters in each
category are included in the set of characters used for restricted
character set computation. This keeps '\d' relevant, because
the category "Nd" contains only 230 BMP characters.

Shown below is the number of characters (both total and BMP) contained
in each category, derived from version 5.0.0 of Unicode. Based on this,
the category names that stop the computation should now be
the following.

'L'[ulo]?, 'M'[n]?, 'N', 'P'[o]?, 'S'[mo]? or 'C'[o]?.

Thanks!


65 characters in Cc (65 BMP chars)
138 characters in Cf (33 BMP chars)
137468 characters in Co (6400 BMP chars) x
2048 characters in Cs (2048 BMP chars) x
1634 characters in Ll (1102 BMP chars) x
167 characters in Lm (167 BMP chars)
89344 characters in Lo (44681 BMP chars) x
31 characters in Lt (31 BMP chars)
1320 characters in Lu (836 BMP chars) x
175 characters in Mc (167 BMP chars)
10 characters in Me (10 BMP chars)
880 characters in Mn (602 BMP chars) x
290 characters in Nd (230 BMP chars)
210 characters in Nl (51 BMP chars)
336 characters in No (252 BMP chars)
10 characters in Pc (10 BMP chars)
18 characters in Pd (18 BMP chars)
65 characters in Pe (65 BMP chars)
9 characters in Pf (9 BMP chars)
11 characters in Pi (11 BMP chars)
278 characters in Po (260 BMP chars) x
66 characters in Ps (66 BMP chars)
41 characters in Sc (41 BMP chars)
99 characters in Sk (99 BMP chars)
914 characters in Sm (904 BMP chars) x
2958 characters in So (2350 BMP chars) x
1 characters in Zl (1 BMP chars)
1 characters in Zp (1 BMP chars)
18 characters in Zs (18 BMP chars)
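
As a cross-check of the list above, here is a minimal Python sketch that applies the "at least 255 characters" threshold to the BMP counts. The counts are copied from the list in this message; the names BMP_COUNTS, unrestricted_categories and unrestricted_groups are made up for illustration and do not come from the EXI spec. It assumes that a one-letter category such as 'N' is simply the union of its two-letter subcategories.

```python
# BMP character counts per category, copied from the list above
# (derived from Unicode 5.0.0).
BMP_COUNTS = {
    "Cc": 65, "Cf": 33, "Co": 6400, "Cs": 2048,
    "Ll": 1102, "Lm": 167, "Lo": 44681, "Lt": 31, "Lu": 836,
    "Mc": 167, "Me": 10, "Mn": 602,
    "Nd": 230, "Nl": 51, "No": 252,
    "Pc": 10, "Pd": 18, "Pe": 65, "Pf": 9, "Pi": 11, "Po": 260, "Ps": 66,
    "Sc": 41, "Sk": 99, "Sm": 904, "So": 2350,
    "Zl": 1, "Zp": 1, "Zs": 18,
}

# A set with at least this many characters is treated as not restricted.
LIMIT = 255


def unrestricted_categories(counts, limit=LIMIT):
    """Return the two-letter categories whose BMP count reaches the limit."""
    return sorted(name for name, n in counts.items() if n >= limit)


def unrestricted_groups(counts, limit=LIMIT):
    """Return the one-letter groups whose summed BMP count reaches the limit."""
    totals = {}
    for name, n in counts.items():
        # Assumption: a group is the union of its subcategories.
        totals[name[0]] = totals.get(name[0], 0) + n
    return sorted(group for group, n in totals.items() if n >= limit)


if __name__ == "__main__":
    print(unrestricted_categories(BMP_COUNTS))
    print(unrestricted_groups(BMP_COUNTS))
```

With these counts the sketch reports the same categories that carry an x in the list, and the groups C, L, M, N, P and S. Nd (230) and No (252) stay under the limit individually, while N as a whole (533) does not.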

-taki
 

-----Original Message-----
From: public-exi-comments-request@w3.org [mailto:public-exi-comments-request@w3.org] On Behalf Of Yuri Delendik
Sent: Wednesday, September 24, 2008 8:23 PM
To: public-exi-comments@w3.org
Subject: [EXI] Unicode 5.0.0 and Regular Expressions charset


Hello,

From 7.1.10.1 Restricted Character Sets:
"... If the restricted character set for a datatype contains at least 255 characters or contains non-BMP characters, the character
set of the datatype is not restricted and can be omitted from further consideration..."

Appendix E Deriving Character Sets from XML Schema Regular Expressions explains how to build character sets. It enumerates the character
groups which, if contained in a regular expression atom, cause the charset of the whole expression to be defined as the entire set of
XML characters. One of the exceptions is the multi-character escape "\d". By the XSD definition it is equivalent to the category escape
"\p{Nd}". But according to Unicode 5.0.0's UnicodeData.txt data file, this category contains 290 characters (230 BMP and 60 non-BMP).

The exception of "\d" (and "\p{Nd}") is incorrect: after all processing, the expression "\d" becomes unsuitable for datatype
encoding using a restricted character set, since the set has more than 255 characters and contains non-BMP characters.
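
To make the rule quoted from 7.1.10.1 concrete, here is a small Python sketch of the check as I read it. The function name is_restricted and the constants are hypothetical and not taken from the EXI spec; the code only tests the size and BMP conditions on a set of code points.

```python
BMP_MAX = 0xFFFF  # highest code point in the Basic Multilingual Plane
LIMIT = 255       # "at least 255 characters" means the set is not restricted


def is_restricted(code_points, limit=LIMIT):
    """Return True if the set is small enough and contains only BMP characters."""
    cps = set(code_points)
    return len(cps) < limit and all(cp <= BMP_MAX for cp in cps)


if __name__ == "__main__":
    ascii_digits = set(range(0x30, 0x3A))  # '0'..'9'
    print(is_restricted(ascii_digits))
    # U+1D7CE MATHEMATICAL BOLD DIGIT ZERO is a non-BMP character in Nd.
    print(is_restricted(ascii_digits | {0x1D7CE}))
```

A single non-BMP digit is enough to fail the check, so the full "\p{Nd}" set fails it on both conditions.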

Here are the totals from UnicodeData.txt:
Category      BMP        non-BMP   Total chars Excl.in EXI
\p{Cc}           65         0            65                 
\p{Cf}           33       105           138                ?
\p{Co}            2         4             6      X       
\p{Cs}            6         0             6                 
\p{Ll}         1102       532          1634      X       
\p{Lm}          167         0           167                 
\p{Lo}         6009      1954          7963      X       
\p{Lt}           31         0            31                 
\p{Lu}          836       484          1320      X       
\p{Mc}          167         8           175                ?
\p{Me}           10         0            10                 
\p{Mn}          602       278           880      X       
\p{Nd}          230        60           290                ?
\p{Nl}           51       159           210                ?
\p{No}          252        84           336                ?
\p{Pc}           10         0            10                 
\p{Pd}           18         0            18                 
\p{Pe}           65         0            65                 
\p{Pf}            9         0             9                 
\p{Pi}           11         0            11                 
\p{Po}          260        18           278                ?
\p{Ps}           66         0            66                 
\p{Sc}           41         0            41                 
\p{Sk}           99         0            99                 
\p{Sm}          904        10           914      X      
\p{So}         2350       608          2958      X     
\p{Zl}            1         0             1                 
\p{Zp}            1         0             1                 
\p{Zs}           18         0            18                 
Regards,
Yuri Delendik

Received on Thursday, 30 October 2008 00:12:42 UTC