
Re: Need some examples of characters that match this regex: \p[N]

From: Michael Kay <mike@saxonica.com>
Date: Fri, 31 Dec 2010 10:18:07 +0000
Message-ID: <4D1DADDF.5000502@saxonica.com>
To: "Costello, Roger L." <costello@mitre.org>
CC: "xmlschema-dev@w3.org" <xmlschema-dev@w3.org>
On 30/12/2010 19:32, Costello, Roger L. wrote:
> Hi Folks,
>
> The specification says that this regex matches any numeric character in any language:
>
>      \p[N]
>
>
You mean \p{N}

> Can you please provide me with examples of other characters that match?
>
>
Download and unzip
http://www.unicode.org/Public/6.0.0/ucdxml/ucd.all.flat.zip into, say,
e:\temp\ucd.all.flat.xml. Then run the following query to find all
characters in category Nd (or, mutatis mutandis, any other category; for
a single-letter category like 'N', match all entries where @gc starts
with N, e.g. with the predicate [starts-with(@gc, 'N')]):

for $c in doc('ucd.all.flat.xml')//*:char[@gc='Nd']
return <c cp="{$c/@cp}" gc="{$c/@gc}" name="{$c/@na}"/>

for example by using the command

java -Xmx1024m -cp e:\saxon\saxon9he.jar net.sf.saxon.Query -t -q:e:\temp\test.xq !indent=yes

(it's a large file so it needs the extra memory).

Here's a sample from the output:

<c cp="0AE6" name="GUJARATI DIGIT ZERO" gc="Nd"/>
<c cp="0AE7" name="GUJARATI DIGIT ONE" gc="Nd"/>
<c cp="0AE8" name="GUJARATI DIGIT TWO" gc="Nd"/>
<c cp="0AE9" name="GUJARATI DIGIT THREE" gc="Nd"/>
<c cp="0AEA" name="GUJARATI DIGIT FOUR" gc="Nd"/>
<c cp="0AEB" name="GUJARATI DIGIT FIVE" gc="Nd"/>
<c cp="0AEC" name="GUJARATI DIGIT SIX" gc="Nd"/>
<c cp="0AED" name="GUJARATI DIGIT SEVEN" gc="Nd"/>
<c cp="0AEE" name="GUJARATI DIGIT EIGHT" gc="Nd"/>
<c cp="0AEF" name="GUJARATI DIGIT NINE" gc="Nd"/>
<c cp="0B66" name="ORIYA DIGIT ZERO" gc="Nd"/>
<c cp="0B67" name="ORIYA DIGIT ONE" gc="Nd"/>
<c cp="0B68" name="ORIYA DIGIT TWO" gc="Nd"/>
<c cp="0B69" name="ORIYA DIGIT THREE" gc="Nd"/>
<c cp="0B6A" name="ORIYA DIGIT FOUR" gc="Nd"/>
<c cp="0B6B" name="ORIYA DIGIT FIVE" gc="Nd"/>
<c cp="0B6C" name="ORIYA DIGIT SIX" gc="Nd"/>
<c cp="0B6D" name="ORIYA DIGIT SEVEN" gc="Nd"/>
<c cp="0B6E" name="ORIYA DIGIT EIGHT" gc="Nd"/>
<c cp="0B6F" name="ORIYA DIGIT NINE" gc="Nd"/>
<c cp="0BE6" name="TAMIL DIGIT ZERO" gc="Nd"/>
<c cp="0BE7" name="TAMIL DIGIT ONE" gc="Nd"/>
<c cp="0BE8" name="TAMIL DIGIT TWO" gc="Nd"/>
<c cp="0BE9" name="TAMIL DIGIT THREE" gc="Nd"/>
<c cp="0BEA" name="TAMIL DIGIT FOUR" gc="Nd"/>
<c cp="0BEB" name="TAMIL DIGIT FIVE" gc="Nd"/>
<c cp="0BEC" name="TAMIL DIGIT SIX" gc="Nd"/>
<c cp="0BED" name="TAMIL DIGIT SEVEN" gc="Nd"/>
<c cp="0BEE" name="TAMIL DIGIT EIGHT" gc="Nd"/>
<c cp="0BEF" name="TAMIL DIGIT NINE" gc="Nd"/>
<c cp="0C66" name="TELUGU DIGIT ZERO" gc="Nd"/>
<c cp="0C67" name="TELUGU DIGIT ONE" gc="Nd"/>
<c cp="0C68" name="TELUGU DIGIT TWO" gc="Nd"/>
<c cp="0C69" name="TELUGU DIGIT THREE" gc="Nd"/>
<c cp="0C6A" name="TELUGU DIGIT FOUR" gc="Nd"/>
<c cp="0C6B" name="TELUGU DIGIT FIVE" gc="Nd"/>
<c cp="0C6C" name="TELUGU DIGIT SIX" gc="Nd"/>
<c cp="0C6D" name="TELUGU DIGIT SEVEN" gc="Nd"/>
<c cp="0C6E" name="TELUGU DIGIT EIGHT" gc="Nd"/>
<c cp="0C6F" name="TELUGU DIGIT NINE" gc="Nd"/>
<c cp="0CE6" name="KANNADA DIGIT ZERO" gc="Nd"/>
<c cp="0CE7" name="KANNADA DIGIT ONE" gc="Nd"/>
<c cp="0CE8" name="KANNADA DIGIT TWO" gc="Nd"/>
<c cp="0CE9" name="KANNADA DIGIT THREE" gc="Nd"/>
<c cp="0CEA" name="KANNADA DIGIT FOUR" gc="Nd"/>
<c cp="0CEB" name="KANNADA DIGIT FIVE" gc="Nd"/>
<c cp="0CEC" name="KANNADA DIGIT SIX" gc="Nd"/>
<c cp="0CED" name="KANNADA DIGIT SEVEN" gc="Nd"/>
<c cp="0CEE" name="KANNADA DIGIT EIGHT" gc="Nd"/>
<c cp="0CEF" name="KANNADA DIGIT NINE" gc="Nd"/>
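
As a cross-check that needs no download, a similar listing can be sketched in Python using the standard unicodedata module. This is a swapped-in alternative, not the XQuery approach above. The helper name chars_in_category is made up for illustration, and the results reflect whatever Unicode version your Python build ships with, which is not necessarily 6.0.0.

```python
import sys
import unicodedata


def chars_in_category(prefix):
    """Yield (code point, general category, name) for every character whose
    general category starts with `prefix`, e.g. 'N' or 'Nd'.

    Hypothetical helper for illustration; it relies on the Unicode data
    bundled with the running Python, not on ucd.all.flat.xml.
    """
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        gc = unicodedata.category(ch)
        if gc.startswith(prefix):
            # Fall back to an empty name for characters without one.
            yield cp, gc, unicodedata.name(ch, "")


if __name__ == "__main__":
    # Passing "N" covers Nd, Nl and No, which is what \p{N} matches.
    for cp, gc, name in chars_in_category("N"):
        print(f'<c cp="{cp:04X}" gc="{gc}" name="{name}"/>')
```

Passing "Nd" instead of "N" restricts the listing to decimal digits, as in the sample above.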

Michael Kay
Saxonica
Received on Friday, 31 December 2010 10:18:36 UTC
