- From: Michael Kay <mike@saxonica.com>
- Date: Fri, 31 Dec 2010 10:18:07 +0000
- To: "Costello, Roger L." <costello@mitre.org>
- CC: "xmlschema-dev@w3.org" <xmlschema-dev@w3.org>
On 30/12/2010 19:32, Costello, Roger L. wrote:
> Hi Folks,
>
> The specification says that this regex matches any numeric character in any language:
>
>     \p[N]
>
You mean \p{N}

> Can you please provide me with examples of other characters that match?
>
Download and unzip http://www.unicode.org/Public/6.0.0/ucdxml/ucd.all.flat.zip into say e:\temp\ucd.all.flat.xml. Then run the following query to find all characters in category Nd (or mutatis mutandis, any other category; for a single-letter category like 'N', match all entries where @gc starts with N)

for $c in doc('ucd.all.flat.xml')//*:char[@gc='Nd']
return <c cp="{$c/@cp}" gc="{$c/@gc}" name="{$c/@na}"/>

for example by using the command

java -Xmx1024m -cp e:\saxon\saxon9he.jar net.sf.saxon.Query -t -q:e:\temp\test.xq !indent=yes

(it's a large file so it needs the extra memory).

Here's a sample from the output:

<c cp="0AE6" name="GUJARATI DIGIT ZERO" gc="Nd"/>
<c cp="0AE7" name="GUJARATI DIGIT ONE" gc="Nd"/>
<c cp="0AE8" name="GUJARATI DIGIT TWO" gc="Nd"/>
<c cp="0AE9" name="GUJARATI DIGIT THREE" gc="Nd"/>
<c cp="0AEA" name="GUJARATI DIGIT FOUR" gc="Nd"/>
<c cp="0AEB" name="GUJARATI DIGIT FIVE" gc="Nd"/>
<c cp="0AEC" name="GUJARATI DIGIT SIX" gc="Nd"/>
<c cp="0AED" name="GUJARATI DIGIT SEVEN" gc="Nd"/>
<c cp="0AEE" name="GUJARATI DIGIT EIGHT" gc="Nd"/>
<c cp="0AEF" name="GUJARATI DIGIT NINE" gc="Nd"/>
<c cp="0B66" name="ORIYA DIGIT ZERO" gc="Nd"/>
<c cp="0B67" name="ORIYA DIGIT ONE" gc="Nd"/>
<c cp="0B68" name="ORIYA DIGIT TWO" gc="Nd"/>
<c cp="0B69" name="ORIYA DIGIT THREE" gc="Nd"/>
<c cp="0B6A" name="ORIYA DIGIT FOUR" gc="Nd"/>
<c cp="0B6B" name="ORIYA DIGIT FIVE" gc="Nd"/>
<c cp="0B6C" name="ORIYA DIGIT SIX" gc="Nd"/>
<c cp="0B6D" name="ORIYA DIGIT SEVEN" gc="Nd"/>
<c cp="0B6E" name="ORIYA DIGIT EIGHT" gc="Nd"/>
<c cp="0B6F" name="ORIYA DIGIT NINE" gc="Nd"/>
<c cp="0BE6" name="TAMIL DIGIT ZERO" gc="Nd"/>
<c cp="0BE7" name="TAMIL DIGIT ONE" gc="Nd"/>
<c cp="0BE8" name="TAMIL DIGIT TWO" gc="Nd"/>
<c cp="0BE9" name="TAMIL DIGIT THREE" gc="Nd"/>
<c cp="0BEA" name="TAMIL DIGIT FOUR" gc="Nd"/>
<c cp="0BEB" name="TAMIL DIGIT FIVE" gc="Nd"/>
<c cp="0BEC" name="TAMIL DIGIT SIX" gc="Nd"/>
<c cp="0BED" name="TAMIL DIGIT SEVEN" gc="Nd"/>
<c cp="0BEE" name="TAMIL DIGIT EIGHT" gc="Nd"/>
<c cp="0BEF" name="TAMIL DIGIT NINE" gc="Nd"/>
<c cp="0C66" name="TELUGU DIGIT ZERO" gc="Nd"/>
<c cp="0C67" name="TELUGU DIGIT ONE" gc="Nd"/>
<c cp="0C68" name="TELUGU DIGIT TWO" gc="Nd"/>
<c cp="0C69" name="TELUGU DIGIT THREE" gc="Nd"/>
<c cp="0C6A" name="TELUGU DIGIT FOUR" gc="Nd"/>
<c cp="0C6B" name="TELUGU DIGIT FIVE" gc="Nd"/>
<c cp="0C6C" name="TELUGU DIGIT SIX" gc="Nd"/>
<c cp="0C6D" name="TELUGU DIGIT SEVEN" gc="Nd"/>
<c cp="0C6E" name="TELUGU DIGIT EIGHT" gc="Nd"/>
<c cp="0C6F" name="TELUGU DIGIT NINE" gc="Nd"/>
<c cp="0CE6" name="KANNADA DIGIT ZERO" gc="Nd"/>
<c cp="0CE7" name="KANNADA DIGIT ONE" gc="Nd"/>
<c cp="0CE8" name="KANNADA DIGIT TWO" gc="Nd"/>
<c cp="0CE9" name="KANNADA DIGIT THREE" gc="Nd"/>
<c cp="0CEA" name="KANNADA DIGIT FOUR" gc="Nd"/>
<c cp="0CEB" name="KANNADA DIGIT FIVE" gc="Nd"/>
<c cp="0CEC" name="KANNADA DIGIT SIX" gc="Nd"/>
<c cp="0CED" name="KANNADA DIGIT SEVEN" gc="Nd"/>
<c cp="0CEE" name="KANNADA DIGIT EIGHT" gc="Nd"/>
<c cp="0CEF" name="KANNADA DIGIT NINE" gc="Nd"/>

Michael Kay
Saxonica
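
As a rough cross-check that does not need the UCD download, the same listing can be approximated with Python's standard unicodedata module. This is only a sketch and not part of the Saxon approach described above: the helper name chars_in_category is made up for illustration, and the Unicode version bundled with Python is normally newer than 6.0, so it may report more characters than the 6.0.0 file does. Passing "N" rather than "Nd" gives the prefix match on the general category that \p{N} implies (Nd, Nl and No).

```python
import sys
import unicodedata


def chars_in_category(prefix):
    """Yield (hex code point, general category, name) for every character
    whose general category starts with `prefix`.

    Hypothetical helper for illustration: 'Nd' selects decimal digits only,
    while 'N' selects Nd, Nl and No together.
    """
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        gc = unicodedata.category(ch)
        if gc.startswith(prefix):
            # Characters without a name get an empty string instead of an error.
            yield f"{cp:04X}", gc, unicodedata.name(ch, "")


if __name__ == "__main__":
    # Print in the same shape as the query output shown in the message.
    for cp, gc, name in chars_in_category("Nd"):
        print(f'<c cp="{cp}" gc="{gc}" name="{name}"/>')
```

Run as a script, it prints one <c .../> line per decimal digit; changing the argument to "N" also lists letter-like numerals (Nl) and other numerics (No).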
Received on Friday, 31 December 2010 10:18:36 UTC