[Bug 8744] New: Regex characters classes C, L, M, etc

From: <bugzilla@wiggum.w3.org>
Date: Thu, 14 Jan 2010 12:38:59 +0000
To: www-xml-schema-comments@w3.org
Message-ID: <bug-8744-703@http.www.w3.org/Bugs/Public/>
http://www.w3.org/Bugs/Public/show_bug.cgi?id=8744

           Summary: Regex characters classes C, L, M, etc
           Product: XML Schema
           Version: 1.0/1.1 both
          Platform: PC
        OS/Version: Windows NT
            Status: NEW
          Severity: normal
          Priority: P2
         Component: Datatypes: XSD Part 2
        AssignedTo: David_E3@VERIFONE.com
        ReportedBy: mike@saxonica.com
         QAContact: www-xml-schema-comments@w3.org
                CC: cmsmcq@blackmesatech.com


The specification states:

<quote>
[Definition:]  [Unicode Database] specifies a number of possible values for the
"General Category" property and provides mappings from code points to specific
character properties.  The set containing all characters that have property X,
can be identified with a category escape  \p{X} .  The complement of this set
is specified with the category escape  \P{X} .  ( [\P{X}] = [^\p{X}] ).
</quote>

It then gives a table purporting to show the values of "General Category" that
occur in Unicode 5.1. This includes single-character categories such as "C",
"L", and "M". As far as I can see, however, Unicode only defines the
two-character categories such as Ll, Lu, Mc and so on. The single-character
categories are an invention of the regex language, and therefore need to be
described in our specification, rather than by reference to Unicode.

There are two possible definitions of these categories, which give different
results.

At least one XML Schema implementation has interpreted the single-character
category X as the union of all two-character categories starting with X; for
example, C is the union of Cc, Cf, Co, and Cn. Another interpretation (the one
used by the Java regex library) is that X is the set of all characters listed
in the Unicode database as belonging to a category starting with that letter.
The two interpretations give different results for category C, because Cn is
the set of characters that are not listed in the relevant section of the
Unicode database: such characters belong to C under the first interpretation
but not under the second.
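
To make the difference concrete, here is a minimal sketch of the two
interpretations in Python, using the standard unicodedata module to look up
General Category values. The function names in_prefix_union and in_listed_only
are hypothetical, chosen only for this illustration; they do not come from the
specification or from any implementation. The sketch also ignores the fact that
XML Schema excludes Cs and restricts matching to legal XML characters. U+FFFF
is used as the sample because it is a noncharacter and so is never listed in
the database.

```python
import unicodedata


def in_prefix_union(ch: str, letter: str) -> bool:
    """Interpretation 1 (hypothetical helper): the single-letter category is
    the union of every two-letter category starting with that letter,
    so C includes Cn."""
    return unicodedata.category(ch).startswith(letter)


def in_listed_only(ch: str, letter: str) -> bool:
    """Interpretation 2 (hypothetical helper): only characters actually
    listed in the Unicode database count. Cn is the category reported for
    unlisted code points, so it is excluded."""
    category = unicodedata.category(ch)
    return category != "Cn" and category.startswith(letter)


if __name__ == "__main__":
    unlisted = "\uffff"  # noncharacter: reported as Cn, never listed
    tab = "\u0009"       # control character: listed as Cc
    for name, ch in (("U+FFFF", unlisted), ("U+0009", tab)):
        print(name, unicodedata.category(ch),
              in_prefix_union(ch, "C"), in_listed_only(ch, "C"))
```

For a listed character such as U+0009 the two functions agree; they differ
only on code points whose category is Cn.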


Received on Thursday, 14 January 2010 12:39:00 GMT